    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      1. General Statements

      In this study, we mechanistically define a new molecular interaction linking two of the cell's major morphological regulatory pathways: the Rho GTPase and Hippo signaling networks. Both pathways are essential across large swaths of the tree of life. They are required for the dynamic organization and reorganization of proteins, lipids, and genetic material that occurs in essential cellular processes such as division, motility, and differentiation. For decades these pathways have been studied almost exclusively in isolation, although they are known to act in concert in cancer to drive cytoskeletal remodeling and morphological changes that promote proliferation and metastasis. Mechanistic insight into how they are coordinated is lacking.

      Our data reveal a mechanistic model where coordination is mediated by the RhoA GTPase-activating protein ARHGAP18, which forms molecular interactions with both the tumor suppressor Merlin (NF2) and the transcriptional co-regulator YAP (YAP1). Using a combination of state-of-the-art super-resolution microscopy (STORM, SORA-confocal) in cultured human cells, biochemical pulldown assays with purified proteins, and analyses of tissue-derived samples, we characterize ARHGAP18's function from the molecular to the tissue level in both native and cancer model systems.

      Together, these findings establish a previously unrecognized molecular connection between the RhoA and Hippo pathways and culminate in a working model that integrates our current results with prior work from our group and decades of prior studies. This model provides a new conceptual framework for understanding how RhoA and Hippo signaling are coordinated to regulate cell morphology and tumor progression in human cells.

      In this substantially revised manuscript, we have addressed all comments from the expert reviewers, as described point-by-point below. A shared major comment from the reviewers was the request for direct evidence of the proposed mechanistic model. To address these constructive comments, we have added new experiments, new quantification, new text, and new control data. We have also added two expert authors, who contributed super-resolution mouse tissue imaging data for the study of endogenous ARHGAP18 in its native condition. We believe that these additions greatly enhance the manuscript and address the overall message of the reviewers' collective comments.

      2. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This manuscript describes a dual mechanism by which ARHGAP18 regulates the actin cytoskeleton. The authors propose that in addition to the known role for ARHGAP18 in regulating Rho GTPases, it also affects the cytoskeleton through regulation of the Hippo pathway transcriptional regulator YAP. ARHGAP18 knockout Jeg3 cells were generated and show a clear loss of basal stress fiber like F-actin bundles. The authors further characterize the effects of ARHGAP18 knockout and overexpression. It is also discovered that ARHGAP18 binds to the Hippo pathway regulator Merlin and to YAP. Ultimately it is concluded that ARHGAP18 regulates the F-actin cytoskeleton through dual regulation of RHO GTPases and of YAP. While the phenotype of the ARHGAP18 knockout and the association of ARHGAP18 with Merlin and YAP is interesting, I found the authors' conclusion that these phenotypes are due to ARHGAP18 regulation of both RHO and YAP to be based on largely correlative evidence and sometimes lacking in controls or tests for significance. In addition, the authors often make overly strong conclusions based on the experimental evidence. In some instances, the rationale for how the experimental results support the conclusion is insufficiently articulated, making evaluation challenging. In general, although the authors have some interesting observations, more definitive experiments with proper controls and statistical tests for significance and reproducibility are needed to justify their overall conclusions.

      We appreciate the reviewer's constructive comments and have added substantial new data and quantifications to address their concerns. We have focused these new data on directly testing the proposed mechanisms, adding controls, and performing quantitative analysis with statistical testing. Additionally, we have edited our language to make our rationale clearer and to present our conclusions as a more moderate assessment of our experimental results. Below we respond to the specific comments made by the reviewer, followed by a list of additional editorial changes we've made based on the reviewer's overarching comments on clarity and rationale.

      Specific Comments

      1) The authors make a big point about the effects of ARHGAP18 on myosin light chain phosphorylation. However, this result is not quantified and tested for statistical significance and reproducibility.

      We thank the reviewer for their comments on our western blot quantification. The original submission included quantification of the RhoA downstream signaling readouts pCofilin/Cofilin and pLIMK/LIMK. We had withheld the pMLC/MLC quantification because that result was previously published, with quantification, reproducibility, and statistical significance, in our prior manuscript on ARHGAP18 in eLife in 2024 (Fig. 4E of https://doi.org/10.7554/eLife.83526). However, those prior results lacked the new overexpression data. We recognize the need to add these data to this manuscript as requested by the reviewer.

      To address the reviewer's comment, we have added quantification of pMLC/MLC (Fig. 1F).
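
      For readers unfamiliar with this kind of quantification, the following is a minimal sketch of how a phospho/total ratio such as pMLC/MLC can be normalized to a control lane across replicate blots. It is an illustration only and not the analysis pipeline used for Fig. 1F; the function names and all band intensities are hypothetical.

```python
from statistics import mean, stdev


def phospho_ratio(phospho, total):
    """Return the phospho/total band-intensity ratio for one lane."""
    if total <= 0:
        raise ValueError("total-protein intensity must be positive")
    return phospho / total


def normalize_to_control(replicates, control="WT"):
    """Divide each condition's ratio by the control ratio, blot by blot.

    `replicates` is a list of dicts mapping condition -> (phospho, total).
    Returns a dict mapping condition -> list of fold changes, one per blot.
    """
    fold = {}
    for blot in replicates:
        reference = phospho_ratio(*blot[control])
        for condition, (phospho, total) in blot.items():
            ratio = phospho_ratio(phospho, total)
            fold.setdefault(condition, []).append(ratio / reference)
    return fold


# Hypothetical band intensities (arbitrary units); NOT data from the manuscript.
blots = [
    {"WT": (100.0, 200.0), "KO": (150.0, 200.0)},
    {"WT": (80.0, 160.0), "KO": (132.0, 176.0)},
    {"WT": (120.0, 240.0), "KO": (180.0, 200.0)},
]

fold = normalize_to_control(blots)
for condition, values in fold.items():
    print(condition, round(mean(values), 3), round(stdev(values), 3))
```

      Normalizing within each blot before averaging keeps blot-to-blot differences in exposure or transfer from inflating the variance between replicates.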

      2) Along similar lines in Figure 2C they state that overexpression of ARHGAP18 causes cells to invade over the top of their neighbors. This might be true and interesting, but only a single cell is shown and there is no quantification or controls for simply overexpressing something in that cell. The authors also conclude from this image that the overexpression phenotype is independent of its GAP activity on Rho. It is not clear how this conclusion is made based on the data. It would seem like a more definitive experiment would be to see if a similar phenotype was induced by an ARHGAP18 mutant deficient in GAP activity.

      Based on the reviewer's comment, we recognize that the qualitative statements made about Figure 2C (now Figure 3) should have been more quantitative. We have added a control of WT Jeg3 cells expressing an empty Flag vector to show that WT cells do not invade over the top of each other (Fig. 3F). Additionally, we have added the quantification in Fig. 3E, which shows the percentage of invasive versus non-invasive cells in WT and ARHGAP18-overexpressing cells. We have revised our conclusions to make clear that these data do not directly test whether the invasive phenotype derives from a Rho-independent mechanism. The text now states the following conclusion alongside others, which can be seen in our tracked changes:


      "These data support the conclusion that ARHGAP18 acts to regulate basal and junctional actin. However, it was not clear whether this activity occurred through a Rho-independent or a Rho-dependent mechanism."

      We have added new data from cells expressing an ARHGAP18 mutant deficient in GAP activity, which is explained in detail in our response to the next comment.

      3) In Figure 3 the authors compare gene expression profiles of ARHGAP18 knockout cells to wild-type cells. They see lots of differences in focal adhesion and cytoskeletal proteins and conclude that this supports their conclusion that ARHGAP18 is not just acting through RHO. The rationale for this is not clear. In addition, they observe changes in expression profiles consistent with changes in YAP activity. They conclude that the effects are direct. This very well might be true. However, RHO is a potent regulator of YAP activity and the results seem quite consistent with ARHGAP18 acting through RHO to affect YAP.

      We thank the reviewer for their comment and believe that the revised manuscript, through edited text and the incorporation of new data, now presents direct evidence to support its conclusions.

      First, the reviewer highlighted that we were not clear in our rationale and explanation of the conclusions drawn from our RNAseq data in the new Figure 4 (previously Figure 3). We agree with the reviewer that the RNAseq data alone are not sufficient to conclude that ARHGAP18 is acting through YAP directly. In the revised manuscript, the conclusion is now based on the combination of our multi-faceted investigation of the relationship between ARHGAP18 and YAP (most importantly, new Figure 5). We do wish to note that our RNAseq analysis is more robust and specific than a descriptive assay reporting many differences in cytoskeletal proteins. We recruited an outside RNAseq expert collaborator, Dr. Yongho Bae, to perform state-of-the-art IPA analysis and a thorough manual curation of the top hit genes to identify the predominant signaling pathways linking the loss of ARHGAP18 to known YAP translational products. We've provided a supplemental table listing each citation supporting the identified YAP pathway associations from this manual curation. We have also added a new discussion paragraph to clarify our specific RNAseq results and analysis. In the revised manuscript, we have moderated our language in the results text regarding the RNAseq data to reflect the reviewer's suggestion:


      "Our RNAseq data alone could not independently confirm if the alterations to transcriptional signaling and expression of actin cytoskeleton proteins were through a Rho-dependent or Rho-independent mechanism."


      Second, in this comment and the one above, the reviewer highlights the need for a new experiment to directly test the Rho-independent effects of ARHGAP18, which we now provide in the new Figure 5. For these new data, we applied an experimental design suggested by reviewer 2 regarding the same concern. In short, we have produced and expressed a point-mutant variant, ARHGAP18(R365A), which abolishes the Rho GAP activity while leaving the remainder of the protein intact. This construct allows us to directly test the effects of ARHGAP18 independent of its RhoA GAP activity. We find that the GAP-deficient ARHGAP18 is able to fully rescue basal focal adhesions, indicating that the basal actin phenotype is at least in part regulated through a Rho-independent mechanism.

      We believe the revised manuscript, when taken in totality, provides the definitive proof requested by the reviewer. Specifically, the combination of Figure 5, where we show new data using the ARHGAP18(R365A) variant, and the result that ARHGAP18 forms a stable complex with YAP (Fig. 6G) or Merlin (Fig. 6A), is supportive of direct Rho-independent molecular interactions between YAP, Merlin, and ARHGAP18.

      4) In Figure 4A showing Merlin binding to ARHGAP18 there is no control for the amount of Merlin sticking to the column as was done in Figure 4F for binding experiments with YAP. This makes it difficult to determine the significance of the observed binding.

      We have performed the requested control experiment and added the results to Figure 6A.

      5) The images in Figure 4C show YAP being maintained in the nucleus more in ARHGAP18 knockout cells compared to wild-type. However, the images only show a few cells and YAP localization can be highly variable depending on where you look in a field. Images with more cells and some sort of quantification would bolster this result.

      We have provided quantification (Figure 6D) of what was originally Figure 4C (now Figure 6C).

      Reviewer #1 (Significance (Required)):

      While the phenotype of the ARHGAP18 knockout and the association of ARHGAP18 with Merlin and YAP is interesting, I found the authors conclusion that these phenotypes are due to ARHGAP18 regulation of both RHO and YAP to be based on largely correlative evidence and sometimes lacking in controls or tests for significance. In addition the authors often make overly strong conclusions based on the experimental evidence. In some instances, the rationale for how the experimental results support the conclusion is insufficiently articulated, making evaluation challenging. In general although the authors have some interesting observations, more definitive experiments with proper controls and statistical tests for significance and reproducibility are needed to justify their overall conclusions.

      In the above comments, we detail the specific definitive experiments, proper controls, and statistical tests for significance, requested by the reviewer, which we believe greatly strengthen our manuscript.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      This manuscript investigates the Rho effector, ARHGAP18, in Jeg3 cells, a trophoblastic cell line. It presents a number of new pieces of data, which increase our understanding of the importance of this GAP on cell function and explains at a molecular level previous results of other workers in the field. ARHGAP18 was originally given the name "conundrum" and continues to stand apart from the majority of other GAP proteins and their functions. Hence the data here is significant and of high standard.

      The data is clear, and the images are of high quality and extremely impressive in their resolution. It is significant and adds a further layer to our understanding of the regulation of cell migration, particularly in the formation and resolution of microvilli.


      We appreciate the reviewer's comments and supportive insights.

      The data is based on the use of the cell line Jeg3. Even the authors' previous publication in eLife is based only on this cell line. They need to show the conclusions are general and not specific to this line of cells. As an extension of this, is the ARHGAP18 function shown here only in transformed cells? Do the same mechanisms operate in normal cells, which respond to activation to proliferate or migrate?

      We respectfully point out that the critical experiments of the prior eLife publication were validated in DLD-1 colorectal cells and not in Jeg3 cells alone (Figure 1-figure supplement 2). Our newly independent lab, established just over a year ago, is unable to perform a full expansion of the manuscript using untransformed cells. However, we agree with the reviewer's perspective and wish to address the comment to the best of our current capability. To address the reviewer's suggestions, we have recruited Dr. Christine Schaner Tooley, an expert in mouse model system studies. In the revised manuscript, we have added new super-resolution SORA confocal images of endogenous ARHGAP18 localization in intact intestinal villi tissue and apical junctions of WT mice (Fig. 1A-C). These data indicate that endogenous ARHGAP18 is enriched (but not exclusively localized) at the apical plasma membranes of normal WT epithelial cells. This localization, where both Merlin and Ezrin are present at apical membranes/junctions under normal conditions, is a major component of the working model proposed in Fig. 7. These data also indicate that ARHGAP18 is capable of entering the nucleus in WT cells, another critical aspect of our proposed model. Collectively, our previously published DLD-1 studies and our new studies using WT mouse tissue samples support the conclusion that at least some of ARHGAP18's functions described in this manuscript are not limited to Jeg3 cells.

      In endothelial cells, Lovelace et al 2017 showed localization to microtubules and that depletion of ARHGAP18 resulted in microtubule instability. The authors may like to comment on the differences. Is this a cell type difference or RhoA versus RhoC difference?

      In our previous publication (Lombardo et al., 2024), we validated the finding that ARHGAP18 forms a complex with microtubules, as we detected tubulin in the ARHGAP18 pulldown experiment (Figure 1- Source Data). However, our data indicate that in Jeg3 cells ARHGAP18 does not localize to the same microtubule-associated spheres observed in the Lovelace publication. We now comment on the shared conclusions and differences between this manuscript and Lovelace et al., 2017 in the discussion section:

      "In endothelial cells, ARHGAP18 has been reported to localize to microtubules and to play a role in maintaining proper microtubule stability (Lovelace et al., 2017). In our epithelial cell culture models and WT mouse intestine, we have been unable to detect ARHGAP18 at microtubules, suggesting ARHGAP18 may have additional functions in various cell types."

      On pages 7,9 they conclude that MLC and basal and junctional actin are regulated through a GAP independent mechanism. The best way to show this is with overexpression of a GAP mutant.

      We appreciate the reviewer's insight and have produced and expressed a GAP mutant, ARHGAP18(R365A), in our cells, directly testing our conclusion that ARHGAP18 has a GAP-independent function. These data are now presented in revised Figure 5 and explained further in response to reviewer #1.

      There is a huge amount of data presented in Figure 3, but their 2 genes which they focus on, LOP1 and CORO1A, are discussed but no actual data presented in support.

      We now validate CORO1A by qPCR in Figure 4J.


      Reviewer #2 (Significance (Required)):

      The data is significant and adds a further layer to our understanding of the regulation of cell migration, particularly in the formation and resolution of microvilli. This manuscript will be of significance to a basic science audience in the field of RhoGTPases and cell migration.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The study by Murray et al explores the effects of ARHGAP18 on the actin cytoskeleton, Rho effector kinases, non-muscle myosin, and transcription. Using super resolution microscopy, they show that in ARHGAP18 KO cells there is a mixed and unexpected cytoskeleton phenotype where myosin phosphorylation appears to be increased, but actin is disorganised with reduced stress fibres, diminished focal adhesions and augmented invasiveness. They conclude that the underlying mechanisms are likely independent from RhoA. Next, they perform RNAseq using the KO cells and identify an array of dysregulated genes, including those that play crucial roles in microvilli (related to previously published findings). Analysis of the data identify gene expression changes that are relevant for altered focal adhesion (integrins). Further analysis reveals that a large cohort of the dysregulated genes are YAP targets. They then show that in ARHGAP18 KO cells YAP nuclear localization, as detected by immunostaining, is augmented; and demonstrate that immobilized ARHGAP18 protein can bind the Hippo regulator merlin as well as YAP itself.

      Major comments:

      1, The premise of the study (that ARHGAP18 is a RhoA effector or may act independently of RhoA) remains not proven.

      We have added new evidence of direct RhoA independent activity for ARHGAP18 described in the above comments. Specifically, we've added data using a RhoA-GAP dead variant of ARHGAP18 in Figure 5, which we believe addresses this comment.


      At several places (including in the title) the authors refer to ARHGAP18 as a Rho effector, which would suggest that it is downstream from Rho, but the basis for this is not clear. In fact, their own previous study suggested that ARHGAP18 is a RhoA regulator, rather than an effector. In general, the connection of the described effects to RhoA remains unclear, and not addressed in this study. The authors seem to go back and forth in their conclusions regarding the connection between ARHGAP18 and RhoA. For example, the first section of results is finished by stating (line 194): "These data support the conclusion that ARHGAP18 acts to regulate basal and junctional actin through Rho-independent mechanism". But the next section starts by stating (line 198): "We hypothesized that the invasive and cytoskeletal phenotypes observed at the basal surface of cells devoid of ARHGAP18 may be a result of changes in regulation at the transcriptional level either directly through RhoA signaling or through an additional mechanism specific to ARHGAP18". The paper would be strengthened by adding data that show whether the effects are indeed downstream from RhoA or RhoA independent. If there is no sufficient demonstration that ARHGAP18 is downstream of RhoA and is an effector, this needs to be stated explicitly, and the wording should be changed.

      We now provide new data in Figure 5, which directly tests the RhoA-independent functions of ARHGAP18 as recommended by the reviewer. Our understanding of the term effector is 'a molecule that activates, controls, or inactivates a process or action.' Based on this understanding, we used the term to convey ARHGAP18's functional role within the feedback loop, rather than to imply that it acts exclusively downstream.

      We wish to clarify our perspective on the reviewer's assertion that we go "back and forth" as to whether ARHGAP18 functions in a Rho-dependent or Rho-independent manner. It was our intent to propose a model where ARHGAP18 acts in two separate circuits that regulate cell signaling. The first circuit involves ARHGAP18's canonical RhoA GAP activity, which involves ERMs and LOK/SLK, and is limited to the apical plasma membrane. This first signaling circuit was characterized in our prior eLife manuscript (Lombardo et al., 2024) and in an earlier JCB manuscript (Zaman and Lombardo et al., 2021). In this newly revised manuscript, we provide a partial mechanistic characterization of the second circuit, which we freely admit is much more complex and will likely require additional study to fully characterize.


      As both circuits operate as signaling feedback loops, we find the terms 'upstream' and 'downstream' to be of limited value, and we attempt to avoid their use when possible. We retain their use only when referring to the Hippo and ROCK signaling cascades, where these designations are well established. We suggest that the conceptual inconsistencies of Conundrum/ARHGAP18 may have arisen from the tendency to view it in strictly binary terms as upstream or downstream. Here, we propose a third possibility that ARHGAP18 functions as both, participating in a negative feedback loop.

      In response to the reviewer's comments, we have added data testing whether the effects are Rho-independent, and we have edited the discussion text to clarify the molecular function of ARHGAP18:

      "Additionally, focal adhesions and basal actin bundles are restored to WT levels when the ARHGAP18(R365A) GAP-ablated mutant is expressed in ARHGAP18 KO cells (Fig. 5A, B). These results represent the strongest argument that ARHGAP18 functions in pathways in addition to RhoA/C. Our data suggest that at least one of the alternative pathways is through ARHGAP18's interaction with YAP and Merlin. From these data we conclude that ARHGAP18 has important functions both in RhoA signaling through its GAP activity and in Hippo signaling through its GAP-independent binding partners."


      The study is descriptive and contains a series of observations that are not connected. Because of this, the study's conclusions are not well supported, and key mechanistic insight is limited. The study feels like a set of separate observations, that remain incompletely worked out and have some preliminary feel to them. The model in the last figure also seems to contain hypotheses based on the observations, several of which remain to be proven.

      We present our revised manuscript, in which we have more clearly outlined our rationale and conclusions, as detailed in the above responses, to emphasize the overall connectivity of the study. We have also updated the title of Figure 7 to read "Theoretical Model of ARHGAP18's coordination of RhoA and Hippo signaling pathways in Human epithelial cells" to make it clear that we are presenting a working model with elements that will require additional investigation. Throughout the manuscript, we highlight the unknown elements that remain to be tested and other outstanding questions. We do not aim to characterize this complex signaling coordination completely. Instead, this manuscript represents the third iteration in our systematic advances to describe this entirely new signaling pathway. We agree that, despite three separate manuscripts (this one included) to date, this work represents an early stage in understanding the system, and many additional studies will be needed to characterize it fully. Figure 7 is presented as a working model that results from a thoughtful combination of our collective data and that of other researchers, derived from numerous species across decades of study. We firmly believe that proposing such integrative models is valuable for advancing the field. We also recognize the importance of clearly indicating which aspects remain hypothetical. We now explicitly note in several places within the discussion which components of the model will require further validation and experimental confirmation. For example, regarding our theoretical mechanism in Figure 7 we state:

      "Validation of the direct mechanism by which YAP/TAZ transcriptional changes drive basal actin changes in ARHGAP18 KO cells will require further investigation based on predictions from RNAseq results."


      Addressing any possible connection between key effects of ARHGAP18 KO (changes in actin, focal adhesion, integrins, Yap and merlin binding) could strengthen the manuscript. One such specific question is whether the changes in integrin expression (RNAseq) are indeed connected to the actin alterations and reduction in focal adhesions (Fig 1). Staining for these integrins to show they are indeed altered, and/or manipulating any of them to reproduce changes could provide an exciting addition.

      We attempted to stain cells for integrins using three separately purchased antibodies. However, despite extensive optimization and careful selection of the specific integrins using our RNAseq results, we were unable to get any of these antibodies to work in any cell type or condition. We believe that there is a technical challenge to staining for integrins due to their transmembrane and extracellular components, which we were unable to overcome. To address the reviewer's comment, we instead stained cells for paxillin, which directly binds the cytoplasmic tails of integrins (Figs. 3 and 5).

      Some of the experimental findings are not convincing or lack controls. Fig 1: some of the western blots are not convincing or poor quality. [...] On the same figure, the quality of LIM kinase blots is poor. [...] The signal is weak, and the blot does not appear to support the quantification. The last condition (expression of flag-ARHGAP18) results in a large drop in pLIMK and pcofilin on the blot, which is not reflected by the graph. Addition of a better blot and the use of strong positive or negative control would boost confidence in these data.


      In response to this and other reviewers' comments, we have added new western data and quantification to Figure 1. We now focus on MLC/pMLC data as we believe these data highlight the potential Rho-independent mechanism of ARHGAP18, and we were able to greatly improve the quality of the blots through careful optimization. We hope the reviewer finds these blots and quantifications (Fig. 1E and F) more convincing.

      We note that phospho-specific western blotting presents considerably greater technical challenges than conventional blotting. We believe that an attractive-looking blot does not always correlate with quality or reproducibility, and we have focused on taking extraordinarily careful steps in blotting with our phospho-specific antibodies, which at times comes at the cost of the blot's appearance. For example, all phospho-specific blots are run using two-color fluorescent detection to probe for both the total protein and the phospho-protein on the same blot. This approach often leads to blots that have reduced signal-to-noise compared to chemiluminescent westerns. Additionally, we use phospho-specific blocking buffer reagents, which do not contain phosphate-based buffers or agents that attract non-specific phospho-staining signals. These blocking buffers are not as effective as non-fat milk in PBS at blocking background signal; however, they are ultimately cleaner for phospho-specific primary antibodies. We use carefully optimized protocols, from cell treatment to lysis, transfer, and antibody incubation, including methods developed by laboratories where the corresponding author of the manuscript was trained. Nonetheless, despite these efforts, we have now removed the LIMK and cofilin data because we deemed them unnecessary for the main conclusions of this manuscript and were unable to improve their quality to satisfy the reviewer.

      The changes in pMLC on the western blots are very small, and for any conclusion, these studies require quantification. Further, the expression levels of Flag-ARHGAP18 need to be shown to support the statement that the protein is expressed, and indeed overexpressed under these conditions (vs just re-expressed).

      In continuation of the above comment, we have made significant effort to improve the quality of our pMLC western blots and now provide quantification in Figure 1. We also now provide the Flag-ARHGAP18 signal as requested by the reviewer.

      Fig 4: the differences in YAP nuclear localization under the various conditions are not well visible. Quantitation of nuclear/cytosolic signal ratio should be provided. Please provide a rationale and more context for using serum starvation and re-addition. What is the expected effect? Serum removal and addition is referred to as nutrient removal and re-addition, but this is inaccurate, as it does not equal nutrient removal, since serum contains a variety of other important components, e.g. growth factors too.

      We have provided new quantification of the nuclear/cytosolic signal ratio in Figure 6D. We have explained our rationale for the study through the following new text:

      "Merlin is activated and localized to junctions upon signaling, promoting growth and proliferation; among these signals is the availability of growth factors and other components of serum (Bretscher et al., 2002). We hypothesized that, since ARHGAP18 formed a complex with Merlin, ARHGAP18 may localize to junctions under conditions which promote Merlin activation."

      We have altered our use of "nutrient removal" to "serum removal".
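
      As an illustration of what a nuclear/cytosolic signal ratio measures, the following is a minimal sketch for a single cell. It is not the analysis used for Figure 6D; it assumes that a nuclear mask (for example, from a DNA stain) and a whole-cell mask are already available, and all names and pixel values are hypothetical.

```python
def mean_intensity(image, mask):
    """Mean pixel intensity of `image` over pixels where `mask` is True."""
    values = [
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, inside in zip(image_row, mask_row)
        if inside
    ]
    if not values:
        raise ValueError("mask selects no pixels")
    return sum(values) / len(values)


def nuclear_cytoplasmic_ratio(signal, nucleus_mask, cell_mask, background=0.0):
    """Background-subtracted nuclear / cytoplasmic mean intensity for one cell.

    Cytoplasm is defined as pixels inside the cell mask but outside the nucleus.
    """
    cytoplasm_mask = [
        [in_cell and not in_nucleus for in_cell, in_nucleus in zip(cell_row, nuc_row)]
        for cell_row, nuc_row in zip(cell_mask, nucleus_mask)
    ]
    nuclear = mean_intensity(signal, nucleus_mask) - background
    cytoplasmic = mean_intensity(signal, cytoplasm_mask) - background
    if cytoplasmic <= 0:
        raise ValueError("cytoplasmic signal is not above background")
    return nuclear / cytoplasmic


# Toy 4x4 "image" with hypothetical intensities; NOT data from the manuscript.
yap = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
T, F = True, False
nucleus = [
    [F, F, F, F],
    [F, T, T, F],
    [F, T, T, F],
    [F, F, F, F],
]
cell = [[T] * 4 for _ in range(4)]

ratio = nuclear_cytoplasmic_ratio(yap, nucleus, cell, background=5.0)
print(ratio)
```

      Reporting one ratio per cell, rather than one per field, is what allows the variability in YAP localization across a field to be shown and tested statistically.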

      The binding between ARHGAP18 and merlin is interesting, but a key limitation is the use of expressed proteins. Can the binding be shown for the endogenous proteins (IP, colocalization). Another important unaddressed question is the relevance of this binding, and the relation of this to altered YAP nuclear localization.

      Our data in Fig. 6G show binding of resin-bound human ARHGAP18 to endogenous YAP from human cells, as suggested by the reviewer. In Fig. 6A, we chose to use GFP-Merlin because Merlin shares approximately 60% sequence identity with Ezrin, Radixin, and Moesin (ERMs). Their similarity is such that Merlin was named for Moesin-Ezrin-Radixin-Like Protein. In our experience, nearly all Merlin or ERM antibodies have some cross-contaminating signal. Thus, a major concern is that if we were to blot for endogenous Merlin in the pull-down experiment, we may see a band that could in fact be ERMs. To avoid this, we tagged Merlin with GFP to ensure that the product pulled down by ARHGAP18 was Merlin, not an ERM. Regarding the resin-bound ARHGAP18 column, our homemade ARHGAP18 antibody is polyclonal. We have extensive experience in pulldown assays and have found that the binding of a polyclonal antibody to the bait protein can produce less accurate results, as the binding site for the antibody is unknown and can sterically hinder attachment of target proteins like Merlin. In our experience, attachment to a flag-tag, which is expressed after a flexible linker at the N- or C-terminus, allows us to overcome this limitation, and this is the approach we've used in this manuscript.

      Minor comments:

      Introduction line 99: "When localized to the nucleus, YAP/TAZ promotes the activation of cytoskeletal transcription factors associated with cell proliferation and actin polymerization" Please clarify what you mean by this statement, which is inaccurate in its present form. Did you mean effects on transcription factors that control cytoskeletal proteins, or do you mean that Yap/Taz affect these proteins? Please also provide a reference for this.

      We've altered the sentence as suggested by the reviewer, which now reads the following:

      "When localized to the nucleus, YAP/TAZ promotes transcriptional changes associated with cell proliferation and actin polymerization."

      The full mechanism for how YAP/TAZ promotes proliferation and actin polymerization is a currently debated issue. We do not think introducing the various current proposed models is required for this manuscript, and we simply intend to convey that when in the nucleus, YAP/TAZ promotes transcriptional changes that drive actin polymerization and cell proliferation.

      -What is the cell confluence in these experiments? For epithelial cells confluence affects actin structure. Please comment on similarity of confluency across experimental conditions?

      All cellular experiments are paired: WT and ARHGAP18 KO cells are plated at the same time under identical conditions. For imaging, we plate all cells onto glass coverslips in a 6-well dish so that each condition is literally in the same cell culture plate and gets identical treatment. In our prior eLife paper studying ARHGAP18, we showed that ARHGAP18 KO cells and WT cells divide at a similar rate and have similar proliferation characteristics. The epithelial cell cultures are maintained for experiments at around 70-80% confluency. For the focal adhesion staining experiments, the confluency is slightly lower, between 50-60%, to capture the focal adhesions towards the leading edge. We have added the following new text to further describe these methods: "Cell cultures for experiments were maintained at 70%-80% confluency. For focal adhesion experiments, the cell cultures were maintained at 50%-60% confluency."

      -Fig 2 legend: please indicate that the protein detected was non-muscle myosin heavy chain (distinct from the light chain detected in Fig 1).

      We have altered the legend of the original Figure 2 (new Figure 3).

      -Line 339-340: please check the syntax of this sentence.

      -Western blot quantification: the comparison of experiments with samples run on different gels/blots requires careful normalization and experimental consistency. Please describe how this was achieved.

      We have added the following new text to further describe these methods:

      "For blots which required quantification of antibodies that were only rabbit primaries (e.g., pMLC/MLC antibodies listed above), samples were loaded onto a single gel and transferred onto a single membrane at the same time. After transfer, the membrane was cut in half and subsequent steps were done in parallel. All quantified blots were checked for equal loading using either anti-tubulin as a housekeeping protein or total protein as detected by Coomassie staining"

      Reviewer #3 (Significance (Required)):

      Rho signalling is a central regulator of an array of normal and pathological cell functions, and our understanding of the context dependent regulation of this key pathway remains very incomplete. Therefore, new knowledge on the role of specific regulators, such as ARHGAP18, is of interest to a very broad range of researchers. A further exciting aspect of this protein, that despite indications by many studies that it acts as a GAP (inhibitor) for Rho proteins, there are findings in the literature that suggest that its manipulation can affect actin in unexpected (opposite) manner. These point to possible Rho-independent roles, and warranted further in-depth exploration.

      One of the strengths of the study is that it explores possible roles of ARHGAP18 beyond RhoA and describes some new and interesting observations, which advance our knowledge. The authors use some excellent tools (e.g. ARHGAP KO cells and re-expression) and approaches (e.g. super resolution microscopy to analyze actin changes, RNAseq and bioinformatics to find genes that may be downstream from ARHGAP18). A key limitation of the study, however, is that it is not clear whether the observed findings are indeed independent from RhoA. A further limitation is that potential causal relationships between the described findings are not studied, and therefore the findings are in some cases overinterpreted, and limited mechanistic insights are provided. In some cases the exclusive use of expressed proteins is also a limitation. Finally, some of the experiments also need improvement.

      Reviewer expertise: RhoA signalling, guanine nucleotide exchange factors, epithelial biology, cell migration, intercellular junctions.

      In the above comments, we detail the new experimental data addressing the key limitations listed by Reviewer 3. We've added new data using the Rho GAP-deficient ARHGAP18(R365A) variant, which allows for the direct characterization of ARHGAP18's Rho-independent activity. We have introduced new data in WT cells studying endogenous proteins to address the limitations from expressed proteins. Finally, we have moderated our language to address overinterpretation. Collectively, we believe that our revised manuscript addresses the reviewer's constructive comments.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Dear editor and reviewers,

      We sincerely thank you for your thoughtful comments and constructive suggestions, which have greatly improved the quality and clarity of our manuscript. In response, we have implemented all requested changes, which are highlighted in yellow throughout the revised text, and updated several figures accordingly. Furthermore, we have performed all additional experiments recommended by the reviewers and incorporated the new data into the manuscript. To enhance clarity, we have also included a schematic representation of our proposed model in an additional figure, providing a concise visual summary of our findings.

      We hope that these revisions fully address all concerns raised by the reviewers and meet all the expectations for publication.

      Below, we answer the reviewers point by point (in blue).


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this paper, the authors address the important question of the role of centrosomes during neuronal development. They use Drosophila as an in vivo model. The field is somewhat unclear on the role and importance of centrosomes during neuronal development, although the current data would suggest they are dispensable for axon specification and growth. Early studies in cultured mammalian neurons showed that centrosomes are active and that their microtubules can be cut and transported into the neurites. But a study then showed that centrosomes in these cultured neurons are deactivated relatively early during neuronal development in vitro and that ablating centrosomes even when they are active had no obvious effect on axon specification and growth. Consistent with this, a study in Drosophila provided evidence that centrosomes were not active or necessary in different types of neurons. More recently, a study showed that centrosomal microtubules are dispensable for axon specification and growth in mice in vivo but are required for neuronal migration in the cerebral cortex. However, another study has linked the generation of acetylated microtubules at centrosomes with axon development. In this current study, the authors examine the effect of centrosome loss on various motor and sensory neurons and muscles mainly by examining mutants in essential centriole duplication genes. They associate axonal routing and morphology defects with centrosome loss and provide some evidence that centrosomes could still be active in the developing neurons. Overall, they conclude that centrosomes are active during at least early neuronal development and that this activity is important for proper axonal morphology and routing.

      While I think this study addresses a very interesting and important question, I think as it stands the data is not sufficient to be conclusive on a role for centrosomes during neuronal development. My biggest concern is that most phenotypes have not yet been shown to be cell autonomous, as whole animal mutants have been analysed rather than analysing the effect of cell-specific depletion, and the evidence for active centrosomes needs to be strengthened. If the authors can provide stronger evidence for a role of centrosomes in axonal development then the paper will certainly be of interest to a broad readership.

      We thank the reviewer for the clear and concise summary and fully agree that our study addresses a critical gap in understanding. Centrosomes have long been implicated in morphogenesis, yet their precise contribution to nervous system development has remained unclear. Our findings provide compelling evidence that centrosomes are indispensable for proper nervous system formation and that their absence also triggers muscular defects, highlighting their broader role in tissue organization.

      We acknowledge that the original manuscript lacked some key details; therefore, we have now strengthened our conclusions with additional experiments. Specifically, we demonstrate that these effects are cell-autonomous by using two independent RNAi lines targeted to a subset of motor neurons. Furthermore, we present new data showing that neuronal centrosomes remain active during the early stages of axonal development, emphasising their functional relevance in morphogenesis. All new experiments, figures, and corresponding text revisions are detailed below.

      Major comments 1) The sas-6 transallelic combination shows only 17% embryonic lethality compared to 50% embryonic lethality with sas-4 mutants. Given that both mutants should result in the same degree of centrosome loss (this should be quantified in sas-6 mutants) it would suggest that either sas-4 has other roles away from centrosomes or that the sas-4 mutant chromosome used in the experiment has other mutations that affect viability. The effect of picking up "second-site lethal" mutations on mutant chromosomes is common and so I would not be surprised if this is the reason for the difference in phenotypes. This can be addressed either by "cleaning up" the sas-4 mutant chromosome by backcrossing to wild-type lines, allowing recombination to occur and replace the potential second site mutations, or by using transallelic combinations of sas-4, as they did for sas-6. The "easier" option may just be to analyse all the phenotypes with the sas-6 transallelic combination.

      We appreciate this comment, as it brought to light an issue with the CRISPR line Sas-6-Δa. Upon reanalysing all the data, we determined that this line is embryonic lethal both in homozygosis and when combined with the deficiency uncovering the genomic region, Df(3R)BSC794. In contrast, Sas-6-Δb homozygotes are viable. The inconsistency between these results raised concerns about whether the Δa and Δb Sas-6 mutants carry deletions confined to the Sas-6 coding region. Although this would not hinder our cell biology analysis, it could represent a problem in viability tests. To address this, we repeated all analyses using Sas-6-Δb homozygotes and Sas-6-Δb combined with Df(3R)BSC794. These new results are more consistent and indicate that approximately 50% of Sas-6/Def individuals hatch as adults. Fig. 3 was redone and the manuscript text changed in view of these results.

      2) Using "whole animal" mutants for assessing neuronal morphology is risky due to non-cell-autonomous effects. The authors have carried out some phenotypic analysis of neurons depleted of Sas-4 by cell-specific RNAi, but I feel they need to do this for all of their analysis. This includes embryonic lethality measures, quantification of centrosome numbers, and all axonal phenotypes in Sas-4 RNAi neurons. It would also be prudent to use 2 distinct RNAi lines to help ensure any phenotypes are not off-target effects (and this may help clarify why the authors see some additional phenotypes with RNAi). Indeed, there are relatively weak phenotypes in muscles when using RNAi compared to the mutants and these potential non-cell-autonomous effects could then have a knock-on effect on neuronal morphology. If the authors were concerned that RNAi is not very efficient (explaining any potential weaker phenotypes than in mutants) the authors could examine the effectiveness of RNAi lines by analysing protein depletion by western blotting or mRNA depletion by rt-qPCR (although this has to be done in a different cell type due to the difficulty in obtaining a neuronal extract).

      We have now added a new panel to supplementary Figure 1, showing how the expression of a different Sas-4 RNAi line (2) induces similar nervous system phenotypes when expressed only in aCC, pCC and RP2 pioneer neurons (Sup. Fig. 1 M-O).

      3) When analysing centriole presence or absence it is a good idea to stain with two different centriole markers e.g. Asl and Plp. This helps rule out unspecific staining. It is clear from the images that similar sized foci can be observed outside of the cells (see Figure 5A for example), so clearly some of the foci that appear to be within the cells may also be unspecific staining.

      In a new supplementary figure, we now show that Asl and Plp colocalize and quantify how often this colocalization occurs in neurons (Suppl. Fig. 3). We also apologise for the confusion regarding the foci outside the marked cells: these are whole-mount embryonic stainings, and the anti-Plp antibody marks all centrosomes in all cells of the embryo.

      4) The evidence for active centrosomes is not that convincing. Acetylated tubulin is associated with stable MTs, which are not normally organised by "active" centrosomes that nucleate dynamic microtubules. Moreover, it is plausible that centriole foci happen to overlap with the acetylated tubulin staining by chance. This would explain why not all centrosomes colocalise with acetylated tubulin signal. The authors could better test centrosome activity by performing live imaging with EB1-GFP. If centrosomes are active, it is very easy to observe the many comets produced by the centrosomes.

      We appreciate the reviewer’s comment and agree that acetylated tubulin alone is not an ideal marker for centrosome activity. To address this, we performed live imaging of aCC neurons expressing EB1-GFP together with Asl-Tomato. This was technically challenging because we were imaging only two neurons per segment in live embryos, under significant limitations in fluorescence detection and timing. Despite these constraints, we were able to clearly observe EB1 comets emerging from the centrosome and moving toward the cell periphery, providing direct evidence of microtubule nucleation from centrosomes in neurons.

      Importantly, we complemented this with a microtubule depolymerization/polymerization assay, which provides unequivocal evidence that polymerization initiates at the centrosome. After depolymerization, we observed microtubule regrowth from the centrosome, confirming its role as an active microtubule-organizing centre in these neurons. Together, we hope that these results are enough to demonstrate that neuronal centrosomes are functionally active during early axonal development. These experiments are presented in Figure 6 and corresponding text in the manuscript.

      5) If the authors believe that centrosomes have a role in axon pathfinding in sensory neurons, they should show that these centrosomes are active, at least during early stages (again using EB1-GFP imaging).

      We appreciate the reviewer’s suggestion and agree that EB1-GFP imaging would be the most direct way to assess centrosome activity in sensory neurons. However, performing time-lapse imaging in these neurons is technically very demanding due to their location and accessibility in live embryos, and we did not attempt this approach. Instead, we now provide new evidence showing that sensory neuron centrosomes colocalize with both α-tubulin and γ-tubulin. This strongly supports that these centrosomes are associated with microtubule nucleation machinery and are as likely as motor neuron centrosomes to be active during early stages of axon development. These new data have been included in the revised manuscript (see Figure 5 and corresponding text).

      6) The authors mention in the discussion that "increased JNK activity, can result in axonal wiggliness (Karkali et al, 2023)". I therefore wonder whether centrosome loss may induce JNK activation (the stress response), as this would then indicate an indirect effect of centrosome loss on axonal structure rather than a direct influence of centrosome-generated microtubules. The authors could assess whether the DNK-JNK pathway is activated in neurons lacking centrosomes by expressing UAS-Puc-GFP and quantifying the nuclear signal.

      In a new supplementary figure, we now show, using a reporter for JNK signalling as requested, that Sas-4 neurons do not activate the JNK pathway (Suppl. Fig. 4).

      7) In Figure 5, the authors claim that they find "a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo". I don't think this is a strong correlation. The difference in centriole number between embryos with no defects and those with defects is very small. In contrast, the difference between centriole numbers in control (no defects) and mutant (no defects) is very large. So, there does not appear to be a strong correlation between centrosome number and phenotype.

      We agree and we have corrected this sentence to better explain the results.

      Minor comments

      1) I don't understand Figure 3C - why do the % of surviving homozygotes and heterozygotes add up to 100%? Should the grey boxes not relate to dead and the white to surviving?

      Thank you for pointing this out. Figures 1B and 3C represent only the surviving individuals. The grey boxes correspond to surviving homozygotes, and the white boxes correspond to surviving heterozygotes. The percentages add up to 100% only at embryonic stages because all embryos reach late embryonic stages. The grey and white boxes reflect the proportion of these two genotypes among the survivors, not the total number of embryos including those that died. We have changed the text to convey this.
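
      As a minimal sketch of the arithmetic described above (not the authors' analysis code), the following shows why the two bars always sum to 100%: each is a share of the survivors at a given stage, not of all embryos scored. The function name and the counts are hypothetical and serve only as illustration.

```python
def survivor_proportions(homozygous_alive, heterozygous_alive):
    """Return (% homozygotes, % heterozygotes) among surviving individuals.

    Both values are computed relative to the survivors only, so they sum
    to 100 regardless of how many individuals died before this stage.
    """
    total = homozygous_alive + heterozygous_alive
    if total == 0:
        raise ValueError("no survivors to score at this stage")
    homozygous_pct = 100.0 * homozygous_alive / total
    return homozygous_pct, 100.0 - homozygous_pct


# Hypothetical counts: 25 surviving homozygotes and 75 surviving heterozygotes.
hom_pct, het_pct = survivor_proportions(25, 75)
print(f"homozygotes: {hom_pct:.1f}%, heterozygotes: {het_pct:.1f}%")
```

      A drop in the homozygote share between two stages therefore indicates preferential loss of homozygotes, even though the bars themselves do not display the dead individuals.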

      2) "In mouse fibroblasts, myoblasts and endothelial cells, centrosome orientation is important for nuclear positioning and cell migration(Chang et al, 2015; Gomes et al, 2005; Kushner et al, 2014)." Do you mean "centrosome position"?

      Yes, text changed, thank you for spotting it.

      3) In the introduction, the authors mention Meka et al. when saying the centrosomal microtubules are important for axonal development, but they should also discuss the counter argument from Vinopal et al., 2023 (Neuron) that showed how centrosomes were required for neuronal migration but not axon growth, which was instead mediated by Golgi-derived microtubules.

      Done, thank you very much.

      4) Lines 228-230 - repeated sentence

      Corrected, thank you very much.

      5) Additionally, we did not detect centrioles in the quadrant opposite the axon exit point (Fig. 2B n=75) - this data is not in Fig 2B

      Correct, it is in Figure 4B, thank you very much.

      6) "This significant decrease in the humber of centrioles further supports the critical role of Sas-4 in pioneer neurons of the ventral nerve cord (VNC) during Drosophila embryogenesis". It rather highlights that Sas-4 is required for centriole formation in these neurons. Also, humber = number.

      We agree, and have changed the text, thank you very much.

      7) Result title: Non-ciliated sensory neurons have centrioles. This is kind of obvious. A better title may be "axon phenotypes correlate with centriole numbers in sensory neurons" but unfortunately i don't think there is good evidence for this (See major point above).

      We agree and have changed the title. We now believe we have strong evidence to support it, and we hope the additional data presented in the revision convincingly demonstrate this point.

      Reviewer #1 (Significance (Required)):

      As mentioned above, the advance will be important if more evidence is provided. In this case, the paper will be interesting to a broad readership. But currently the paper is limited by the lack of evidence for centrosome function and activity in the neurons.

      We hope that Reviewer 1 now considers that the manuscript is no longer limited in this respect and that it shows convincing evidence for centrosome function and activity in embryonic neurons.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: In this manuscript, Gonzalez et al. examine the potential function of centrosomes in the neurons and muscle cells of Drosophila embryos. By studying various mutant and RNAi lines in which centriole duplication has been disrupted, they conclude that the loss of centrioles disrupts axonal pathfinding and muscle integrity.

      Major points: 1. Throughout the manuscript, the phenotypes presented are often quite subtle. For this reason, I would really recommend that these experiments are scored blind. Perhaps the authors did this, but I didn't see any mention of this.

      All our phenotypic analyses are performed blind. We apologize for not having originally included this information in the Methods section; it has now been added. Embryos are stained using colorimetric methods (DAB) to label the nervous system, while balancer chromosomes are marked with a fluorescent antibody. This approach allows us to assess and quantify phenotypes using white light without knowing whether the embryos are homozygous mutants or heterozygous, which can only be detected by changing the channels to fluorescence.

      1. The authors conclude that neurons have active centrioles that function as centrosomes (Figure 6), but the data here is confusing. The authors state that in these cells they observe acetylated MTs extending from the centrosomes and these colocalised with g-tubulin. But the authors don't show the overlap between centrosomes, g-tubulin and MTs, as they stain for these separately. This is problematic, as it was not clear from these images that the majority of the MTs really are extending from the centrosome: the centrosome may just associate or be close by to these MT cables (Figure 6A,B). Moreover, the authors show that only a fraction of the centrosomes in these cells associate with g-tubulin, so presumably in cells where the centrosomes lack g-tubulin they would not expect the centrosomes to be associated with the MTs-but they do not show that this is the case. Perhaps the authors can't test this, but an alternative would be to show that these MT arrays are absent in Sas-4 mutants. This would give more confidence that these MTs arise from the centrosomes.

      We agree that the initial data based on acetylated microtubules and γ-tubulin colocalization were not sufficient to conclude that microtubules originate from the centrosome, as these markers can only suggest association. To address this, we have now included additional experiments that provide direct evidence of centrosome activity.

      First, we performed live imaging of aCC neurons expressing EB1-GFP together with Asl-Tomato. Despite the technical challenges of imaging only two neurons per segment in live embryos under strict fluorescence and timing constraints, we were able to clearly observe EB1 comets emerging from the centrosome and moving toward the cell periphery. This demonstrates active microtubule nucleation from centrosomes rather than mere proximity to microtubule bundles.

      Second, we carried out a microtubule depolymerization/polymerization assay, which provides unequivocal evidence that polymerization initiates at the centrosome. After depolymerization, microtubules regrew from the centrosome, confirming its role as an active microtubule-organizing center. These experiments go beyond colocalization and directly address the concern that centrosomes might simply be adjacent to microtubule cables.

      Regarding the suggestion to use Sas-4 mutants, while we did not perform this experiment, the regrowth assay combined with EB1 imaging strongly supports that these microtubules originate from the centrosome. All new data are presented in Figure 6 and the corresponding text in the revised manuscript.

      1. The authors show that muscle cell integrity is compromised by centriole-loss (Figure 2). This is very surprising as it is widely believed that centrosomes are non-functional in muscle cells, and the MTs are instead organised around the nuclear envelope. I'm not aware of the situation in Drosophila muscle cells, but the authors should ideally try to examine if the centrioles are functioning as centrosomes in these cells. At the very least they should discuss how they think centriole-loss is influencing the muscle integrity when it is widely believed they are inactive in these cells.

      We do not claim that centrosomes are active in muscle cells at these developmental stages. The observed muscle defects could result from earlier processes such as cell division, migration, or muscle fusion. We agree that this is an intriguing observation; however, pursuing this question further would go beyond the scope of the current manuscript. As requested by the reviewer, we have now expanded the discussion to consider how centriole loss might impact muscle integrity.

      Regardless of the strength of the supporting data, I think the authors should tone down their conclusions. The title and abstract led me to believe that centriole loss would cause significant problems in axonal pathfinding and muscle integrity. In all the mutant specimens examined (and certainly the low magnification views shown in Figure 1D'-F', Figure 1I'-K' and Figure 2D'-F') the mutants look very similar to the WT. Many readers may not get past the title and abstract, so the authors should make it clearer that these defects are very subtle.

      We have changed the text to convey this idea.

      Minor points: 1. In Figures 4 and 5, CP309 staining is relied on to identify centrioles, but there is quite a background of non-specific dots, making it hard to be certain what is a centriole and what isn't. For example, in Figure 5D' there are lots of dots within some of the cells - are any of these centrioles? How can the authors be certain which dot is a centriole in some of the cells shown in Figure 5C'? Is it possible to use a second marker and only count as centrioles dots that are recognised by both antibodies?

      We thank the reviewer for this suggestion and agree that using a second marker improves confidence in centriole identification. In a new supplementary figure (Supplementary Fig. 3), we now show that Asl and Plp colocalize in neurons and provide a quantification of the frequency of this colocalization. This dual labelling confirms the identity of centrioles and addresses the concern about non-specific background.

      We also apologize for any confusion regarding the presence of foci outside the marked cells. These images are whole-mount embryonic stainings, and the anti-Plp antibody labels all centrosomes in all cells of the embryo, which explains the additional foci observed.

      In the abstract the authors state that traditionally centrosomes have been considered to be non-essential in terminally differentiated cells. I don't think this is correct. In the standard "textbook" view of a cell, the centrosome is normally positioned in the centre of the cell organising an extensive array of MTs that are thought to play an important role in organising intracellular transport, the positioning and movement of organelles and the maintenance and establishment of cell polarity. I don't think it is only recent evidence that suggests they play vital roles in terminally differentiated cells.

      We thank the reviewer for this correction and we have changed the text accordingly.

      1. Line 162 the authors state that in the RNAi knockdown lines they observe several additional phenotypes, but then in the same sentence (Line 164) they say that these defects were also observed in the original mutant and mutant/Df lines.

      We apologise for this confusion; we have rearranged the sentence for clarity.

      The sentences in Lines 281-287 don't reference any of the Figures, so it seems the authors are just stating these results without presenting any data (e.g. "Significantly, we also found a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo"). If they've tested this correlation, they should show it.

      We have rearranged the sentences for better understanding.

      In Figure 7 I did not understand how the authors measured tortuosity (wiggliness) and could see no description in the methods. This is important as, again, the defect seems quite subtle, but perhaps I am not understanding which bits of the axon are being measured. Is it just the small bit of the axons close to the asterisks that is being measured, or the whole FasII track?

      We have now added another quantification and additional descriptions in the methods section.

      Reviewer #2 (Significance (Required)):

      The potential function of centrosomes in axonal outgrowth is quite controversial, so this study is potentially of considerable interest.

      However, several aspects of the data presented here were confusing or not terribly convincing. In its present state, I don't think the main conclusions are strongly enough supported by the data.

      We hope that Reviewer 2 now considers that the manuscript is no longer confusing and that it shows convincing evidence for centrosome function and activity in embryonic neurons.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The manuscript of González et al. entitled "Centriole Loss in Embryonic Development Disrupts Axonal Pathfinding and Muscle Integrity" deals with the role of centrosomes in shaping axonal morphology. To this aim the AA analysed Drosophila Sas-4 mutants that are reported to develop until adult stage without centrioles. Remarkably, the AA observe that 50% of the homozygous mutant embryos fail to hatch as larvae. The present observations suggest that centrosome loss results in axonemal shaping defects and muscle developmental abnormalities. Finally, the AA show the presence of functional centrosomes in neurons. In my opinion, the manuscript is interesting because it shows unexpected findings. However, to justify these new findings the AA are required to improve some experimental observations.

      We thank the reviewer for his summary of our work and for considering it interesting. We have taken into account all the comments and believe that these have helped improve our manuscript.

      Major: Abstract- It is unclear in which phenotypic condition the observations of centrosome loss or centrosome presence have been found. Please better explain. l.36. embryos, larvae, adult, from Sas4 or controls? If mutants, the observations are very interesting since Sas4 would be without centrioles. Indeed, Basto et al., show that chemosensory neurons do not develop an axoneme in the absence of centrioles, but extend dendrites toward the sensory bristle.

      We have clarified which observations refer to wild-type conditions and which to Centriole Loss (CL) conditions. CL conditions refer to mutant and downregulation conditions, whereas targeted downregulation refers to RNAi downregulation only in neurons.

      I do not think the use of "centriole" in the main title is appropriate, since the centrioles should be localized by true centriolar antigens rather than by centrosomal antigens. This problem occurs throughout the text and in some figures where the AA image centrioles by centrosomal material. Only in Fig. 5A do the AA properly look at Asl localization. The other pictures of presumptive centrioles or centriole quantification report CP309 dots. This localization does not unequivocally reveal centrioles, since CP309 is essentially required for centrosome-mediated Mt nucleation. There are differentiated Drosophila tissues in which centrioles are present, but inactivated, and unable to recruit pericentriolar material. Mt are nucleated by ncMTOCs that contain centrosomal material and gamma-tubulin. Thus, the centrosomal antigens do not colocalize with centrioles.

      We have changed centrioles to centrosomes in the title and in most sections of the manuscript. We have also included an extra control showing that Asl and Plp colocalize, and we have quantified how often we find this colocalization in neurons (Suppl. Fig. 3). Asl is a reliable and widely used marker for centrioles, as it localizes specifically to the centriole structure (Varmark H, Llamazares S, Rebollo E, Lange B, Reina J, Schwarz H, Gonzalez C. Asterless is a centriolar protein required for centrosome function and embryo development in Drosophila. Curr Biol. 2007 Oct 23;17(20):1735-45. doi: 10.1016/j.cub.2007.09.031. PMID: 17935995).

      Minor: l. 58. The early arrest is mainly due to a checkpoint control. In double mutants for Sas4 and P53 the embryos survive longer, even if their further development is arrested.

      We thank the reviewer for this comment, and we have changed the text accordingly.

      1. Previous works, also quoted by the AA, reported that in mature neurons the centrosomes are inactivated, whereas the present manuscript describes functional centrosomes in the Drosophila motor and peripheral nervous system. This is an intriguing observation that needs a better explanation in the Discussion section.

      We thank the reviewer for this comment, and we have changed the discussion accordingly.

      l.143-145. I understand that 50% of the Sas4 embryos that reach the adult stage have centrioles. Is this correct? If so, how do the AA explain the absence of centrioles in sensory neurons of adult flies as reported by Basto et al.?

      According to our results, they already have fewer centrioles than controls at embryonic stages. In addition, as reported in Basto et al., they continue losing centrioles during larval stages and metamorphosis, which explains why centrioles are not detected at adult stages.

      l.215. It is unclear to me why the AA analyse Sas6 flies, unless they explain the mutant phenotype.

      We analysed Sas-6 mutants to strengthen our conclusions with Sas-4 and to exclude the possibility that the observed phenotypes arise from a centrosome-independent function of Sas-4. We took several additional steps to confirm that the effects are specifically due to centrosome loss, and the use of Sas-6 mutants is one of them.

      l.221. How have the centrioles been quantified? What antibody did the AA use?

      We have quantified centrosomes using antibodies against Plp (CP309) and Asl-YFP expression.

      l.244. and Fig 4C,D. I see high background with CP309. As reported previously I think better to use antibodies against centriolar proteins, such as Sas6, Ana1, Asl, or Sas4 ( if centrioles are present in 50% of mutants as the AA claim, the antibody could be also useful). In addition, I can see some CP309 spots in Fig 4E,F. Are they centrioles?

      Indeed, as we report, Sas-4 mutant embryos are not totally devoid of centrosomes. In addition, and we apologise for the confusion, but the reason why there are foci outside the marked cells in control embryos is because these are wholemount embryonic stainings and the anti-Plp antibody marks all centrosomes in all cells in the embryo, not just in the neurons.

      l.270 and Fig. 5A and Fig.5 C-E. Why the AA localize Cp309 and not Asl (Fig. 5A) to detect centrioles?

      In a new supplementary figure, we now show that Asl and Plp colocalize and quantify how often we find this colocalization in neurons (Suppl. Fig. 3). We can therefore use CP309 in neurons to the same effect as Asl.

      L295-296. I cannot see Mts, but only a diffuse staining. I am expecting to see distinct Mt bundles.

      In Figure 5, the MT bundles are now easier to see in the new experiment in Fig. 5F-I, where we performed MT depolymerisation/repolymerisation. Nevertheless, we must stress that we are doing these analyses in wholemount embryonic stainings.

      326-327. How do the AA explain this different lethality, even if both proteins are involved in centriole assembly?

      We have now redone all the viability and mutant phenotype analyses using the Sas-6 CRISPR mutant over the Deficiency, which is a better way to assess the phenotype.

      335-337. In my opinion the quoted publications are not relevant.

      We believe that these references back up our hypothesis because:

      • Metzger et al 2012 stress the importance of nuclear position in muscle development in Drosophila
      • Loh et al 2023, relate centrosomes with nuclear migration in Drosophila
      • Tillery et al 2018, is a review describing MTs in muscle development in Drosophila.

      358-359. Does maternal contribution persist after gastrulation?

      While bulk degradation occurs by midblastula transition, some stable maternal products persist beyond gastrulation. In our case, if centrioles are formed due to the maternal contribution, they will only be diluted by cell division, which explains why we can detect centrioles at late embryonic stages.

      l.366. This is an intriguing point, but as previously observed I have some problems with centriole localization. References. Please make journal abbreviations uniform and check page numbers.

      We hope we have clarified this problem with the new experiments showing MT repolymerisation from the centrosomes in neurons.

      Reviewer #3 (Significance (Required)):

      The manuscript is potentially interesting for people working in cell and molecular biology, and development. However, the paper needs additional work to be suitable for publication.

      We hope that Reviewer 3 considers that the additional work and revision make this manuscript suitable for publication.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #2

      Evidence, reproducibility and clarity

      Summary: In this manuscript, Gonzalez et al. examine the potential function of centrosomes in the neurons and muscle cells of Drosophila embryos. By studying various mutant and RNAi lines in which centriole duplication has been disrupted, they conclude that the loss of centrioles disrupts axonal pathfinding and muscle integrity.

      Major points:

      1. Throughout the manuscript, the phenotypes presented are often quite subtle. For this reason, I would really recommend that these experiments are scored blind. Perhaps the authors did this, but I didn't see any mention of this.
      2. The authors conclude that neurons have active centrioles that function as centrosomes (Figure 6), but the data here is confusing. The authors state that in these cells they observe acetylated MTs extending from the centrosomes and these colocalised with g-tubulin. But the authors don't show the overlap between centrosomes, g-tubulin and MTs, as they stain for these separately. This is problematic, as it was not clear from these images that the majority of the MTs really are extending from the centrosome: the centrosome may just associate or be close by to these MT cables (Figure 6A,B). Moreover, the authors show that only a fraction of the centrosomes in these cells associate with g-tubulin, so presumably in cells where the centrosomes lack g-tubulin they would not expect the centrosomes to be associated with the MTs-but they do not show that this is the case. Perhaps the authors can't test this, but an alternative would be to show that these MT arrays are absent in Sas-4 mutants. This would give more confidence that these MTs arise from the centrosomes.
      3. The authors show that muscle cell integrity is compromised by centriole-loss (Figure 2). This is very surprising as it is widely believed that centrosomes are non-functional in muscle cells, and the MTs are instead organised around the nuclear envelope. I'm not aware of the situation in Drosophila muscle cells, but the authors should ideally try to examine if the centrioles are functioning as centrosomes in these cells. At the very least they should discuss how they think centriole-loss is influencing the muscle integrity when it is widely believed they are inactive in these cells.
      4. Regardless of the strength of the supporting data, I think the authors should tone down their conclusions. The title and abstract led me to believe that centriole loss would cause significant problems in axonal pathfinding and muscle integrity. In all the mutant specimens examined (and certainly the low magnification views shown in Figure 1D'-F', Figure 1I'-K' and Figure 2D'-F') the mutants look very similar to the WT. Many readers may not get past the title and abstract, so the authors should make it clearer that these defects are very subtle.

      Minor points:

      1. In Figures 4 and 5, CP309 staining is relied on to identify centrioles, but there is quite a background of non-specific dots, making it hard to be certain what is a centriole and what isn't. For example, in Figure 5D' there are lots of dots within some of the cells - are any of these centrioles? How can the authors be certain which dot is a centriole in some of the cells shown in Figure 5C'? Is it possible to use a second marker and only count as centrioles dots that are recognised by both antibodies?
      2. In the abstract the authors state that traditionally centrosomes have been considered to be non-essential in terminally differentiated cells. I don't think this is correct. In the standard "textbook" view of a cell, the centrosome is normally positioned in the centre of the cell organising an extensive array of MTs that are thought to play an important role in organising intracellular transport, the positioning and movement of organelles and the maintenance and establishment of cell polarity. I don't think it is only recent evidence that suggests they play vital roles in terminally differentiated cells.
      3. Line 162 the authors state that in the RNAi knockdown lines they observe several additional phenotypes, but then in the same sentence (Line 164) they say that these defects were also observed in the original mutant and mutant/Df lines.
      4. The sentences in Lines 281-287 don't reference any of the Figures, so it seems the authors are just stating these results without presenting any data (e.g. "Significantly, we also found a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo"). If they've tested this correlation, they should show it.
      5. In Figure 7 I did not understand how the authors measured tortuosity (wiggliness) and could see no description in the methods. This is important as, again, the defect seems quite subtle, but perhaps I am not understanding which bits of the axon are being measured. Is it just the small bit of the axons close to the asterisks that is being measured, or the whole FasII track?

      Significance

      The potential function of centrosomes in axonal outgrowth is quite controversial, so this study is potentially of considerable interest.

      However, several aspects of the data presented here were confusing or not terribly convincing. In its present state, I don't think the main conclusions are strongly enough supported by the data.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #1

      Evidence, reproducibility and clarity

      In this paper, the authors address the important question of the role of centrosomes during neuronal development. They use Drosophila as an in vivo model. The field is somewhat unclear on the role and importance of centrosomes during neuronal development, although the current data would suggest they are dispensable for axon specification and growth. Early studies in cultured mammalian neurons showed that centrosomes are active and that their microtubules can be cut and transported into the neurites. But a study then showed that centrosomes in these cultured neurons are deactivated relatively early during neuronal development in vitro and that ablating centrosomes even when they are active had no obvious effect on axon specification and growth. Consistent with this, a study in Drosophila provided evidence that centrosomes were not active or necessary in different types of neurons. More recently, a study showed that centrosomal microtubules are dispensable for axon specification and growth in mice in vivo but are required for neuronal migration in the cerebral cortex. However, another study has linked the generation of acetylated microtubules at centrosomes with axon development. In this current study, the authors examine the effect of centrosome loss on various motor and sensory neurons and muscles mainly by examining mutants in essential centriole duplication genes. They associate axonal routing and morphology defects with centrosome loss and provide some evidence that centrosomes could still be active in the developing neurons. Overall, they conclude that centrosomes are active during at least early neuronal development and that this activity is important for proper axonal morphology and routing.

      While I think this study addresses a very interesting and important question, I think as it stands the data is not sufficient to be conclusive on a role for centrosomes during neuronal development. My biggest concern is that most phenotypes have not yet been shown to be cell autonomous, as whole animal mutants have been analysed rather than analysing the effect of cell-specific depletion, and the evidence for active centrosomes needs to be strengthened. If the authors can provide stronger evidence for a role of centrosomes in axonal development then the paper will certainly be of interest to a broad readership.

      Major comments

      1. The sas-6 transallelic combination shows only 17% embryonic lethality compared to 50% embryonic lethality with sas-4 mutants. Given that both mutants should result in the same degree of centrosome loss (this should be quantified in sas-6 mutants) it would suggest that either sas-4 has other roles away from centrosomes or that the sas-4 mutant chromosome used in the experiment has other mutations that affect viability. The effect of picking up "second-site lethal" mutations on mutant chromosomes is common and so I would not be surprised if this is the reason for the difference in phenotypes. This can be addressed either by "cleaning up" the sas-4 mutant chromosome by backcrossing to wild-type lines, allowing recombination to occur and replace the potential second site mutations, or by using transallelic combinations of sas-4, as they did for sas-6. The "easier" option may just be to analyse all the phenotypes with the sas-6 transallelic combination.
      2. Using "whole animal" mutants for assessing neuronal morphology is risky due to non-cell-autonomous effects. The authors have carried out some phenotypic analysis of neurons depleted of Sas-4 by cell-specific RNAi, but I feel they need to do this for all of their analysis. This includes embryonic lethality measures, quantification of centrosome numbers, and all axonal phenotypes in Sas-4 RNAi neurons. It would also be prudent to use 2 distinct RNAi lines to help ensure any phenotypes are not off-target effects (and this may help clarify why the authors see some additional phenotypes with RNAi). Indeed, there are relatively weak phenotypes in muscles when using RNAi compared to the mutants and these potential non-cell-autonomous effects could then have a knock-on effect on neuronal morphology. If the authors were concerned that RNAi is not very efficient (explaining any potential weaker phenotypes than in mutants) the authors could examine the effectiveness of RNAi lines by analysing protein depletion by western blotting or mRNA depletion by rt-qPCR (although this has to be done in a different cell type due to the difficulty in obtaining a neuronal extract).
      3. When analysing centriole presence or absence it is a good idea to stain with two different centriole markers e.g. Asl and Plp. This helps rule out unspecific staining. It is clear from the images that similar sized foci can be observed outside of the cells (see Figure 5A for example), so clearly some of the foci that appear to be within the cells may also be unspecific staining.
      4. The evidence for active centrosomes is not that convincing. Acetylated tubulin is associated with stable MTs, which are not normally organised by "active" centrosomes that nucleate dynamic microtubules. Moreover, it is plausible that centriole foci happen to overlap with the acetylated tubulin staining by chance. This would explain why not all centrosomes colocalise with acetylated tubulin signal. The authors could better test centrosome activity by performing live imaging with EB1-GFP. If centrosomes are active, it is very easy to observe the many comets produced by the centrosomes.
      5. If the authors believe that centrosomes have a role in axon pathfinding in sensory neurons, they should show that these centrosomes are active, at least during early stages (again using EB1-GFP imaging).
      6. The authors mention in the discussion that "increased JNK activity, can result in axonal wiggliness (Karkali et al, 2023)". I therefore wonder whether centrosome loss may induce JNK activation (the stress response), as this would then indicate an indirect effect of centrosome loss on axonal structure rather than a direct influence of centrosome-generated microtubules. The authors could assess whether the DNK-JNK pathway is activated in neurons lacking centrosomes by expressing UAS-Puc-GFP and quantifying the nuclear signal.
      7. In Figure 5, the authors claim that they find "a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo". I don't think this is a strong correlation. The difference in centriole number between embryos with no defects and those with defects is very small. In contrast, the difference between centriole numbers in control (no defects) and mutant (no defects) is very large. So, there does not appear to be a strong correlation between centrosome number and phenotype.

      Minor comments

      1. I don't understand Figure 3C - why do the % of surviving homozygotes and heterozygotes add up to 100%? Should the grey boxes not relate to dead and the white to surviving?
      2. "In mouse fibroblasts, myoblasts and endothelial cells, centrosome orientation is important for nuclear positioning and cell migration(Chang et al, 2015; Gomes et al, 2005; Kushner et al, 2014)." Do you mean "centrosome position"?
      3. In the introduction, the authors mention Meka et al. when saying the centrosomal microtubules are important for axonal development, but they should also discuss the counter argument from Vinopal et al., 2023 (Neuron) that showed how centrosomes were required for neuronal migration but not axon growth, which was instead mediated by Golgi-derived microtubules.
      4. Lines 228-230 - repeated sentence
      5. Additionally, we did not detect centrioles in the quadrant opposite the axon exit point (Fig. 2B n=75) - this data is not in Fig 2B
      6. "This significant decrease in the humber of centrioles further supports the critical role of Sas-4 in pioneer neurons of the ventral nerve cord (VNC) during Drosophila embryogenesis". It rather highlights that Sas-4 is required for centriole formation in these neurons. Also, humber = number.
      7. Result title: Non-ciliated sensory neurons have centrioles. This is kind of obvious. A better title may be "axon phenotypes correlate with centriole numbers in sensory neurons" but unfortunately I don't think there is good evidence for this (see major point above).

      Significance

      As mentioned above, the advance will be important if more evidence is provided. In this case, the paper will be interesting to a broad readership. But currently the paper is limited by the lack of evidence for centrosome function and activity in the neurons.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Reviews):

      Summary:

      Argunşah et al. describe and investigate the mechanisms underlying the differential response dynamics of barrel vs septa domains of the whisker-related primary somatosensory cortex (S1). Upon repeated stimulation, the authors report that the response ratio between multi- and single-whisker stimulation increases in layer (L) 4 neurons of the septal domain, while remaining constant in barrel L4 neurons. This difference is attributed to the short-term plasticity properties of interneurons, particularly somatostatin-expressing (SST+) neurons. This claim is supported by the increased density of SST+ neurons found in L4 of the septa compared to barrels, along with a stronger response of (L2/3) SST+ neurons to repeated multi- vs single-whisker stimulation. The role of the synaptic protein Elfn1 is then examined. Elfn1 KO mice exhibited little to no functional domain separation between barrel and septa, with no significant difference in single- versus multi-whisker response ratios across barrel and septal domains. Consistently, a decoder trained on WT data fails to generalize to Elfn1 KO responses. Finally, the authors report a relative enrichment of S2- and M1-projecting cell densities in L4 of the septal domain compared to the barrel domain.

      Strengths:

      This paper describes and aims to study a circuit underlying differential response between barrel columns and septal domains of the primary somatosensory cortex. This work supports the view that barrel and septal domains contribute differently to processing single versus multi-whisker inputs, suggesting that the barrel cortex multiplexes sensory information coming from the whiskers in different domains.

      We thank the reviewer for the very neat summary of our findings that barrel cortex multiplexes converging information in separate domains.

      Weaknesses:

      While the observed divergence in responses to repeated SWS vs MWS between the barrel and septal domains is intriguing, the presented evidence falls short of demonstrating that short-term plasticity in SST+ neurons critically underpins this difference. The absence of a mechanistic explanation for this observation limits the work’s significance. The measurement of SST neurons’ response is not specific to a particular domain, and the Elfn1 manipulation does not seem to be specific to either stimulus type or a particular domain.

      We appreciate the reviewer’s perspective. Although further research is needed to understand the circuit mechanisms underlying the observed phenomenon, we believe our data suggest that altering the short-term dynamics of excitatory inputs onto SST neurons reduces the divergent spiking dynamics in barrels versus septa during repetitive single- and multi-whisker stimulation. Future work could examine how SST neurons, whose somata reside in barrels and septa, respond to different whisker stimuli and the circuits in which they are embedded. At this time, however, the authors believe there is no alternative way to test how the short-term dynamics of excitatory inputs onto SST neurons, as a whole, contribute to the temporal aspects of barrel versus septa spiking.

      The study's reach is further constrained by the fact that results were obtained in anesthetized animals, which may not generalize to awake states.

      We appreciate the reviewer’s concern regarding the generalizability of our findings from anesthetized animals to awake states. Anesthesia was employed to ensure precise individual whisker stimulation (and multi-whisker stimulation in the same animal), which is challenging in awake rodents due to active whisking. While anesthesia may alter higher-order processing, core mechanisms, such as short- and long-term plasticity in the barrel cortex, are preserved under anesthesia (Martin-Cortecero et al., 2014; Mégevand et al., 2009).

      The statistical analysis appears inappropriate, with the use of repeated independent tests, dramatically boosting the false positive error rate.

      Thank you for your feedback on our analysis using independent rank-based tests for each time point in wild-type (WT) animals. To address concerns regarding multiple comparisons and temporal dependencies (for Figures 1F and 4D for now, but we will add more in our revision), we performed a repeated measures ANOVA for WT animals (13 Barrel, 8 Septa, 20 time points), which revealed a significant main effect of Condition (F(1,19) = 16.33, p < 0.001) and a significant Condition-Time interaction (F(19,361) = 2.37, p = 0.001). Post-hoc tests confirmed significant differences between Barrel and Septa at multiple time points (e.g., p < 0.0025 at times 3, 4, 6, 7, 8, 10, 11, 12, 16, 19 after Bonferroni post-hoc correction), supporting a differential multi-whisker vs. single-whisker ratio response in WT animals. In contrast, a repeated measures ANOVA for knock-out (KO) animals (11 Barrel, 7 Septa, 20 time points) showed no significant main effect of Condition (F(1,14) = 0.17, p = 0.684) or Condition-Time interaction (F(19,266) = 0.73, p = 0.791), indicating that the Barrel-Septa difference observed in WT animals is absent in KO animals.
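
      As a rough check on the post-hoc threshold quoted above: a Bonferroni correction divides the family-wise significance level by the number of comparisons, here the 20 time points. The sketch below assumes a family-wise level of 0.05, which the response does not state explicitly, and the function names and example p-values are purely illustrative rather than taken from the authors' analysis.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that falls below the corrected per-test threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Assumed family-wise alpha of 0.05 across 20 time points.
threshold = bonferroni_threshold(0.05, 20)
print(f"per-test threshold: {threshold:.4f}")

# Hypothetical p-values, not data from the study.
print(significant_after_bonferroni([0.001, 0.01, 0.04, 0.0004]))
```

      Under that assumption, the per-test threshold agrees with the p < 0.0025 criterion reported for the individual time points.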

      Furthermore, the manuscript suffers from imprecision; its conclusions are occasionally vague or overstated. The authors suggest a role for SST+ neurons in the observed divergence in SWS/MWS responses between barrel and septal domains. However, this remains speculative, and some findings appear inconsistent. For instance, the increased response of SST+ neurons to MWS versus SWS is not confined to a specific domain. Why, then, would preferential recruitment of SST+ neurons lead to divergent dynamics between barrel and septal regions? The higher density of SST+ neurons in septal versus barrel L4 is not a sufficient explanation, particularly since the SWS/MWS response divergence is also observed in layers 2/3, where no difference in SST+ neuron density is found.

      Moreover, SST+ neuron-mediated inhibition is not necessarily restricted to the layer in which the cell body resides. It remains unclear through which differential microcircuits (barrel vs septum) the enhanced recruitment of SST+ neurons could account for the divergent responses to repeated SWS versus MWS stimulation.

      We fully appreciate the reviewer’s comment. We currently do not provide any evidence on the contribution of SST neurons in the barrels versus septa in layer 4 to the response divergence of spiking observed in SWS versus MWS. We only show that these neurons distribute differentially across the two domains in this layer. It is certainly known that there is molecular and circuit-based diversity of SST-positive neurons in different layers of the cortex, so it is plausible that this includes cells located in the two domains of vS1, something which has not been examined so far. Our data on their distribution are one piece of evidence that SST neurons may have a differential role in inhibiting barrel stellate cells versus septal ones. Morphological reconstructions of SST neurons in L4 of the somatosensory barrel cortex have shown that their dendrites and axons project locally and may be confined to individual domains, even though this was not specifically examined (Fig. 3 of Scala F et al., 2019). The same study also showed that L4 SST cells receive excitatory input from local stellate cells, and it is known that they are also directly excited by thalamocortical fibers (Beierlein et al., 2003; Tan et al., 2008), both of which facilitate.

      As shown in our supplementary figure, the divergence is also observed in L2/3 where, as the reviewer also points out, we do not have a differential distribution of SST cells, at least based on a columnar analysis extending from L4. There are multiple scenarios that could explain this “discrepancy”, which would need to be examined in future studies. One straightforward scenario is that the divergence in spiking in L2/3 domains is inherited from L4 domains, on which L4 SST cells act. Another is that even though L2/3 SST neurons are not biased in their distribution, their input-output function is, something which would need to be examined by detailed in vitro electrophysiological and perhaps optogenetic approaches in S1. Despite the distinctive differences that have been found between the L4 circuitry in S1 and V1 (Scala F et al., 2019), recent observations indicate that small but regular patches of V1 marked by the absence of muscarinic receptor 2 (M2) have high temporal acuity (Ji et al., 2015), and selectively receive input from SST interneurons (Meier et al., 2025). Regions lacking M2 have distinct input and output connectivity patterns from those that express M2 (Meier et al., 2021; Burkhalter et al., 2023). These findings, together with ours, suggest that SST cells preferentially innervate and regulate specific domains (columns) in sensory cortices.

      Regardless of the mechanism, the Elfn1 knock-out mouse line almost exclusively affects the incoming excitability onto SST neurons (see also reply to comment below), hence what can be supported by our data is that changing the incoming short-term synaptic plasticity onto these neurons brings the spiking dynamics between barrels and septa closer together.

      The Elfn1 KO mouse model seems too unspecific to suggest the role of the short-term plasticity in SST+ neurons in the differential response to repeated SWS vs MWS stimulation across domains. Why would Elfn1-dependent short-term plasticity in SST+ neurons be specific to a pathway, or a stimulation type (SWS vs MWS)? Moreover, the authors report that Elfn1 knockout alters synapses onto VIP+ as well as SST+ neurons (Stachniak et al., 2021; previous version of this paper), so why attribute the phenotype solely to SST+ circuitry? In fact, the functional distinctions between barrel and septal domains appear largely abolished in the Elfn1 KO.

      Previous work by others and us has shown that globally removing Elfn1 selectively removes a synaptic process from the brain without altering brain anatomy or structure. This allows us to study how the temporal dynamics of inhibition shape activity, as opposed to inhibition from particular cell types. We will nevertheless update the text to discuss more global implications for SST interneuron dynamics and include a reference to VIP interneurons that contain Elfn1.

      When comparing SWS to MWS, we find that MWS replaces the neighboring excitation which would normally be preferentially removed by short-term plasticity in SST interneurons, thus providing a stable control comparison across animals and genotypes. On average, VIP interneurons failed to show modulation by MWS. We were unable to measure a substantial contribution of VIP cells to this process and also note that the Elfn1 expressing multipolar neurons comprise only ~5% of VIP neurons (Connor and Peters, 1984; Stachniak et al., 2021), a fraction that may be lost when averaging from 138 VIP cells. Moreover, the effect of Elfn1 loss on VIP neurons is quite different and marginal compared to that of SST cells, suggesting that the primary impact of Elfn1 knockout is mediated through SST+ interneuron circuitry. Therefore, even if we cannot rule out that these 5% of VIP neurons contribute to barrel domain segregation, we are of the opinion that their influence would be very limited if any.

      Reviewer #2 (Public Reviews):

      Summary:

      Argunsah and colleagues demonstrate that SST-expressing interneurons are concentrated in the mouse septa and differentially respond to repetitive multi-whisker inputs. Identifying how a specific neuronal phenotype impacts responses is an advance.

      Strengths:

      (1)  Careful physiological and imaging studies.

      (2)  Novel result showing the role of SST+ neurons in shaping responses.

      (3)  Good use of a knockout animal to further the main hypothesis.

      (4)  Clear analytical techniques.

      We thank the reviewer for their appreciation of the study.

      Weaknesses:

      No major weaknesses were identified by this reviewer. Overall, I appreciated the paper but feel it overlooked a few issues and had some recommendations on how additional clarifications could strengthen the paper. These include:

      (1) Significant work from Jerry Chen on how S1 neurons that project to M1 versus S2 respond in a variety of behavioral tasks should be included (e.g. PMID: 26098757). Similarly, work from Barry Connors’ lab on intracortical versus thalamocortical inputs to SST neurons, as well as excitatory inputs onto these neurons (e.g. PMID: 12815025) should be included.

      We thank the reviewer for pointing us to these valuable studies, which we had overlooked. We will include Chen et al. (2015), Cruikshank et al. (2007) and Gibson et al. (1999) to contextualize S1 projections and SST+ inputs, strengthening the study’s foundation, as well as Beierlein et al. (2003), which nicely shows both local and thalamocortical facilitation of excitatory inputs onto L4 SST neurons, in contrast to PV cells. That paper also shows the gradual recruitment of SST neurons by thalamocortical inputs to provide feed-forward inhibition onto stellate cells (regular spiking) of L4 of the rat barrel cortex.

      (2) Using Layer 2/3 as a proxy to what is happening in layer 4 (~line 234). Given that layer 2/3 cells integrate information from multiple barrels, as well as receiving direct VPm thalamocortical input, and given the time window that is being looked at can receive input from other cortical locations, it is not clear that layer 2/3 is a proxy for what is happening in layer 4.

      We agree with the reviewer that what we observe in L2/3 is not necessarily what is taking place in L4 SST-positive cells. The data on L2/3 were included to show that these cells, as a population, can show divergent responses to SWS vs MWS, which is not seen in L2/3 VIP neurons. Regardless of the underlying mechanisms, our overall data support the view that SST-positive neurons can change their activation based on the type of whisker stimulus and that, when the excitatory input dynamics onto these neurons change due to the removal of Elfn1, the recruitment of barrel vs septa spiking changes in the temporal domain. Having said that, the data shown in Supplementary Figure 3 on the response properties of L2/3 neurons above the septa vs above the barrels (one would say in the respective columns) do show the same divergence as in L4. This suggests that a circuit motif may exist that is common to both layers, involving SST neurons that sit in L4, L5 or even L2/3. This implies that, despite the differences in the distribution of SST neurons in septa vs barrels of L4, there is an unidentified input-output spatial connectivity motif that is engaged in both L2/3 and L4. Please also see our response to a similar point raised by reviewer 1.

      (3) Line 267, when discussing distinct temporal response, it is not well defined what this is referring to. Are the neurons no longer showing peaks to whisker stimulation, or are the responses lasting a longer time? It is unclear why PV+ interneurons which may not be impacted by the Elfn1 KO and receive strong thalamocortical inputs, are not constraining activity.

      We thank the reviewer for their comment and will clarify the statement.

      This convergence of response profiles was further evident in stimulus-aligned stacked images, where the emergent differences between barrels and septa under SWS were largely abolished in the KO (Figure 4B). A distinction between directly stimulated barrels and neighboring barrels persisted in the KO. In addition, the initial response continued to differ between barrel and septa, and also between septa and neighbor (Figure 4B). This initial stimulus selectivity potentially represents distinct feedforward thalamocortical activity, which includes PV+ interneuron recruitment that is not directly impacted by the Elfn1 KO (Sun et al., 2006; Tan et al., 2008). PV+ cells are strongly excited by thalamocortical inputs, but these exhibit short-term depression, as does their output, contrasting with the sustained facilitation observed in SST+ neurons. These findings suggest that in WT animals, activity spillover from principal barrels is normally constrained by the progressive engagement of SST+ interneurons in septal regions, driven by Elfn1-dependent facilitation at their excitatory synapses. In the absence of Elfn1, this local inhibitory mechanism is disrupted, leading to longer responses in barrels, delayed but stronger responses in septa, and persistently stronger responses in unstimulated neighbors, resulting in a loss of distinction between the responses of barrel and septa domains that normally diverge over time (see Author response image 1 below).

      Author response image 1.

      (A) Barrel responses are longer following whisker stimulation in KO. (B) Septal responses are slightly delayed but stronger in KO. (C) Unstimulated neighbors show longer persistent responses in KO.

       

      (4) Line 585 “the earliest CSD sink was identified as layer 4…” were post-hoc measurements made to determine where the different shank leads were based on the post-hoc histology?

      Post hoc histology was performed on plane-aligned brain sections, which allowed us to detect barrels and septa and thereby confirm the insertion domain of each recorded shank. The layer specificity of each electrode could therefore not be confirmed by histology, as we did not have coronal sections in which to measure electrode depth.

      (5) For the retrograde tracing studies, how were the M1 and S2 injections targeted (stereotaxically or physiologically)? How was it determined that the injections were in the whisker region (or not)?

      During the retrograde virus injection, the location of M1 and S2 injections was determined by stereotaxic coordinates (Yamashita et al., 2018). After acquiring the light-sheet images, we were able to post hoc examine the injection site in 3D and confirm that the injections were successful in targeting the regions intended. Although it would have been informative to do so, we did not functionally determine the whisker-related M1 and whisker-related S2 region in this experiment.

      (6) Were there any baseline differences in spontaneous activity in the septa versus barrel regions, and did this change in the KO animals?

      Thank you for this interesting question. Our previous study found that there was a reduction in baseline activity in L4 barrel cortex of KO animals at postnatal day (P)12, but no differences were found at P21 (Stachniak et al., 2023).

      Reviewer #3 (Public Reviews):

      Summary:

      This study investigates the functional differences between barrel and septal columns in the mouse somatosensory cortex, focusing on how local inhibitory dynamics, particularly involving Elfn1-expressing SST⁺ interneurons, may mediate temporal integration of multiwhisker (MW) stimuli in septa. Using a combination of in vivo multi-unit recordings, calcium imaging, and anatomical tracing, the authors propose that septa integrate MW input in an Elfn1-dependent manner, enabling functional segregation from barrel columns.

      Strengths:

      The core hypothesis is interesting and potentially impactful. While barrels have been extensively characterized, septa remain less understood, especially in mice, and this study's focus on septal integration of MW stimuli offers valuable insights into this underexplored area. If septa indeed act as selective integrators of distributed sensory input, this would add a novel computational role to cortical microcircuits beyond what is currently attributed to barrels alone. The narrative of this paper is intellectually stimulating.

      We thank the reviewer for finding the study intellectually stimulating.

      Weaknesses:

      The methods used in the current study lack the spatial and cellular resolution needed to conclusively support the central claims. The main physiological findings are based on unsorted multi-unit activity (MUA) recorded via low-channel-count silicon probes. MUA inherently pools signals from multiple neurons across different distances and cell types, making it difficult to assign activity to specific columns (barrel vs. septa) or neuron classes (e.g., SST⁺ vs. excitatory).

      The recording radius (~50-100 µm or more) and the narrow width of septa (~50-100 µm or less) make it likely that MUA from "septal" electrodes includes spikes from adjacent barrel neurons.

      The authors do not provide spike sorting, unit isolation, or anatomical validation that would strengthen spatial attribution. Calcium imaging is restricted to SST⁺ and VIP⁺ interneurons in superficial layers (L2/3), while the main MUA recordings are from layer 4, creating a mismatch in laminar relevance.

      We thank the reviewer for pointing out the possibility of contamination in septal electrodes. Although it is reported in the methods (line 583), we may not have highlighted sufficiently that we used an extremely high threshold (7.5 std) for spike detection precisely to overcome the issue raised here, as it restricts such spatial contamination. Since spike amplitude decays rapidly with distance, at high thresholds only nearby neurons, potentially one or two, contribute to our analysis. We believe that this approach provides a very close approximation of single unit activity (SUA) in our reported data. We will include a sentence earlier in the manuscript to make this explicit and prevent further confusion.
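      The thresholding logic described above can be sketched as follows. This is a minimal illustration only, not the detection code used in the study: the function name, the use of the whole-trace standard deviation as the noise estimate, the two-sided criterion and the merging of consecutive supra-threshold samples are all assumptions made for the example, and the trace is a toy signal.

```python
import statistics


def detect_threshold_crossings(trace, n_sd=7.5):
    """Return sample indices where |trace - mean| first exceeds n_sd * SD.

    Assumptions (not taken from the manuscript): the SD is estimated from the
    whole trace, deflections of either polarity count, and a run of
    consecutive supra-threshold samples is reported as a single event.
    """
    mean = statistics.fmean(trace)
    sd = statistics.pstdev(trace)
    threshold = n_sd * sd
    events = []
    above = False
    for i, x in enumerate(trace):
        now_above = abs(x - mean) > threshold
        if now_above and not above:
            events.append(i)  # first sample of a new supra-threshold run
        above = now_above
    return events


# Toy trace: unit-amplitude alternating "noise" with one large deflection.
toy_trace = [1.0 if i % 2 == 0 else -1.0 for i in range(1000)]
toy_trace[500] = -50.0
print(detect_threshold_crossings(toy_trace))
```

      With a high multiple of the noise SD, only deflections that are large relative to the background are kept, which is the property the reply relies on when arguing that distant units contribute little.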

      Regarding the point on calcium imaging being performed on L2/3 SST and VIP cells instead of L4, both reviewers 1 and 2 brought up the same issue and we responded as follows. As shown in our supplementary figure, the divergence is also observed in L2/3 where we do not have a differential distribution of SST cells, at least based on a columnar analysis extending from L4. There are multiple scenarios that could explain this “discrepancy”, which would need to be examined in future studies. One straightforward possibility is that the divergence in spiking in L2/3 domains is inherited from L4 domains, where L4 SST neurons act. Another is that, even though L2/3 SST neurons are not biased in their distribution, their input-output function is, something that would need to be examined by detailed in vitro electrophysiological and perhaps optogenetic approaches in S1. Despite the distinctive differences that have been found between the L4 circuitry in S1 and V1 (Scala F et al., 2019), recent observations indicate that small but regular patches of V1 marked by the absence of muscarinic receptor 2 (M2) have high temporal acuity (Ji et al., 2015), and selectively receive input from SST interneurons (Meier et al., 2025). Regions lacking M2 have distinct input and output connectivity patterns from those that express M2 (Meier et al., 2021; Burkhalter et al., 2023). These findings, together with ours, suggest that SST cells preferentially innervate and regulate specific domains (columns) in sensory cortices.

      Furthermore, while the role of Elfn1 in mediating short-term facilitation is supported by prior studies, no new evidence is presented in this paper to confirm that this synaptic mechanism is indeed disrupted in the knockout mice used here.

      We thank Reviewer #3 for noting the absence of new evidence confirming that Elfn1 loss disrupts short-term facilitation in our knockout mice. We acknowledge that our study relies on strong, previously published data demonstrating that Elfn1 mediates short-term synaptic facilitation of excitatory inputs onto SST+ interneurons (Sylwestrak and Ghosh, 2012; Tomioka et al., 2014; Stachniak et al., 2019, 2023). These studies consistently show that Elfn1 knockout abolishes facilitation at synapses onto SST+ neurons, leading to altered temporal dynamics, which we hypothesize underlies the observed loss of barrel-septa response divergence in our Elfn1 KO mice (Figure 4). Nevertheless, to address the point raised, we will clarify in the revised manuscript (around lines 245-247 and 271-272) that our conclusions are based on these established findings, stating: “Building on prior evidence that Elfn1 knockout disrupts short-term facilitation in SST+ interneurons (Sylwestrak and Ghosh, 2012; Tomioka et al., 2014; Stachniak et al., 2019, 2023), we attribute the abolished barrel-septa divergence in Elfn1 KO mice to altered SST+ synaptic dynamics, though direct synaptic measurements were not performed here.”

      Additionally, since Elfn1 is constitutively knocked out from development, the possibility of altered circuit formation-including changes in barrel structure and interneuron distribution, cannot be excluded and is not addressed.

      We thank Reviewer #3 for raising the valid concern that constitutive Elfn1 knockout could potentially alter circuit formation, including barrel structure and interneuron distribution. To address this, we will clarify in the revised manuscript (around line ~271 and in the Discussion) that in our previous studies that included both whole-cell patch-clamp in acute brain slices ranging from postnatal day 11 to 22 (P11 - P21) and in vivo recordings from barrel cortex at P12 and P21, we saw no gross abnormalities in barrel structure, with Layer 4 barrels maintaining their characteristic size and organization, consistent with wildtype (WT) mice (Stachniak et al., 2019, 2023). While we cannot fully exclude subtle developmental changes, prior studies indicate that Elfn1 primarily modulates synaptic function rather than cortical cytoarchitecture (Tomioka et al., 2014). Elfn1 KO mice show no gross morphological or connectivity differences and the pattern and abundance of Elfn1 expressing cells (assessed by LacZ knock in) appears normal (Dolan and Mitchell, 2013).

      We will add the following to the Discussion: “Although Elfn1 is constitutively knocked out, we find here and in previous studies that barrel structure is preserved (Stachniak et al., 2019, 2023). Further, the distribution of Elfn1 expressing interneurons is not different in KO mice, suggesting minimal developmental disruption (Dolan and Mitchell, 2013).

      Nonetheless, we acknowledge that subtle circuit changes cannot be ruled out without the use of a time-dependent conditional knockout of the gene.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) My biggest concern is regarding statistics. Did the authors repeatedly apply independent tests (Mann-Whitney) without any correction for multiple comparisons (Figures 1 and 4)? In that case, the chances of a spurious "significant" result rise dramatically. 

      In response to the reviewer’s comment, we now present new statistical results obtained with ANOVA and have incorporated them in the manuscript between lines 172 and 192 for WT data and between lines 282 and 298 for Elfn1 KO data. This new statistical approach shows the same differences as we had previously reported, hence consolidating the statements made.
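      To make the multiple-comparisons concern concrete, the short sketch below computes the family-wise error rate for a set of uncorrected tests and applies a Holm step-down adjustment to a list of p-values. It is an illustration of the general statistical issue only: it is not the ANOVA reported in the revised manuscript, the function names are hypothetical, the tests are assumed to be independent, and the p-values are arbitrary placeholders rather than results from the study.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Example: 20 uncorrected tests at alpha = 0.05 (e.g. one test per pulse,
# assuming a 20-pulse train and independent tests).
print(round(familywise_error_rate(0.05, 20), 3))

# Placeholder p-values, not data from the study.
print(holm_adjust([0.01, 0.04, 0.03, 0.005]))
```

      Under these assumptions, the chance of at least one spurious "significant" result across 20 uncorrected tests at α = 0.05 is roughly 64%, which is why a correction or an omnibus test is needed.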

      (2) The findings only hint at a mechanism involving SST+ neurons for how SWS and MWS are processed differently in the barrel vs septal domains. As a direct test of SST+ neuron involvement in the divergence of barrel and septal responses, the authors might consider SST-specific manipulations - for example, inhibitory chemo- or optogenetics during SWS and MWS stimulation.

      We thank the reviewer for this comment and agree that a direct manipulation of SST+ neurons via inhibitory chemo- or optogenetics could provide further supporting evidence for the main claims in our study. We have opted not to perform these experiments for this manuscript, as we feel they can be part of a future study. At the same time, it is conceivable that such manipulations, depending on how they are performed, may lead to larger and non-specific effects on cortical activity, since SST neurons will likely be completely shut down. So even though we certainly appreciate and value the strengths of such approaches, our experiments have addressed a more nuanced hypothesis, namely that the synaptic dynamics onto SST+ neurons matter for the response divergence of septa versus barrels, which could not have been easily and concretely addressed by manipulating SST+ cell firing activity.

      (3) In general, it is hard to comprehend what microcircuit could lead to the observed divergence in the MWS/SWS ratio in the barrel vs septal domain. The preferential recruitment of SST+ neurons during MWS is not specific to a particular domain, and the higher density of SST+ neurons specifically in L4 septa cannot per se explain the diverging MWS/SWS ratio in L4 septal neurons since similar ratio divergence is observed across domains in L2/3 neurons without increased SST+ neuron density in L2/3. This view would also assume that SST+ inhibition remains contained to its own layer and domain. Is this the case? Is it that different microcircuits between barrels and septa differently shape the response to repeated MWS? This is partially discussed in the paper; can the authors develop on that? What would the proposed mechanism be? Can the short-term plasticity of the thalamic inputs (VPM vs POm) be part of the picture?

      We thank the reviewer for raising this important point. We propose that the divergence in MWS/SWS ratios across barrel and septal domains arises from dynamic microcircuit interactions rather than solely from static anatomical features such as the SST+ density we describe, which can only provide a hint. In L2/3, where SST+ density is uniform, divergence persists, suggesting that trans-laminar and trans-domain interactions are key. Barrel domains, primarily receiving VPM inputs, exhibit short-term depression onto excitatory cells and engage PV+ and SST+ neurons to stabilize the MWS/SWS ratio, with Elfn1-dependent facilitation of SST+ neurons gradually increasing inhibition during repetitive SWS. Septal domains, in contrast, are targeted by facilitating POm inputs, combined with higher L4 SST+ density and Elfn1-mediated facilitation, producing progressive inhibitory buildup that amplifies the MWS/SWS ratio. SST+ projections in septa may extend trans-laminarly and laterally, influencing L2/3 and neighboring barrels, thereby explaining L2/3 divergence despite uniform SST+ density in L2/3. In this regard, direct laminar-dependent manipulations will be required to confirm whether L2/3 divergence is inherited from L4 dynamics. In Elfn1 KO mice, the loss of facilitation in SST+ neurons likely flattens these dynamics, disrupting functional segregation. Future experiments using VPM/POm-specific optogenetic activation and SST+ silencing will be critical to directly test this model.

      We expanded the discussion accordingly.

      (4) Can the decoder generalize between SWS and MWS? In this condition, if the decoder accuracy is higher for barrels than septa, it would support the idea that septa are processing the two stimuli differently. 

      Our results show that septal decoding accuracy is generally higher than barrel accuracy when generalizing from multi-whisker stimulation (MWS) to single-whisker stimulation (SWS), indicating distinct information processing in septa compared to barrels.

      In wild-type (WT) mice, septal accuracy exceeds barrel accuracy across all time windows (1-50ms, 51-95ms, 1-95ms), with the largest difference in the 51-95ms window (0.9944 vs. 0.9214 at pulse 20, 10Hz stimulation). This septal advantage grows with successive pulses, reflecting robust, separable neural responses, likely driven by the posterior medial nucleus (POm)’s strong MWS integration contrasting with minimal SWS activation. Barrel responses, driven by consistent ventral posteromedial nucleus (VPM) input for both stimuli, are less distinguishable, leading to lower accuracy.

      In Elfn1 knockout (KO) mice, which disrupt excitatory drive to somatostatin-positive (SST+) interneurons, barrel accuracy is higher initially in the 1-50ms window (0.8045 vs. 0.7500 at pulse 1), suggesting reduced early septal distinctiveness. However, septal accuracy surpasses barrels in later pulses and time windows (e.g., 0.9714 vs. 0.9227 in 51-95ms at pulse 20), indicating restored septal processing. This supports the role of SST+ interneurons in shaping distinct MWS responses in septa, particularly in late-phase responses (51-95ms), where inhibitory modulation is prominent, as confirmed by calcium imaging showing stronger SST+ activation during MWS.

      These findings demonstrate that septa process SWS and MWS differently, with higher decoding accuracy reflecting structured, POm- and SST+-driven response patterns. In Elfn1 KO mice, early deficits in septal processing highlight the importance of SST+ interneurons, with later recovery suggesting compensatory mechanisms. 

      We have added Supplementary Figure 4 and included this interpretation between lines 338-353.

      We thank the reviewer for suggesting this analysis.

      (5) It is not clear to me how the authors achieve SWS. How is it that the pipette tip "placed in contact with the principal whisker" does not detach from the principal whisker or stimulate other whiskers? Please clarify the methods. 

      Targeting the specific principal whisker is performed under the stereoscope.  

      Specifically, we have added this statement in line 628:

      “We trimmed the whiskers where necessary, to avoid them touching each other and to avoid stimulating other whiskers. By putting the pipette tip very close (almost touching) to the principal whisker, the movement of the tip (limited to 1mm) would reliably move the targeted whisker. The specificity of the stimulation of the selected principal whisker was observed under the stereoscope.”

      (6) The method for calculating decoder accuracy is not clearly described. How can accuracy exceed 1? The authors should clarify this metric and provide measures of variability (e.g., confidence intervals or standard deviations across runs) to assess the significance of their comparisons. Additionally, using a consistent scale across all plots would improve interpretability.

      We thank the reviewer for raising this point. We have now changed the way accuracies are calculated and adopted a common scale among different plots (see updated Figure 5). We have also changed the methods section accordingly.
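      For reference, the conventional definition of classification accuracy, the fraction of correctly labelled trials, is bounded between 0 and 1, and its variability can be summarized across repeated runs. The sketch below illustrates that convention only; it is not the updated calculation described in the revised methods, and the function names, labels and run values are hypothetical placeholders.

```python
import statistics


def accuracy(true_labels, predicted_labels):
    """Fraction of trials whose predicted label matches the true label (0 to 1)."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def summarize_runs(accuracies):
    """Return (mean, sample SD) of accuracies obtained across repeated runs."""
    mean = statistics.fmean(accuracies)
    sd = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, sd


# Placeholder labels and run accuracies, not data from the study.
truth = ["SWS", "MWS", "MWS", "SWS"]
guess = ["SWS", "MWS", "SWS", "SWS"]
print(accuracy(truth, guess))
print(summarize_runs([0.9, 0.8, 1.0]))
```

      Reporting the mean together with the spread across runs, on a common 0 to 1 axis, is one straightforward way to make panels directly comparable.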

      (7) Figure 1: The sample size is not specified. It looks like the numbers match the description in the methods, but the sample size should be clearly stated here. 

      These are the numbers the reviewer is inquiring about. 

      WT: a 280 × 95 × 20 matrix for the stimulated barrel (14 barrels, 95ms, 20 pulses), a 180 × 95 × 20 matrix for the septa (9 septa, 95ms, 20 pulses), and a 360 × 95 × 20 matrix for the neighboring barrel (18 neighboring barrels, 95ms, 20 pulses). N=4 mice.

      KO: 11-barrel columns, 7 septal columns, 11 unstimulated neighbors from N=4 mice.

      Panels D-F are missing axes and axis labels (firing rate, p-value). Panel D is mislabeled (left, middle, and right). I can't seem to find the yellow line. 

      Thank you for this observation. We made changes in the figures to make them easier to navigate based on the collective feedback from the reviewers.

      Why does the way of comparing the responses to repeated stimulation change between SWS and MWS?

      To assess temporal accumulation of information, we compared responses to repeated single-whisker stimulation (SWS) and multi-whisker stimulation (MWS) using an accumulative decoding approach rather than simple per-pulse firing rates. This method captures domain-specific integration dynamics over successive pulses.

      The use of the term "principal whisker" is confusing, as it could refer to the whisker that corresponds to the recorded barrel. 

      When we use the term principal whisker, the intention is indeed to refer to the whisker corresponding to the recorded barrel during single whisker stimulation. The term principal whisker is removed from Figure legend 1 and legend S1C where it may have led to  ambiguity.    

      Why the statement "after the start of active whisking"? Mice are under anesthesia here; it does not appear to be relevant for the figure. 

      “After the start of active whisking” refers to the state of the barrel cortex circuitry at the time of recordings. The particular reference we use comes from the habit of assessing sensory processing also from a developmental point of view. The reviewer is correct that it has nothing to do with the status of the experiment. Nevertheless, since the reviewer found that it may create confusion, we have now taken it out.

      (8) Figure 3: The y-axis label is missing for panel C. 

      This is now fixed. (dF/F).

      (9) Figure 4: Axis labels are missing.

      Added.

      Minor: 

      (10) Line 36: "progressive increase in septal spiking activity upon multi-whisker stimulation". There is no increase in septal spiking activity upon MWS; the ratio MWS/SWS increases.

      We have changed the sentence as follows: Genetic removal of Elfn1, which regulates the incoming excitatory synaptic dynamics onto SST+ interneurons, leads to the loss of the progressive increase in septal spiking ratio (MWS/SWS) upon stimulation.

      (11) Line 105: domain-specific, rather than column-specific, for consistency.

      We have changed it.

      (12) Lines 173-174: "a divergence between barrel and septa domain activity also occurred in Layer 4 from the 2nd pulse onward (Figure 1E)". The authors only show a restricted number of comparisons. Why not show the p-values as for SWS?

      The statistics are now presented in current Figure 1E.

      (13) Lines 151-153: "Correspondingly, when a single whisker is stimulated repeatedly, the response to the first pulse is principally bottom-up thalamic-driven responses, while the later pulses in the train are expected to also gradually engage cortico-thalamo-cortical and cortico-cortical loops." Can the authors please provide a reference?

      We have now added the following references : (Kyriazi and Simons, 1993; Middleton et al., 2010; Russo et al., 2025).

      (14) Lines 184-186: "Our electrophysiological experiments show a significant divergence of responses over time upon both SWS and MWS in L4 between barrels (principal and neighboring) and adjacent septa, with minimal initial difference". The only difference between the neighboring barrel and septa is the responses to the initial pulse. Can the author clarify? 

      We have now changed the sentence as follows: Our electrophysiological experiments show a significant divergence of responses between domains upon both SWS and MWS in L4. (Line 198 now)

      (15) Line 214: "suggest these interneurons may play a role in diverging responses between barrels and septa upon SWS". Why SWS specifically?

      We have changed the sentence as follows: These results confirmed that SST+ and VIP+ interneurons have higher densities in septa compared to barrels in L4 and suggest these interneurons may play a role in diverging responses between barrels and septa. (Line 231 now).

      (16) Line 235: "This result suggests that differential activation of SST+ interneurons is more likely to be involved in the domain-specific temporal ratio differences between barrels and septa". Why? The results here are not domain-specific.

      We have now revised this statement to: This result suggested that temporal ratio differences specific to barrels and septa might involve differential activation of SST+ interneurons rather than VIP+ interneurons.

      (17) Lines 241-243: "SST+ interneurons in the cortex are known to show distinct short-term synaptic plasticity, particularly strong facilitation of excitatory inputs, which enables them to regulate the temporal dynamics of cortical circuits." Please provide a reference.

      We have now added the following references: (Grier et al., 2023; Liguz-Lecznar et al., 2016).

      (18) Lines 245-247: "A key regulator of this plasticity is the synaptic protein Elfn1, which mediates short-term synaptic facilitation of excitation on SST+ interneurons (Stachniak et al., 2021, 2019; Tomioka et al., 2014)". Is Stachniak et al., 2021 not about the role of Elf1n in excitatory-to-VIP+ neuron synapses?

      The reviewer correctly spotted this discrepancy. This reference has now been removed from this statement.

      (19) Lines 271-272: "Building on our findings that Elfn1-dependent facilitation in SST+ interneurons is critical for maintaining barrel-septa response divergence". The authors did not show that.

      We have now changed the statement to: Building on our findings that Elfn1 is critical for maintaining barrel-septa response divergence  

      (20) Line 280: second firing peak, not "peal".

      Thank you, it is now fixed.

      (21) Lines 304-305: "These results highlight the critical role of Elfn1 in facilitating the temporal integration of sensory inputs through its effects on SST+ interneurons". This claim is also overstated.

      We have now changed the statement to: These results highlight the contribution of Elfn1 to the temporal integration of sensory inputs. (Line 362)

      (22) Line 329: Any reason why not cite Chen et al., Nature 2013?

      We have now added this reference, as also pointed out by reviewer 1.

      (23) Line 341-342: "wS1" and "wS2" instead of S1 and S2 for consistency.

      Thanks, we have now updated the terms.

      Reviewer #2 (Recommendations for the authors): 

      (1) Figure 3D - the SW conditions are labeled but not the MW conditions (two right graphs) - they should be labeled similarly (SSTMW, VIPMW). 

      The two right graphs in Figure 3D represent paired SW vs MW comparisons of the evoked responses for SST and VIP populations, respectively.

      (2) Figure 6D and E: I think it would be better if the Depth measurements were to be on the y-axis, which is more typical of these types of plots.

      We thank the reviewer for this comment. Although we appreciate this may be the case, we feel that the current presentation may be easier for the reader to navigate, and we have hence kept it. 

      (3) Having an operational definition of septa versus barrel would be useful. As the authors point out, this is a tough distinction in a mouse, and often you read papers that use Barrel Wall versus Barrel Hollow/Center - operationally defining how these areas were distinguished would be helpful. 

      We thank the reviewer for this comment and understand the point made.

      We have now updated the methods section in line 611: 

      DiI marks contained within the vGlut2 staining were defined as barrel recordings, while DiI marks outside vGlut2 staining were septal recordings.

      Reviewer #3 (Recommendations for the authors): 

      To support the manuscript's major claims, the authors should consider the following:

      (1) Validate the septal identity of the neurons studied, either anatomically or functionally at the single-cell level (e.g., via Ca²⁺ imaging with confirmed barrel/septa mapping). 

      We thank the reviewer for this suggestion, but we feel that these extensive experiments are beyond the scope of this study. 

      (2) Provide both anatomical and physiological evidence to assess the possibility of altered cortical development in Elfn1 KO mice, including potential changes in barrel structure or SST⁺ cell distribution. 

      To address the reviewer’s point, we have now added the following to the Discussion: “Although Elfn1 is constitutively knocked out, we find here and in previous studies that barrel structure is preserved (Stachniak et al., 2019, 2023). Further, the distribution of Elfn1-expressing interneurons is not different in KO mice, suggesting minimal developmental disruption (Dolan and Mitchell, 2013). Nonetheless, we acknowledge that subtle circuit changes cannot be ruled out without conditional knockouts.”

      (3) Examine the sensory responses of SST⁺ and VIP⁺ interneurons in deeper cortical layers, particularly layer 4, which is central to the study's main conclusions.

      We thank the reviewer for this suggestion and appreciate the value it would bring to the study. We nevertheless feel that these extensive experiments are beyond the scope of this study and have therefore not performed them.

      Minor Comments:

      (1)  The authors used a CLARITY-based passive clearing protocol, which is known to sometimes induce tissue swelling or distortion. This may affect anatomical precision, especially when assigning neurons to narrow domains such as septa versus barrels. Please clarify whether tissue expansion was measured, corrected, or otherwise accounted for during analysis.

      Yes, tissue expansion was accounted for during analysis when assigning laminar positions. We excluded brains with severe distortion.

      (2) While the anatomical data are plotted as a function of "depth from the top of layer 4," the manuscript does not specify the precise depth ranges used to define individual cortical layers in the cleared tissue. Given the importance of laminar specificity in projection and cell type analyses, the criteria and boundaries used to delineate each layer should be explicitly stated.

      Thank you for pointing this out. We now include the criteria for delineating each layer in the manuscript. “Given that the depth of Layer 4 (L4) can be reliably measured due to its well-defined barrel boundaries, and that the relative widths of other layers have been previously characterized (El-Boustani et al., 2018), we estimated laminar boundaries proportionally. Specifically, Layer 2/3 was set to approximately 1.3–1.5 times the width of L4, Layer 5a to ~0.5 times, and Layer 5b to a similar width as L4. Assuming uniform tissue expansion across the cortical column, we extrapolated the remaining laminar thicknesses proportionally.”

      (3) In several key comparisons (e.g., SST⁺ vs. VIP⁺ interneurons, or S2-projecting vs. M1-projecting neurons), it is unclear whether the same barrel columns were analyzed across conditions. Given the anatomical and functional heterogeneity across wS1 columns, failing to control for this may introduce significant confounds. We recommend analyzing matched columns across groups or, if not feasible, clearly acknowledging this limitation in the manuscript.
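
      The proportional scheme quoted above can be sketched as follows. This is an illustrative sketch only, not the authors' analysis code: the function name, the choice of 1.4 as a midpoint of the stated 1.3–1.5 range, and the sign convention (depth measured from the top of L4, negative values toward the pia) are all assumptions.

```python
def laminar_boundaries(l4_thickness_um, l23_factor=1.4, l5a_factor=0.5, l5b_factor=1.0):
    """Estimate layer boundaries (in µm) relative to the top of L4.

    Factors are multiples of the measured L4 thickness. The defaults are
    assumptions: 1.4 is simply the midpoint of the quoted 1.3-1.5 range.
    Negative depths lie above the top of L4 (hypothetical sign convention).
    """
    l4 = float(l4_thickness_um)
    l23 = l23_factor * l4
    l5a = l5a_factor * l4
    l5b = l5b_factor * l4
    return {
        "L2/3": (-l23, 0.0),
        "L4": (0.0, l4),
        "L5a": (l4, l4 + l5a),
        "L5b": (l4 + l5a, l4 + l5a + l5b),
    }

# Hypothetical measured L4 thickness of 200 µm in one cleared brain
for layer, (top, bottom) in laminar_boundaries(200.0).items():
    print(f"{layer}: {top:.0f} to {bottom:.0f} µm")
```

      Because every boundary scales with the measured L4 thickness, uniform tissue expansion changes all values by the same factor and leaves the layer assignment of a given cell unchanged.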

      We thank the reviewer for raising this important point. For the comparison of SST⁺ versus VIP⁺ interneurons, it would in principle have been possible to analyze the same barrel columns across groups. However, because some of the cleared brains did not reach the optimal level of clarity, our choice of columns was limited, and we were not always able to obtain sufficiently clear data from the same columns in both groups. Similarly, for the analysis of S2- versus M1-projecting neurons, variability in the position and spread of retrograde virus injections made it difficult to ensure measurements from identical barrel columns. We have now added a statement in the Discussion to acknowledge this limitation.

      (4) Figure 1C: Clarify what each point in the t-SNE plot represents, e.g., a single trial, a recording channel, or an averaged response. Also, describe the input features used for dimensionality reduction, including time windows and preprocessing steps.

      In response to the reviewer’s comment, we have now added the following in the methods: In summary, each point in the t-SNE plots represents an averaged response across 20 trials for a specific domain (barrel, septa, or neighbor) and genotype (WT or KO), with approximately 14 points per domain derived from the 280 trials in each dataset. The input features are preprocessed by averaging blocks of 20 trials into 1900-dimensional vectors (95 ms × 20), which are then reduced to 2D using t-SNE with the specified parameters. This approach effectively highlights the segregation and clustering patterns of neural responses across cortical domains in both WT and KO conditions.
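
      As a hedged illustration of the preprocessing described above (not the authors' code), the block-averaging step can be sketched in plain Python. The function name and the synthetic data are assumptions; the only figures taken from the response are 280 trials, blocks of 20, and 1900 features per vector. The resulting 14 vectors per domain would then be passed to any standard t-SNE implementation for the 2D embedding.

```python
def block_average(trials, block_size=20):
    """Average consecutive blocks of `block_size` trials, element-wise.

    `trials` is a list of equal-length response vectors. A trailing block
    with fewer than `block_size` trials is dropped (an assumption).
    """
    n_blocks = len(trials) // block_size
    averaged = []
    for b in range(n_blocks):
        block = trials[b * block_size:(b + 1) * block_size]
        # zip(*block) iterates over features, pairing the same bin across trials
        averaged.append([sum(col) / block_size for col in zip(*block)])
    return averaged

# Synthetic stand-in data: 280 trials, each a 1900-element response vector
trials = [[float((t + i) % 7) for i in range(1900)] for t in range(280)]
points = block_average(trials)
print(len(points), len(points[0]))  # 14 1900
```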

      (5) Figures 1D, E (left panels): The y-axes lack unit labeling and scale bars. Please indicate whether values are in spikes/sec, spikes/bin, or normalized units.

      We have now clarified this. 

      (6) Figures 1D, E (right panels): The color bars lack units. Specify whether the values represent raw firing rates, z-scores, or other normalized measures. Replace the vague term "Matrix representation" with a clearer label such as "Pulse-aligned firing heatmap."

      Thank you, we have now done it.

      (7) Figure 1E (bottom panel): There appears to be no legend referring to these panels. Please define labels such as "B" and "S." 

      Thank you, we have now done it.

      (8) Figure 1E legend: If it duplicates the legend from Figure 1D, this should be made explicit or integrated accordingly. 

      We have changed the structure of this figure.

      (9) Figure 1F: Define "AUC" and explain how it was computed (e.g., area under the firing rate curve over 0-50 ms). Indicate whether the plotted values represent percentages and, if so, label the y-axis accordingly. If normalization was applied, describe the procedure. Include sample sizes (n) and specify what each data point represents (e.g., animal, recording site). 

      The following paragraph has been added in the methods section:

      The Area Under the Curve (AUC) was computed as the integral of the smoothed firing rate (spikes per millisecond) over a 50 ms window following each whisker stimulation pulse, using trapezoidal integration. Firing rate data for layer 4 barrel and septal regions in wild-type (WT) and knockout (KO) mice were smoothed with a 3-point moving average and averaged across blocks of 20 trials. Plotted values represent the percentage ratio of multi-whisker (MW) to single-whisker (SW) AUC, with error bars showing the standard error of the mean. Each data point reflects the mean AUC ratio for a stimulation pulse across approximately 11 blocks (220 trials total). The y-axis indicates percentages.
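
      A minimal sketch of this AUC calculation, assuming firing-rate traces sampled in 1 ms bins that have already been averaged across a block of trials. The function names, the edge handling of the moving average, and the example traces are assumptions for illustration and are not taken from the manuscript.

```python
def moving_average_3(x):
    """3-point centered moving average; edge samples use the neighbors available."""
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

def auc_trapezoid(rate, dt_ms=1.0):
    """Trapezoidal integral of a firing-rate trace sampled every dt_ms."""
    return sum((rate[i] + rate[i + 1]) / 2.0 * dt_ms for i in range(len(rate) - 1))

def mw_sw_percent(mw_rate, sw_rate, window_ms=50, dt_ms=1.0):
    """Percentage ratio of multi-whisker to single-whisker AUC after one pulse."""
    n = int(window_ms / dt_ms) + 1  # samples spanning 0..window_ms inclusive
    mw_auc = auc_trapezoid(moving_average_3(mw_rate[:n]), dt_ms)
    sw_auc = auc_trapezoid(moving_average_3(sw_rate[:n]), dt_ms)
    return 100.0 * mw_auc / sw_auc

# Hypothetical flat traces (spikes/ms): the MW response is half the SW response
ratio = mw_sw_percent([1.0] * 60, [2.0] * 60)
print(round(ratio, 1))  # 50.0
```

      Averaging this ratio over blocks, and taking the standard error across blocks, would then give one plotted point with its error bar per stimulation pulse.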

      (10) Figure 3C: Add units to the vertical axis.

      We have added them.

      (11) Figure 3D: Specify what each line represents (e.g., average of n cells, individual responses?). 

      Each line represents the average response of a single neuron.

      (12) Figure 4C legend: Same with what? No legend refers to the bottom panels - please revise to clarify.

      Thank you. We have now changed the figure structure and legends and fixed the missing information issue.

      (13) Supplementary Figure 1B: Indicate the physical length of the scale bar in micrometers. 

      This has been fixed. The scale bar is 250 µm.

      (14) Indicate the catalog number or product name of the 8×8 silicon probe used for recordings.

      We have added this information. It is the A8x8-Edge-5mm-100-200-177-A64.

      References

      (1) Beierlein, M., Gibson, J. R. & Connors, B. W. (2003). Two dynamically distinct inhibitory networks in layer 4 of the neocortex. J. Neurophysiol. 90, 2987–3000.

      (2) Burkhalter, A., D’Souza, R. D. & Ji, W. (2023). Integration of feedforward and feedback information streams in the modular architecture of mouse visual cortex. Annu. Rev. Neurosci. 46, 259–280.

      (3) Chen, J. L., Margolis, D. J., Stankov, A., Sumanovski, L. T., Schneider, B. L. & Helmchen, F. (2015). Pathway-specific reorganization of projection neurons in somatosensory cortex during learning. Nat. Neurosci. 18, 1101–1108.

      (4) Connor, J. R. & Peters, A. (1984). Vasoactive intestinal polypeptide-immunoreactive neurons in rat visual cortex. Neuroscience 12, 1027–1044.

      (5) Cruikshank, S. J., Lewis, T. J. & Connors, B. W. (2007). Synaptic basis for intense thalamocortical activation of feedforward inhibitory cells in neocortex. Nat. Neurosci. 10, 462–468.

      (6) Dolan, J. & Mitchell, K. J. (2013). Mutation of Elfn1 in mice causes seizures and hyperactivity. PLoS One 8, e80491.

      (7) Gibson, J. R., Beierlein, M. & Connors, B. W. (1999). Two networks of electrically coupled inhibitory neurons in neocortex. Nature 402, 75–79.

      (8) Ji, W., Gămănuţ, R., Bista, P., D’Souza, R. D., Wang, Q. & Burkhalter, A. (2015). Modularity in the organization of mouse primary visual cortex. Neuron 87, 632–643.

      (9) Martin-Cortecero, J. & Nuñez, A. (2014). Tactile response adaptation to whisker stimulation in the lemniscal somatosensory pathway of rats. Brain Res. 1591, 27–37.

      (10) Mégevand, P., Troncoso, E., Quairiaux, C., Muller, D., Michel, C. M. & Kiss, J. Z. (2009). Long-term plasticity in mouse sensorimotor circuits after rhythmic whisker stimulation. J. Neurosci. 29, 5326–5335.

      (11) Meier, A. M., Wang, Q., Ji, W., Ganachaud, J. & Burkhalter, A. (2021). Modular network between postrhinal visual cortex, amygdala, and entorhinal cortex. J. Neurosci. 41, 4809–4825.

      (12) Meier, A. M., D’Souza, R. D., Ji, W., Han, E. B. & Burkhalter, A. (2025). Interdigitating modules for visual processing during locomotion and rest in mouse V1. bioRxiv 2025.02.21.639505.

      (13) Scala, F., Kobak, D., Shan, S., Bernaerts, Y., Laturnus, S., Cadwell, C. R., Hartmanis, L., Froudarakis, E., Castro, J. R., Tan, Z. H., et al. (2019). Layer 4 of mouse neocortex differs in cell types and circuit organization between sensory areas. Nat. Commun. 10, 4174.

      (14) Stachniak, T. J., Sylwestrak, E. L., Scheiffele, P., Hall, B. J. & Ghosh, A. (2019). Elfn1-induced constitutive activation of mGluR7 determines frequency-dependent recruitment of somatostatin interneurons. J. Neurosci. 39, 4461–4475.

      (15) Stachniak, T. J., Kastli, R., Hanley, O., Argunsah, A. Ö., van der Valk, E. G. T., Kanatouris, G. & Karayannis, T. (2021). Postmitotic Prox1 expression controls the final specification of cortical VIP interneuron subtypes. J. Neurosci. 41, 8150–8166.

      (16) Stachniak, T. J., Argunsah, A. Ö., Yang, J. W., Cai, L. & Karayannis, T. (2023). Presynaptic kainate receptors onto somatostatin interneurons are recruited by activity throughout development and contribute to cortical sensory adaptation. J. Neurosci. 43, 7101–7118.

      (17) Sun, Q.-Q., Huguenard, J. R. & Prince, D. A. (2006). Barrel cortex microcircuits: Thalamocortical feedforward inhibition in spiny stellate cells is mediated by a small number of fast-spiking interneurons. J. Neurosci. 26, 1219–1230.

      (18) Sylwestrak, E. L. & Ghosh, A. (2012). Elfn1 regulates target-specific release probability at CA1-interneuron synapses. Science 338, 536–540.

      (19) Tan, Z., Hu, H., Huang, Z. J. & Agmon, A. (2008). Robust but delayed thalamocortical activation of dendritic-targeting inhibitory interneurons. Proc. Natl. Acad. Sci. USA 105, 2187–2192.

      (20) Tomioka, N. H., Yasuda, H., Miyamoto, H., Hatayama, M., Morimura, N., Matsumoto, Y., Suzuki, T., Odagawa, M., Odaka, Y. S., Iwayama, Y., et al. (2014). Elfn1 recruits presynaptic mGluR7 in trans and its loss results in seizures. Nat. Commun. 5, 4501.

      (21) Yamashita, T., Vavladeli, A., Pala, A., Galan, K., Crochet, S., Petersen, S. S. & Petersen, C. C. (2018). Diverse long-range axonal projections of excitatory layer 2/3 neurons in mouse barrel cortex. Front. Neuroanat. 12, 33.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This important manuscript provides insights into the competition between Splicing Factor 1 (SF1) and Quaking (QKI) for binding at the ACUAA branch point sequence in a model intron, regulating exon inclusion. The study employs rigorous transcriptomic, proteomic, and reporter assays, with both mammalian cell culture and yeast models. Nevertheless, while the data are convincing, broadening the analysis to additional exons and narrowing the manuscript's title to better align with the experimental scope would strengthen the work.

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, the authors aimed to show that SF1 and QKI compete for the intron branch point sequence ACUAA and provide evidence that QKI represses inclusion when bound to it.

      Major strengths of this manuscript include:

      (1) Identification of the ACUAA-like motif in exons regulated by QKI and SF1.

      (2) The use of the splicing reporter and mutant analysis to show that upstream and downstream ACUAAC elements in intron 10 of RAI14 are required for repressing splicing.

      (3) The use of proteomics to identify proteins in C2C12 nuclear extract that bind to the wild-type and mutant sequences.

      (4) The yeast studies showing lethality when ectopic Qki5 expression was induced, due to increased mis-splicing of transcripts that contain the ACUAA element.

      The authors conclusively show that the ACUAA sequence is bound by QKI and provide strong evidence that this leads to differences in exon inclusion and exclusion. In animal cells, and especially in humans, branchpoint sequences are degenerate but seem to be recognized by specific splicing factors. Although a subset of splicing factors shows tissue-specific expression patterns, most don't, suggesting that yet-to-be-identified mechanisms regulate splicing. This work suggests that an alternate mechanism could be related to the binding affinity of specific RNA binding factors for branchpoint sequences coupled with the level of these different splicing factors in a given cell.

      We thank the reviewer for the positive comments.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Pereira de Castro and coworkers are studying potential competition between a more standard splicing factor SF1, and an alternative splicing factor called QK1. This is interesting because they bind to overlapping sequence motifs and could potentially have opposing effects on promoting the splicing reaction. To test this idea, the authors KD either SF1 or QK1 in mammalian cells and uncover several exons whose splicing regulation follows the predicted pattern of being promoted for splicing by SF1 and repressed by QK1. Importantly, these have introns enriched in SF1 and QK1 motifs. The authors then focus on one exon in particular with two tandem motifs to study the mechanism of this in greater detail and their results confirm the competition model. Mass spec analysis largely agrees with their proposal; however, it is complicated by the apparently quick transition of SF1-bound complexes to later splicing intermediates. An inspired experiment in yeast shows how QK1 competition could potentially have a detrimental impact on splicing in an orthogonal system. Overall, these results show how splicing regulation can be achieved by competition between a "core" and alternative splicing factor and provide additional insight into the complex process of branch site recognition. The manuscript is exceptionally clear and the figures and data are very logically presented. The work will be valuable to those in the splicing field who are interested in both mechanism and bioinformatics approaches to deconvolve any apparent "splicing code" being used by cells to regulate gene expression. Criticisms are minor and the most important of them stem from overemphasis on parts of the manuscript on the evolutionary angle when evolution itself wasn't analyzed per se.

      We thank the reviewer for the positive comments and very clear and fair critical points.

      Strengths:

      (1) The main discovery of the manuscript involving evidence for SF1/QK1 competition is quite interesting and important for this field. This evidence has been missing and may change how people think about branch site recognition.

      (2) The experiments and the rationale behind them are exceptionally clearly and logically presented. This was wonderful!

      Thank you so much. We felt the overall flow of the paper and data make for a nice “story” that conveys a relatively easy-to-understand explanation for a complex subject.

      (3) The experiments are carried out to a high standard and well-designed controls are included.

      (4) The extrapolation of the result to yeast in order to show the potentially devastating consequences of the QK1 competition was very exciting and creative.

      We agree this is a very exciting result and finding! Thanks.

      Weaknesses:

      Overall the weaknesses are relatively minor and involve cases where clarification is necessary, some additional analysis could bolster the arguments, and suggestions for focusing the manuscript on its strengths.

      (1) The title (Ancient...evolutionary outcomes), abstract, and some parts of the discussion focus heavily on the evolutionary implications of this work. However, evolutionary analysis was not performed in these studies (e.g., when did QK1 and SF1 proteins arise and/or diverge? How does this line up with branch site motifs and evolution of U2? Any insight from recent work from Scott Roy et al?). I think this aspect either needs to be bolstered with experimental work/data or this should be tamped down in the manuscript. I suggest highlighting the idea expressed in the sentence "A nuanced implication of this model is that loss-of-function...". To me, this is better supported by the data and potentially by some analysis of mutations associated with human disease.

      We have revised the title and dampened the evolutionary aspects of the previous version of the manuscript.

      (2) One paper that I didn't see cited was that by Tanackovic and Kramer (Mol Biol Cell 2005). This paper is relevant because they KD SF1 and found it nonessential for splicing in vivo. Do their results have implications for those here? How do the results of the KD compare? Could QK1 competition have influenced their findings (or does their work influence the "nuanced implication" model referenced above?)?

      This is an interesting point, and thank you for the suggestion. We have now included a brief description of this study in the Introduction of the revised manuscript and do note that the authors measured intron retention of a beta globin reporter and SF3A1, SF3A2, and SF3A3 during SF1 knockdown, but did not detect elevated unspliced RNA in these targets.

      (3) Can the authors please provide a citation for the statement "degeneracy is observed to a higher degree in organisms with more alternative splicing"? Does recent evolutionary analysis support this?

      We have removed the statement, as it did not add much to the content and I am not sure I can state the concept I was attempting to convey in a simple manner with few citations.

      (4) For the data in Figure 3, I was left wondering if NMD was confounding this analysis. Can the authors respond to this and address this concern directly?

      We have not measured whether the reporters used in Figure 3 produce protein(s). Presumably, though, all spliced reporter RNA would be degraded equally (the included/skipped isoforms’ “reading frames” are not altered from one another). This would not be the case for unspliced nuclear reporter RNA, however. Given this difference, and that our analysis cannot resolve the subcellular localization of the different reporter species, we have removed the measurement of and subsequent results describing unspliced reporter RNA from Figure 3.

      (5) To me, the idea that an engaged U2 snRNP was pulled down in Figure 4F would be stronger if the snRNA was detected. Was that able to be observed by northern or primer extension? Would SF1 be enriched if the U2 snRNA was degraded by RNaseH in the NE?

      We did not measure any co-associating RNAs in this experimental approach, but agree that detecting the U2 snRNA would strengthen the evidence for an engaged U2 snRNP.

      (6) I'm wondering how additive the effects of QK1 and SF1 are... In Figure 2, if QK1 and SF1 are both knocked down, is the splicing of exon 11 restored to "wt" levels?

      This is an interesting question that we were unfortunately unable to address experimentally here.

      (7) The first discussion section has two paragraphs that begin "How does competition between SF1..." and "Relatively little is known about how...". I found the discussion and speculation about localization, paraspeckles, and lncRNAs interesting but a bit detracting from the strengths of the manuscript. I would suggest shortening these two paragraphs into a single one.

      We have revised the Discussion.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors were trying to establish whether competition between the RNA-binding proteins SF1 and QKI controlled splicing outcomes. These two proteins have similar binding sites and protein sequences, but SF1 lacks a dimerization motif and seems to bind a single version of the binding sequence. Importantly, these binding sequences correspond to branchpoint consensus sequences, with SF1 binding leading to productive splicing, but QKI binding leading instead to association with paraspeckle proteins. They show that in human cells SF1 generally activates exons and QKI represses, and a large group of the jointly regulated exons (43% of joint targets) are reciprocally controlled by SF1 and QKI. They focus on one of these exons RAI14 that shows this reciprocal pattern of regulation, and has 2 repeats of the binding site that make it a candidate for joint regulation, and confirm regulation within a minigene context. The authors used the assembly of proteins within nuclear extracts to explain the effect of QKI versus SF1 binding. Finally, the authors show that the expression of QKI is lethal in yeast, and causes splicing defects.

      How this fits in the field. This study is interesting and provides a conceptual advance by providing a general rule on how SF1 and QKI interact in relation to binding sites, and the relative molecular fates followed, so is very useful. Most of the analysis seems to focus on one example, although the molecular analysis and global work significantly add to the picture from the previously published paper about NUMB joint regulation by QKI and SF1 (Zong et al, cited in text as reference 50, that looked at SF1 and QKI binding in relation to a duplicated binding site/branchpoint sequence in NUMB).

      Thank you for the encouraging remarks.

      Strengths:

      The data presented are strong and clear. The ideas discussed in this paper are of wide interest, and present a simple model where two binding sites generate a potentially repressive QKI response, whereas exons that have a single upstream sequence are just regulated by SF1. The assembly of splicing complexes on RNAs derived from RAI14 in nuclear extracts, followed by mass spec gave interesting mechanistic insight into what was occurring as a result of QKI versus SF1 binding.

      Weaknesses:

      I did not think the title best summarises the take-home message and could be perhaps a bit more modest. Although the authors investigated splicing patterns in yeast and human cells, yeast do not have QKI so there is no ancient competition in that case, and the study did not really investigate physiological or evolutionary outcomes in splicing, although it provides interesting speculation on them. Also as I understood it, the important issue was less conserved branchpoints in higher eukaryotes enabling alternative splicing, rather than competition for the conserved branchpoint sequence. So despite the data being strong and properly analysed and discussed in the paper, could the authors think whether they fit best with the take-home message provided in the title? Just as a suggestion (I am sure the authors can do a better job), maybe "molecular competition between variant branchpoint sequences predict physiological and evolutionary outcomes in splicing"?

      Thank you for this point (Reviewer 2 had a similar comment) and the suggestion. We have revised the title.

      Although the authors do provide some global data, most of the detailed analysis is of RAI14. It would have been useful to examine members of the other quadrants in Figure 1C as well for potential binding sites to give a reason why these are not co-regulated in the same way as RAI14. How many of the RAI14 quadrants had single/double sites (the motif analysis seemed to pull out just one), and could one of the non-reciprocally regulated exons be moved into a different quadrant by addition or subtraction of a binding site or changing the branchpoint (using a minigene approach for example).

      This is an interesting point that we have considered. Our intent with the focus on RAI14 was to use a naturally occurring intron bps with evidence of strong QKI binding that did not require a high degree of sequence manipulation or engineering.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Most of my recommendations are really centered on the figures. In their current state, they detract from the data shown and could be improved: I recommend the authors use a uniform font. For example, Figure 1E and F have at least three different fonts of varying sizes making it very messy. In Figure 1C, the authors could bold the RAI14 ex11 or simply indicate that the blue is this exon in the legend, thus removing the text from this very busy graph. In Figure 4F, I would recommend having all the labels the same size and putting those genes of interest like Sf3a1 in bold. This could also be done in Figure 4E.

      Thank you for the suggestion; we have edited these (FYI the font in Figs 1E and 1F was from the rMAPS default output, but I agree, it gives a sloppy appearance).

      (2) In Figures 4D and 4G, is there QKI binding to the downstream deletion mutant after 30 minutes? Also, in Figure 4G, are these all from the same blot? The band sizes seem to be very different between lanes. If these were not on the same blot, the original gels should be submitted.

      A small amount of Qki appears to be binding after 30 min. All lanes/blots are from the same gels/membranes; see new Supplemental Figure 4 for the original (uncropped) images of the blots.

      (3) The authors should indicate the source and concentration of the antibodies used for their WB. They should also indicate the primers used for RT-PCRs.

      We have revised the methods to include the antibody information and have uploaded a supplemental table 8 with all oligonucleotide sequences used (which I (Sam Fagg) neglected to do initially, so that’s my bad).

      Reviewer #2 (Recommendations for the authors):

      (1) This may come down to the authors' preference but branch point and branch site are frequently two words, not a single compound word (branch point vs. branchpoint). In addition, the authors may want to use branchsite with the abbreviation BS more frequently since they often don't describe the specific point of branching, and bp and bps could be confused for the more frequent abbreviations for base pair(s).

      Good suggestion; we have edited the text accordingly.

      (2) In general the addition of page numbers and line numbers to the manuscript would greatly aid reviewers!

      Point taken…

      (3) Introduction; "...under normal growth conditions they are efficiently spliced". I would say MOST introns in yeast are efficiently spliced. This is definitely not universal.

      Text edited to indicate that most are efficiently spliced.

      (4) Introduction; " recognition of the bps by SF1 (mammals) (20)". The choice of reference 20 is an odd one here. I think the Robin Reed and Michael Rosbash paper was the first to show SF1 was the human homolog of BBP.

      Got it, thanks (added #14 here and kept #20 also since it shows the structure of SF1 in complex with a UACUAAC bps.)

      (5) Results; "QK1 and SF1 co-regulate.."; it may be useful for the reader if you could explain in more detail why exon inclusion and intron retention are expected outcomes for QK1 knockdown and vice versa for SF1. The exon inclusion here is more obvious than the intron retention phenotype. (In other words, if more exons are included shouldn't it follow that more introns are removed?)

      We explain the expected results for exon inclusion in the Introduction and this paragraph of the Results. Although we have observed more intron retention under QKI loss-of-function approaches before, I am uncertain where the reviewer sees that we indicate any expected result for intron retention from either QKI or SF1 knockdown. I believe the statement you refer to might be on line 162 and starts with: “Consistent with potentially opposing functions in splicing…” ?

      Also, I agree that if SF1 is a “splicing activator,” one might expect more IR in its absence (but this is not the case; there is, in fact, less), but nonetheless, the opposite outcome is observed with QKI knockdown (more IR). It is unclear why this is the case, and we did not investigate it.

      (6) Results; "QK1 and SF1 co-regulate.."; "Thus the most highly represented set.." To me, the most highly represented set is those which are not both QK1-repressed and SF1-activated. Does this indicate that other factors are involved at most sites than simple competition between these two?

      We have revised the sentence in question to include the text “by quadrant” in order to convey our meaning more precisely.

      (7) Throughout the manuscript, 5 apostrophes and 3 apostrophes are used instead of 5 prime symbols and 3 prime symbols.

      Thank you for pointing that out. We have fixed each instance of this.

      (8) Sometimes SF1 is written as Sf1. (also Tatsf1)

      This was a mouse/human gene/protein nomenclature error that we have fixed; thank you for pointing this out.

      (9) You may want to make sure that figures are labeled consistently with the manuscript text. In Figure 1B, it is RI rather than IR. In Figure 4 it is myoblast NE rather than C2C12 nuclear extract.

      We have fixed these, checked for other examples, and where relevant, edited those too.

      (10) I think Figure 1A could be improved by also including a depiction of the domain arrangements of SF1 and QK1.

      Done.

      (11) I was a bit confused with all the lines in Figure 1E and 1F. What is the difference between the log (pVal) and upregulated plots? Can these figures be simplified or explained more thoroughly?

      Based on this comment and one from Reviewer 1, we have slightly revised the wording (and font) on the output, which we hope clarifies the plots. These are motif enrichment plots generated by rMAPS (Refs 61 and 62) analysis of rMATS (Ref 60) data for exons more included (depicted by the red lines) or more skipped (depicted by the blue lines) compared to control versus a “background” set of exons that are detectable but unchanged. The -log<sub>10</sub> P-value (dotted line) indicates the significance of exons more included in shRNA treatment vs control shRNA (previously read “upregulated”) compared to background exons that are detectable but unchanged; the solid lines indicate the motif score; these are described in the references indicated.

      (12) Figure 1B, it is a bit hard to conclude that there is more AltEx or "RI/IR" in one sample vs. the other from these plots since the points overlay one another. Can you include numbers here?

      Added (and deleted Suppl Fig S1, which was simply a chart showing the numbers).

      (13) How was PSI calculated in Figure 2A?

      VAST-tools (we state this in the legend in the revised version).

      You may want to include rel protein (or the lower limit of detection) for Figure 2B to be consistent with 2C. Why is KD of SF1 so poor and variable between 2C and 2D?

      We have not investigated this, but these blots show an optimized result that we were able to obtain for the knockdown in each cell type. It may be that HEK293 cells (Fig 2B) have a stronger requirement for SF1 than C2C12 cells…? I would argue that it is not necessarily “poor” in Fig 2C, as we observe ~70% depletion of the protein.

      Why are two bands present in the gel?

      Two to three isoforms of SF1 are present in most cell types.

      A good (or bad, really) example of an SF1 western blot (and knockdown of ~35% in K562 or ~45% in HepG2) can also be seen on the ENCODE project website, for reference:

      https://www.encodeproject.org/documents/6001a414-b096-4073-94ff-3af165617eb5/@@download/attachment/SF1_BGKLV28-49.pdf

      By comparison, I think ours are much more cosmetically pleasing, and our knockdown (especially in C2C12) is much more efficient.

      (14) Figure 3, The asterisk refers to a cryptic product. Can the uaAcuuuCAG be used as a branch point? Presumably the natural 3' SS is now too close so this would result in activation of a downstream 3'SS?

      We did not pursue determining the identity of this minor and likely artefactual product, but we (and others) have observed a similar phenomenon when using splicing reporter-based mutational approaches.

      (15) For the methods. The "RNA extraction, RT -PCR,..." subheading needs to be on its own line. Please add (w/v) or (v/v) to percentages where appropriate. Please convert ug to the symbol for "micro".

      Thank you, we have made these changes.

      (16) In Figure 4B, the text here and legend are microscopic. Even with reading glasses, I couldn't make anything out!

      We have increased the font sizes for the text and scale bar…when referring to “legend” does the reviewer mean the scale bar?

      (17) As a potential discussion item, it is worth noting that SF1 could also repress splicing if it could either not engage with U2AF or be properly displaced by U2 snRNP so the snRNA could pair. I was wondering if QK1 could similarly be activating if it could engage with U2AF. I'm unsure if this could be tested by domain swaps (and is beyond the scope of this paper). It just may be worth speculating about.

      Good point and suggestion…we are looking into this.

      Reviewer #3 (Recommendations for the authors):

      (1) Is the reference in the text to Figure 5F correct for actin splicing (this is just before the discussion)?

      I see references several lines up from this, but I do not see a reference just before the discussion…?

      (2) I was not sure why the minigene experiments showed such high levels of intron retention that seemed to be impacted also by deletion of the branchpoint sequences, and suggest that the two branchpoints are not equal in strength.

      Neither were we, but Reviewer 2 has suggested that degradation of the spliced products could be rapid (NMD substrates) which could complicate the interpretation of what appears to be higher levels of intron retention. Given the possibility that this could be a non-physiological artefact, we have removed the measurement of unspliced reporter and now only show the spliced products (equally subject to degradation) and report their percent inclusion.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the editors of eLife and the reviewers for their thorough evaluation of our study. As regards the final comments of reviewer 1, please note that all experimental replicates were first analyzed separately and were then pooled, since the observed changes were comparable between experiments. This means that statistical analyses were done on pooled biological replicates.


      The following is the authors’ response to the original reviews.

      General Statements

      We thank the reviewers for their thorough and constructive evaluation of our work. We have revised the manuscript carefully and addressed all the criticisms raised, in particular the issues mentioned by several of the reviewers (see point-by-point response below). We have also added a number of explanations in the text for the sake of clarity, while trying to keep the manuscript as concise as possible.

      In our view, the novelty of our research is two-fold. From a neurobiological point of view, we provide conclusive evidence for the existence of glycine receptors (GlyRs) at inhibitory synapses in various brain regions including the hippocampus, dentate gyrus and sub-regions of the striatum. This solves several open questions and has fundamental implications for our understanding of the organisation and function of inhibitory synapses in the telencephalon. Secondly, our study makes use of the unique sensitivity of single molecule localisation microscopy (SMLM) to identify low protein copy numbers. This is a new way to think about SMLM as it goes beyond a mere structural characterisation and towards a quantitative assessment of synaptic protein assemblies.

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity): 

      In this manuscript, the authors investigate the nanoscopic distribution of glycine receptor subunits in the hippocampus, dorsal striatum, and ventral striatum of the mouse brain using single-molecule localization microscopy (SMLM). They demonstrate that only a small number of glycine receptors are localized at hippocampal inhibitory synapses. Using dual-color SMLM, they further show that clusters of glycine receptors are predominantly localized within gephyrin-positive synapses. A comparison between the dorsal and ventral striatum reveals that the ventral striatum contains approximately eight times more glycine receptors and this finding is consistent with electrophysiological data on postsynaptic inhibitory currents. Finally, using cultured hippocampal neurons, they examine the differential synaptic localization of glycine receptor subunits (α1, α2, and β). This study is significant as it provides insights into the nanoscopic localization patterns of glycine receptors in brain regions where this protein is expressed at low levels. Additionally, the study demonstrates the different localization patterns of GlyR in distinct striatal regions and its physiological relevance using SMLM and electrophysiological experiments. However, several concerns should be addressed. 

      The following are specific comments: 

      (1) Colocalization analysis in Figure 1A. The colocalization between Sylite and mEos-GlyRβ appears to be quite low. It is essential to assess whether the observed colocalization is not due to random overlap. The authors should consider quantifying colocalization using statistical methods, such as a pixel shift analysis, to determine whether colocalization frequencies remain similar after artificially displacing one of the channels. 

      Following the suggestion of reviewer 1, we re-analysed CA3 images of Glrb<sup>eos/eos</sup> hippocampal slices by applying a pixel-shift type of control, in which the Sylite channel (in far red) was horizontally flipped relative to the mEos4b-GlyRβ channel (in green, see Methods). As expected, the number of mEos4b-GlyRβ detections per gephyrin cluster was markedly reduced compared to the original analysis (revised Fig. 1B), confirming that the synaptic mEos4b detections exceed chance levels (see page 5). 
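
      As an illustration only (this is not the analysis code used in the study), the logic of such a flip control can be sketched in Python. The function names, the counting radius and the synthetic coordinates are all assumptions made for the example: detections are counted around each cluster centre, then counted again after the cluster channel has been mirrored horizontally.

```python
import random

def detections_per_cluster(detections, clusters, radius):
    """For each cluster centre, count the detections lying within `radius`."""
    r2 = radius * radius
    return [sum(1 for x, y in detections if (x - cx) ** 2 + (y - cy) ** 2 <= r2)
            for cx, cy in clusters]

def flip_horizontally(points, image_width):
    """Mirror point coordinates about the vertical midline of the image."""
    return [(image_width - x, y) for x, y in points]

def mean(values):
    return sum(values) / len(values)

# Synthetic data (hypothetical numbers, arbitrary units).
rng = random.Random(0)
width, height = 1000.0, 1000.0
clusters = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(30)]
# Three detections close to each cluster, plus sparse random background.
detections = [(cx + rng.gauss(0, 15), cy + rng.gauss(0, 15))
              for cx, cy in clusters for _ in range(3)]
detections += [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(50)]

# Observed association versus the chance level after flipping one channel.
observed = mean(detections_per_cluster(detections, clusters, radius=50.0))
control = mean(detections_per_cluster(
    detections, flip_horizontally(clusters, width), radius=50.0))
print(f"observed: {observed:.2f} detections/cluster, flipped control: {control:.2f}")
```

      If the association is specific, the observed count per cluster should clearly exceed the flipped control, which only captures chance overlap at the same overall densities.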

      (2) Inconsistency between Figure 3A and 3B. While Figure 3B indicates an ~8-fold difference in the number of mEos4b-GlyRβ detections per synapse between the dorsal and ventral striatum, Figure 3A does not appear to show a pronounced difference in the localization of mEos4b-GlyRβ on Sylite puncta between these two regions. If the images presented in Figure 3A are not representative, the authors should consider replacing them with more representative examples or providing an expanded image with multiple representative examples. Alternatively, if this inconsistency can be explained by differences in spot density within clusters, the authors should explain that. 

      The pointillist images in Fig. 3A are essentially binary (red-black). Therefore, the density of detections at synapses cannot be easily judged by eye. For clarity, the original images in Fig. 3A have been replaced with two other examples that better reflect the different detection numbers in the dorsal and ventral striatum. 

      (3) Quantification in Figure 5. It is recommended that the authors provide quantitative data on cluster formation and colocalization with Sylite puncta in Figure 5 to support their qualitative observations. 

      This is an important point that was also raised by the other reviewers. We have performed additional experiments to increase the data volume for analysis. For quantification, we used two approaches. First, we counted the percentage of infected cells in which synaptic localisation of the recombinant receptor subunit was observed (Fig. 5C). We found that mEos4b-GlyRa1 consistently localises at synapses, indicating that all cells express endogenous GlyRb. When neurons were infected with mEos4b-GlyRb, fewer cells had synaptic clusters, meaning that indeed, GlyR alpha subunits are the limiting factor for synaptic targeting. In cultures infected with mEos4b-GlyRa2, only very few neurons displayed synaptic localisation (as judged by epifluorescence imaging). We think this shows that GlyRa2 is less capable of forming heteromeric complexes than GlyRa1, in line with our previous interpretation (see pp. 9-10, 13). 

      Secondly, we quantified the total intensity of each subunit at gephyrin-positive domains, both in infected neurons as well as non-infected control cultures (Fig. 5D). We observed that mEos4b-GlyRa1 intensity at gephyrin puncta was higher than that of the other subunits, again pointing to efficient synaptic targeting of GlyRa1. Gephyrin cluster intensities (Sylite labelling) were not significantly different in GlyRb and GlyRa2 expressing neurons compared to the uninfected control, indicating that the lentiviral expression of recombinant subunits does not fundamentally alter the size of mixed inhibitory synapses in hippocampal neurons. Interestingly, gephyrin levels were slightly higher in hippocampal neurons expressing mEos4b-GlyRa1. In our view, this comes from an enhanced expression and synaptic targeting of mEos4b-GlyRa1 heteromers with endogenous GlyRb, pointing to a structural role of GlyRa1/b in hippocampal synapses (pp. 10, 13).

      The new data and analyses have been described and illustrated in the relevant sections of the manuscript.

      (4) Potential for pseudo replication. It's not clear whether they're performing stats tests across biological replica, images, or even synapses. They often quote mean +/- SEM with n = 1000s, and so does that mean they're doing tests on those 1000s? Need to clarify. 

      All experiments were repeated at least twice to ensure reproducibility (N independent experiments). Statistical tests were performed on pooled data across the biological replicates; n denotes the number of data points used for testing (e.g., number of synaptic clusters, detections, cells, as specified in each case). We have systematically given these numbers in the revised manuscript (n, N, and other experimental parameters such as the number of animals used, coverslips, images or cells). Data are generally given as mean +/- SEM or as mean +/- SD as indicated.
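
      To make the distinction between n, N, SD and SEM concrete, here is a minimal sketch with made-up values (the experiment names and numbers are hypothetical, not data from the study). Measurements from N independent experiments are pooled into n data points, and the SEM is the SD divided by the square root of n, so it shrinks as n grows while the SD does not.

```python
import math

def mean_sd_sem(values):
    """Return the mean, sample standard deviation and standard error of the mean."""
    n = len(values)
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / (n - 1)  # sample variance
    sd = math.sqrt(variance)
    return m, sd, sd / math.sqrt(n)

# Hypothetical measurements from two independent experiments.
experiments = {"exp1": [2.0, 4.0, 6.0], "exp2": [3.0, 5.0, 7.0]}
pooled = [v for values in experiments.values() for v in values]

N = len(experiments)  # number of independent experiments
n = len(pooled)       # number of pooled data points used for testing
m, sd, sem = mean_sd_sem(pooled)
print(f"N = {N}, n = {n}: mean = {m:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
```

      Because the SEM depends on n, stating both n and N alongside each value lets the reader judge how much of the apparent precision comes from the number of pooled data points.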

      (5) Does mEoS affect expression levels or function of the protein? Can't see any experiments done to confirm this. Could suggest WB on homogenate, or mass spec? 

      The Glrb<sup>eos/eos</sup> knock-in mouse line has been characterised previously and does not display any ultrastructural or functional deficits at inhibitory synapses (Maynard et al. 2021 eLife). GlyRβ expression and glycine-evoked responses were not significantly different to those of the wildtype. The synaptic localisation of mEos4b-GlyRb in KI animals demonstrates correct assembly of heteromeric GlyRs and synaptic targeting. Accordingly, the animals do not display any obvious phenotype. We have clarified this in the manuscript (p. 4). In the case of cultured neurons, long-term expression of fluorescent receptor subunits with lentivirus has proven ideal to achieve efficient synaptic targeting. The low and continuous supply of recombinant receptors ensures assembly with endogenous subunits to form heteropentameric receptor complexes (e.g. [Patrizio et al. 2017 Sci Rep]). In the present study, lentivirus infection did not induce any obvious differences in the number or size of inhibitory synapses compared to control neurons, as judged by Sylite labelling of synaptic gephyrin puncta (new Fig. 5D).

      (6) Quantification of protein numbers is challenging with SMLM. Issues include i) some of FP not correctly folded/mature, and ii) dependence of localisation rate on instrument, excitation/illumination intensities, and also the thresholds used in analysis. Can the authors compare with another protein that has known expression levels- e.g. PSD95? This is quite an ask, but if they could show copy number of something known to compare with, it would be useful. 

      We agree that absolute quantification with SMLM is challenging, since the number of detections depends on fluorophore maturation, photophysics, imaging conditions, and analysis thresholds (discussed in Patrizio & Specht 2016, Neurophotonics). For this reason, only very few datasets provide reliable copy numbers, even for well-studied proteins such as PSD-95. One notable exception is the study by Maynard et al. (eLife 2021) that quantified endogenous GlyRβ-containing receptors in spinal cord synapses using SMLM combined with correlative electron microscopy. The strength of this work was the use of a KI mouse strain, which ensures that mEos4b-GlyRβ expression follows intrinsic regional and temporal profiles. The authors reported a stereotypic density of ~2,000 GlyRs/µm² at synapses, corresponding to ~120 receptors per synapse in the dorsal horn and ~240 in the ventral horn, taking into account various parameters including receptor stoichiometry and the functionality of the fluorophore. These values are very close to our own calculations of GlyR numbers at spinal cord synapses that were obtained slightly differently in terms of sample preparation, microscope setup, imaging conditions, and data analysis, lending support to our experimental approach. Nevertheless, the obtained GlyR copy numbers at hippocampal synapses clearly have to be taken as estimates rather than precise figures, because the number of detections from a single mEos4b fluorophore can vary substantially, meaning that the fluorophores are not represented equally in pointillist images. This can affect the copy number calculation for a specific synapse, in particular when the numbers are low (e.g. in hippocampus); however, it should not alter the average number of detections (Fig. 1B) or the (median) molecule numbers of the entire population of synapses (Fig. 1C). We have discussed the limitations of our approach (p. 11).
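
      As a back-of-the-envelope check of how the density and copy numbers quoted above relate, the short sketch below divides receptors per synapse by receptor density. The resulting areas are simply what the quoted round numbers imply; they are not values stated in this response, and the function name is an assumption of the example.

```python
def implied_synaptic_area_um2(receptors_per_synapse, density_per_um2):
    """Area (in µm²) implied by a copy number at a given receptor density."""
    return receptors_per_synapse / density_per_um2

DENSITY = 2000.0  # ~2,000 GlyRs/µm², as quoted above

for region, copies in (("dorsal horn", 120), ("ventral horn", 240)):
    area = implied_synaptic_area_um2(copies, DENSITY)
    print(f"{region}: ~{copies} receptors -> ~{area:.2f} µm² implied area")
```

      With a constant density, the two-fold difference in copy number between the two regions corresponds to a two-fold difference in implied synaptic area.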

      (7) Rationale for doing nanobody dSTORM not clear at all. They don't explain the reason for doing the dSTORM experiments. Why not just rely on PALM for coincidence measurements, rather than tagging mEoS with a nanobody, and then doing dSTORM with that? Can they explain? Is it to get extra localisations- i.e. multiple per nanobody? If so, localising same FP multiple times wouldn't improve resolution. Also, no controls for nanobody dSTORM experiments- what about non-spec nb, or use on WT sections? 

      As discussed above (point 6), the detection of fluorophores with SMLM is influenced by many parameters, not least the noise produced by emitting molecules other than the fluorophore used for labelling. Our study is exceptional in that it attempts to identify extremely low molecule numbers (down to 1). To verify that the detections obtained with PALM correspond to mEos4b, we conducted robust control experiments (including pixel-shift as suggested by the reviewer, see point 1, revised Fig. 1B). The rationale for the nanobody-based dSTORM experiments was twofold: (1) to have an independent readout of the presence of low-copy GlyRs at inhibitory synapses and (2) to analyse the nanoscale organisation of GlyRs relative to the synaptic gephyrin scaffold using dual-colour dSTORM with spectral demixing (see p. 6). The organic fluorophores used in dSTORM (AF647, CF680) ensure high photon counts, essential for reliable co-localisation and distance analysis. PALM and dSTORM cannot be combined in dual-colour mode, as they require different buffers and imaging conditions. 

      The specificity of the anti-Eos nanobody was demonstrated by immunohistochemistry in spinal cord cultures expressing mEos4b-GlyRb and wildtype control tissue (Fig. S3). In response to the reviewer's remarks, we also performed a negative control experiment in Glrb<sup>eos/eos</sup> slices (dSTORM), in which the nanobody was omitted (new Fig. S4F,G). Under these conditions, spectral demixing produced a single peak corresponding to CF680 (gephyrin) without any AF647 contribution (Fig. S4F). The background detection of "false" AF647 detections at synapses was significantly lower than in the slices labelled with the nanobody. We conclude that the fluorescence signal observed in our dual-colour dSTORM experiments arises from the specific detection of mEos4b-GlyRb by the nanobody, rather than from background, cross-reactivity or wrong attribution of colour during spectral demixing. We have added these data and explanations in the results (p. 7) and in the figure legend of Fig. S4F,G.

      (8) What resolutions/precisions were obtained in SMLM experiments? Should perform Fourier Ring Correlation (FRC) on SR images to state resolutions obtained (particularly useful for when they're presenting distance histograms, as this will be dependent on resolution). Likewise for precision, what was mean precision? Can they show histograms of localisation precision. 

      This is an interesting question in the context of our experiments with low-copy GlyRs, since the spatial resolution of SMLM is limited also by the density of molecules, i.e. the sampling of the structure in question (Nyquist-Shannon criterion). Accordingly, the priority of the PALM experiments was to improve the sensitivity of SMLM for the identification of mEos4b-GlyRb subunits, rather than to maximize the spatial resolution. The mean localisation precision in PALM was 33 +/- 12 nm, as calculated from the fitting parameters of each detection (Zeiss, ZEN software), which ultimately result from their signal-to-noise ratio. This is a relatively low precision for SMLM, which can be explained by the low brightness of mEos4b compared to organic fluorophores together with the elevated fluorescence background in tissue slices.

      In the case of dSTORM, the aim was to study the relative distribution of GlyRs within the synaptic scaffold, for which a higher localisation precision was required (p. 6). Therefore, detections with a precision ≥ 25 nm were filtered out during analysis with NEO software (Abbelight). The retained detections had a mean localisation precision of 12 +/- 5 nm for CF680 (Sylite) and 11 +/- 4 nm for AF647 (nanobody). These values are given in the revised manuscript (pp. 18, 22).
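
      The filtering step can be pictured with the following sketch. It is not the NEO software workflow: the `precision_nm` field, the function name and the example values are assumptions. Detections at or above the 25 nm cutoff are discarded, and the mean and SD of the retained precisions are then reported.

```python
import statistics

def filter_by_precision(detections, cutoff_nm=25.0):
    """Keep detections whose localisation precision is below the cutoff."""
    return [d for d in detections if d["precision_nm"] < cutoff_nm]

# Hypothetical detections; only the precision field matters here.
detections = [
    {"x": 10.0, "y": 12.0, "precision_nm": 10.0},
    {"x": 11.0, "y": 15.0, "precision_nm": 14.0},
    {"x": 40.0, "y": 42.0, "precision_nm": 25.0},  # at the cutoff: removed
    {"x": 80.0, "y": 90.0, "precision_nm": 40.0},  # above the cutoff: removed
]

kept = filter_by_precision(detections)
precisions = [d["precision_nm"] for d in kept]
mean_precision = statistics.mean(precisions)
sd_precision = statistics.stdev(precisions)
print(f"{len(kept)} of {len(detections)} detections kept, "
      f"precision {mean_precision:.1f} +/- {sd_precision:.1f} nm")
```

      A stricter cutoff improves the mean precision of the retained detections at the cost of discarding more of them, which matters most when the target is present at low copy numbers.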

      (9) Why were DBSCAN parameters selected? How can they rule out multiple localisations per fluor? If low copy numbers (<10), then why bother with DBSCAN? Could just measure distance to each one. 

      Multiple detections of the same fluorophore are intrinsic to dSTORM imaging and have not been eliminated from the analysis. Small clusters of detections likely represent individual molecules (e.g. single receptors in the extrasynaptic regions, Fig. 2A). DBSCAN is a robust clustering method that is quite insensitive to minor changes in the choice of parameters. For dSTORM of synaptic gephyrin clusters (CF680), a relatively low length (80 nm radius) together with a high number of detections (≥ 50 neighbours) were chosen to reconstruct the postsynaptic domain with high spatial resolution (see point 8). In the case of the GlyR (nanobody-AF647), the clustering was done mostly for practical reasons, as it provided the coordinates of the centre of mass of the detections. The low stringency of this clustering (200 nm radius, ≥ 5 neighbours) effectively filters single detections that can result from background noise or incorrect demixing. An additional reference explaining the use of DBSCAN including the choice of parameters is given on p. 22 (see also R2 point 4).
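
      For readers unfamiliar with DBSCAN, the following plain-Python sketch shows how a radius and a minimum neighbour count separate clustered detections from isolated ones, and how a centre of mass is then obtained per cluster. It is a generic textbook implementation, not the NEO software routine; the coordinates are invented, and counting the point itself among its neighbours is an assumption of this sketch.

```python
def region_query(points, i, eps):
    """Indices of all points within `eps` of point i (including i itself)."""
    xi, yi = points[i]
    eps2 = eps * eps
    return [j for j, (x, y) in enumerate(points)
            if (x - xi) ** 2 + (y - yi) ** 2 <= eps2]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbours)
    return labels

def centre_of_mass(points, labels, cluster_id):
    """Mean x and y of the points assigned to one cluster."""
    members = [p for p, label in zip(points, labels) if label == cluster_id]
    n = len(members)
    return (sum(x for x, _ in members) / n, sum(y for _, y in members) / n)

# Hypothetical detections (nm): six around the origin and one isolated point.
points = [(0, 0), (10, 0), (0, 10), (-10, 0), (0, -10), (5, 5), (1000, 1000)]
labels = dbscan(points, eps=200, min_pts=5)  # low-stringency settings
centre = centre_of_mass(points, labels, 0)
print(labels, centre)
```

      With these settings the six nearby detections form one cluster, whose centre of mass can be used for distance measurements, while the isolated detection is labelled as noise and dropped.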

      (10) For microscopy experiment methods, state power densities, not % or "nominal power". 

      Done. We now report the irradiance (laser power density) instead of nominal power (pp. 18, 21). 

      (11) In general, not much data presented. Any SI file with extra images etc.? 

      The original submission included four supplementary figures with additional data and representative images that should have been available to the reviewer (Figs. S1-S4). The SI file has been updated during revision (new Fig. S4E-G). 

      (12) Clarification of the discussion on GlyR expression and synaptic localization: The discussion on GlyR expression, complex formation, and synaptic localization is sometimes unclear, and needs terminological distinctions between "expression level", "complex formation" and "synaptic localization". For example, the authors state:"What then is the reason for the low protein expression of GlyRβ? One possibility is that the assembly of mature heteropentameric GlyR complexes depends critically on the expression of endogenous GlyR α subunits." Does this mean that GlyRβ proteins that fail to form complexes with GlyRα subunits are unstable and subject to rapid degradation? If so, the authors should clarify this point. The statement "This raises the interesting possibility that synaptic GlyRs may depend specifically on the concomitant expression of both α1 and β transcripts." suggests a dependency on α1 and β transcripts. However, is the authors' focus on synaptic localization or overall protein expression levels? If this means synaptic localization, it would be beneficial to state this explicitly to avoid confusion. To improve clarity, the authors should carefully distinguish between these different aspects of GlyR biology throughout the discussion. Additionally, a schematic diagram illustrating these processes would be highly beneficial for readers. 

      We thank the reviewer for pointing this out. We are dealing with several processes: protein expression that determines subunit availability and the assembly of pentameric GlyR complexes, surface expression, membrane diffusion, and accumulation of GlyRb-containing receptor complexes at inhibitory synapses. We have edited the manuscript, particularly the discussion, and tried to be as clear as possible in our wording.

      We chose not to add a schematic illustration for the time being, because any graphical representation is necessarily a simplification. Instead, we preferred to summarise the main numbers in tabular form (Table 1). We are of course open to any other suggestions.

      (13) Interpretation of GlyR localization in the context of nanodomains. The distribution of GlyR molecules on inhibitory synapses appears to be non-homogeneous, instead forming nanoclusters or nanodomains, similar to many other synaptic proteins. It is important to interpret GlyR localization in the context of nanodomain organization. 

      The dSTORM images in Fig. 2 are pointillist representations that show individual detections rather than molecules. Small clusters of detections are likely to originate from a single AF647 fluorophore (in the case of nanobody labelling) and therefore represent single GlyRb subunits. Since GlyR copy numbers are so low at hippocampal synapses (≤ 5), the notion of nanodomain is not directly applicable. Our analysis therefore focused on the integration of GlyRs within the postsynaptic scaffold, rather than attempting to define nanodomain structures (see also response to point 8 of R1). A clarification has been added in the revised manuscript (p. 6).

      Reviewer #1 (Significance): 

      The paper presents biological and technical advances. The biological insights revolve mostly on the documentation of Glycine receptors in particular synapses in forebrain, where they are typically expressed at very low levels. The authors provide compelling data indicating that the expression is of physiological significance. The authors have done a nice job of combining genetically-tagged mice with advanced microscopy methods to tackle the question of distributions of synaptic proteins. Overall these advances are more incremental than groundbreaking. 

      We thank the reviewer for acknowledging both the technical and biological advances of our study. While we recognize that our work builds upon established models, we consider that it also addresses important unresolved questions, namely that GlyRs are present and specifically anchored at inhibitory synapses in telencephalic regions, such as the hippocampus and striatum. From a methodological point of view, our study demonstrates that SMLM can be applied not only for structural analysis of highly abundant proteins, but also to reliably detect proteins present at very low copy numbers. This ability to identify and quantify sparse molecule populations adds a new dimension to SMLM applications, which we believe increases the overall impact of our study beyond the field of synaptic neuroscience.

      Reviewer #2 (Evidence, reproducibility and clarity): 

      In their manuscript "Single molecule counting detects low-copy glycine receptors in hippocampal and striatal synapses" Camuso and colleagues apply single molecule localization microscopy (SMLM) methods to visualize low copy numbers of GlyRs at inhibitory synapses in the hippocampal formation and the striatum. SMLM analysis revealed higher copy numbers in striatum compared to hippocampal inhibitory synapses. They further provide evidence that these low copy numbers are tightly linked to post-synaptic scaffolding protein gephyrin at inhibitory synapses. Their approach profits from the high sensitivity and resolution of SMLM and challenges the controversial view on the presence of GlyRs in these formations although there are reports (electrophysiology) on the presence of GlyRs in these particular brain regions. These new datasets in the current manuscript may certainly assist in understanding the complexity of fundamental building blocks of inhibitory synapses. 

      However I have some minor points that the authors may address for clarification: 

      (1) In Figure 1 the authors apply PALM imaging of mEos4b-GlyRβ (knockin) and here the corresponding Sylite label seems to be recorded in widefield, it is not clearly stated in the figure legend if it is widefield or super-resolved. In Fig 1 A - is the scale bar 5 µm? Some Sylite spots appear to be sized around 1 µm, especially the brighter spots, but maybe this is due to the lower resolution of widefield imaging? Regarding the statistical comparison: what method was chosen to test for normality distribution, I think this point is missing in the methods section. 

      This is correct; the apparent size of the Sylite spots does not reflect the real size of the synaptic gephyrin domain due to the limited resolution of widefield imaging including the detection of out-of-focus light. We have clarified in the legend of Fig. 1A that Sylite labelling was imaged with classic epifluorescence microscopy. The scale bar in Fig. 1A corresponds to 5 µm. Since the data were not normally distributed, nonparametric tests (Kruskal-Wallis one-way ANOVA with Dunn’s multiple comparison test or Mann-Whitney U-test for pairwise comparisons) were used (p. 23). 

      Moreover I would appreciate a clarification and/or citation that the knockin model results in no structural and physiological changes at inhibitory synapses, I believe this model has been applied in previous studies and corresponding clarification can be provided. 

      The Glrb<sup>eos/eos</sup> mouse model has been described previously and does not exhibit any structural or physiological phenotypes (Maynard et al. 2021 eLife). The issue was also raised by reviewer R1 (point 5) and has been clarified in the revised manuscript (p. 4).

      (2) In the next set of experiments the authors switch to demixing dSTORM experiments - an explanation why this is performed is missing in the text - I guess better resolution to perform more detailed distance measurements? For these experiments: which region of the hippocampus did the authors select, I cannot find this information in legend or main text. 

      Yes, the dSTORM experiments enable dual-colour structural analysis at high spatial resolution (see response to R1 point 7). An explanation has been added (p. 6).

      (3) Regarding parameters of demixing experiments: the number of frames (10.000) seems quite low and the exposure time higher than expected for Alexa 647. Can the authors explain the reason for choosing these particular parameters (low expression profile of the target - so better separation?, less fluorophores on label and shorter collection time?) or is there a reference that can be cited? The laser power is given in the methods in percentage of maximal output power, but for better comparison and reproducibility I recommend to provide the values of a power meter (kW/cm2) as lasers may change their maximum output power during their lifetime. 

      Acquisition parameters (laser power, exposure time) for dSTORM were chosen to obtain a good localisation precision (~12 nm; see R1 point 8). The number of frames is adequate to obtain well sampled gephyrin scaffolds in the CF680 channel. In the case of the GlyR (nanobody-AF647), the concept of spatial resolution does not really apply due to the low number of targets (see R1, point 13). Power density (irradiance) values have now been given (pp. 18, 21).

      (4) For analysis of subsynaptic distribution: how did the authors decide to choose the parameters in the NEO software for DBSCAN clustering - was a series of parameters tested to find optimal conditions and did the analysis start with an initial test if data is indeed clustered (K-ripley) or is there a reference in literature that can be provided? 

      DBSCAN parameters were optimised manually, by testing different values. Identification of dense and well-delimited gephyrin clusters (CF680) was achieved with a small radius and a high number of detections (80 nm, ≥ 50 neighbours), whereas filtering of low-density background in the AF647 channel (GlyRs) required less stringent parameters (200 nm, ≥ 5) due to the low number of target molecules. Similar parameters were used in a previous publication (Khayenko et al. 2022, Angewandte Chemie). The reference has been provided on p. 22 (see also R1 point 9).
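
The two parameter sets quoted above can be illustrated with a minimal sketch. It is not the NEO implementation: the `dbscan` function, the synthetic coordinates (in nm) and the convention that a core point needs at least `min_neighbours` other detections within `eps` are all assumptions made for illustration.

```python
import math
import random


def dbscan(points, eps, min_neighbours):
    """Label 2D points with a cluster id (0, 1, ...) or -1 for noise.

    Assumed convention: a point is a core point if at least
    `min_neighbours` OTHER points lie within `eps` of it.
    """
    n = len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if j != i
                and math.hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_neighbours:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_neighbours:
                queue.extend(reachable)  # j is a core point, keep growing
    return labels


random.seed(0)
# Hypothetical sparse background: detections spaced 1000 nm apart.
background = [(2000.0 + 1000.0 * k, 2000.0) for k in range(10)]

# Dense scaffold (CF680-like channel): small radius, many neighbours.
scaffold = [(random.gauss(0, 30), random.gauss(0, 30)) for _ in range(200)]
gephyrin_labels = dbscan(scaffold + background, eps=80, min_neighbours=50)

# Few detections from a sparse target (AF647-like channel): larger
# radius, low neighbour threshold.
receptor = [(random.gauss(0, 20), random.gauss(0, 20)) for _ in range(8)]
glyr_labels = dbscan(receptor + background, eps=200, min_neighbours=5)
```

With the stringent setting the dense scaffold forms a single cluster and the isolated detections are rejected as noise; the eight-detection group would be discarded under that setting but survives the relaxed one, which is why the sparse channel needs different parameters.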

      (5) A conclusion/discussion of the results presented in Figure 5 is missing in the text/discussion. 

      This part of the manuscript has been completely overhauled. It includes new experimental data, quantification of the data (new Fig.5), as well as the discussion and interpretation of our findings (see also R1, point 3). In agreement with our earlier interpretation, the data confirm that low availability of GlyRa1 subunits limits the expression and synaptic targeting of GlyRa1/b heteropentamers. The observation that GlyRa1 overexpression with lentivirus increases the size of the postsynaptic gephyrin domain further points to a structural role, whereby GlyRs can enhance the stability (and size) of inhibitory synapses in hippocampal neurons, even at low copy numbers (pp. 13-14). 

      (6) In line 552 "suspension" is misleading, better use "solution" 

      Done.

      Reviewer #2 (Significance): 

      Significance: The manuscript provides new insights into the presence of GlyRs at low copy numbers by visualizing them via SMLM. This is the first report that visualizes GlyRs optically in the brain, applying the knock-in model of mEos4b-tagged GlyRβ, and quantifies their copy number, comparing the distribution and amount of GlyRs in hippocampus and striatum. Imaging data correspond well to the electrophysiological measurements in the manuscript. 

      Field of expertise: Super-Resolution Imaging and corresponding analysis 

      Reviewer #4 (Evidence, reproducibility and clarity): 

      In this study, Camuso et al., make use of a knock-in mouse model expressing endogenously mEos4b-tagged GlyRβ to detect endogenous glycine receptors using single-molecule localization microscopy. The main conclusion from this study is that in the hippocampus GlyRβ molecules are barely detected, while inhibitory synapses in the ventral striatum seem to express functionally relevant GlyR numbers. 

      I have a few points that I hope help to improve the strength of this study. 

      - In the hippocampus, this study finds that the numbers of detections are very low. The authors perform adequate controls to indicate that these localizations are above noise level. Nevertheless, it remains questionable that these reflect proper GlyRs. The suggestion that in hippocampal synapses the low numbers of GlyRβ molecules "are important in assembly or maintenance of inhibitory synaptic structures in the brain" is in itself interesting, but is not at all supported. It is also difficult to envision how such low numbers could support the structure of a synapse. A functional experiment showing that knockdown of GlyRs affects inhibitory synapse structure in hippocampal neurons would be a minimal test of this. 

      It is not clear what the reviewer means by “it remains questionable that these reflect proper GlyRs”. The PALM experiments include a series of stringent controls (see R1, point 1) demonstrating the existence of low-copy GlyRs at inhibitory synapses in the hippocampus (Fig. 1) and in the striatum (Fig. 3), and are backed up by dSTORM experiments (Fig. 2). We have no reason to doubt that these receptors are fully functional (as demonstrated for the ventral striatum; Fig. 4). However, due to their low number, a role in inhibitory synaptic transmission is clearly limited, at least in the hippocampus and dorsal striatum. 

      We therefore propose a structural role, where the GlyRs could be required to stabilise the postsynaptic gephyrin domain in hippocampal neurons. This is based on the idea that the GlyR-gephyrin affinity is much higher than that of the GABAAR-gephyrin interaction (reviewed in Kasaragod & Schindelin 2018 Front Mol Neurosci). Accordingly, there is a close relationship between GlyRs and gephyrin numbers, sub-synaptic distribution, and dynamics in spinal cord synapses that are mostly glycinergic (Specht et al. 2013 Neuron; Maynard et al. 2021 eLife; Chapdelaine et al. 2021 Biophys J). It is reasonable to assume that low-copy GlyRs could play a similar structural role at hippocampal synapses. A knockdown experiment targeting these few receptors is technically very challenging and beyond the scope of this study. However, in response to the reviewer's question we have conducted new experiments in cultured hippocampal neurons (new Fig. 5). They demonstrate that overexpression of GlyRa1/b heteropentamers increases the size of the postsynaptic domain in these neurons, supporting our interpretation of a structural role of low-copy GlyRs (p. 14).

      - The endogenous tagging strategy is a very strong aspect of this study and provides confidence in the labeling of GlyRβ molecules. One caveat however, is that this labeling strategy does not discriminate whether GlyRβ molecules are on the cell membrane or in internal compartments. Can the authors provide an estimate of the ratio of surface to internal GlyRβ molecules? 

      Gephyrin is known to form a two-dimensional scaffold below the synaptic membrane to which inhibitory GlyRs and GABAARs attach (reviewed in Alvarez 2017 Brain Res). The majority of the synaptic receptors are therefore thought to be located in the synaptic membrane, which is supported by the close relationship between the sub-synaptic distribution of GlyRs and gephyrin in spinal cord neurons (e.g. Maynard et al. 2021 eLife). To demonstrate the surface expression of GlyRs at hippocampal synapses we labelled cultured hippocampal neurons expressing mEos4b-GlyRa1 with anti-Eos nanobody in non-permeabilised neurons (see Author response image 1). The close correspondence between the nanobody (AF647) and the mEos4b signal confirms that the majority of the GlyRs are indeed located in the synaptic membrane.

      Author response image 1.

      Left: Lentivirus expression of mEos4b-GlyRa1 in fixed and non-permeabilised hippocampal neurons (mEos4b signal). Right: Surface labelling of the recombinant subunit with anti-Eos nanobody (AF647). 

      - “We also estimated the absolute number of GlyRs per synapse in the hippocampus. The number of mEos4b detections was converted into copy numbers by dividing the detections at synapses by the average number of detections of individual mEos4b-GlyRβ containing receptor complexes”. In essence this is a correct method to estimate copy numbers, and the authors discuss some of the pitfalls associated with this approach (i.e., maturation of fluorophore and detection limit). Nevertheless, the authors did not subtract the number of background localizations determined in the two negative control groups. This is critical, particularly at these low-number estimations. 

      We fully agree that background subtraction can be useful with low detection numbers. In the revised manuscript, copy numbers are now reported as background-corrected values. Specifically, the mean number of detections measured in wildtype slices was used to calculate an equivalent receptor number, which was then subtracted from the copy number estimates across hippocampus, spinal cord and striatum. This procedure is described in the methods (p. 20) and results (p. 5, 8), and mentioned in the figure legends of Fig. 1C, 3C. The background corrected values are given in the text and Table 1.
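
The correction described above amounts to simple arithmetic, sketched below. The function name and all numerical values are hypothetical placeholders, not data from the study; the sketch only assumes that synaptic detections and wildtype background detections are converted into receptor numbers with the same factor (the mean number of detections per single receptor complex).

```python
from statistics import mean


def background_corrected_copy_numbers(synapse_detections,
                                      wildtype_detections,
                                      detections_per_receptor):
    """Convert detections per synapse into background-corrected copy numbers.

    synapse_detections: detections counted at each synapse (knock-in tissue)
    wildtype_detections: detections counted at synapses in wildtype slices
    detections_per_receptor: mean detections from one receptor complex
    """
    # Background expressed as an equivalent number of receptors.
    background = mean(wildtype_detections) / detections_per_receptor
    # Subtract the same background equivalent from every estimate.
    return [d / detections_per_receptor - background
            for d in synapse_detections]


# Hypothetical example values, for illustration only.
corrected = background_corrected_copy_numbers(
    synapse_detections=[10, 20],
    wildtype_detections=[2, 2],
    detections_per_receptor=4,
)
```

In this sketch individual values are not clipped at zero, so a synapse with fewer detections than the mean background yields a negative estimate; whether to clip is a separate analysis choice that the response does not specify.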

      - Furthermore, the authors state that "The advantage of this estimation is that it is independent of the stoichiometry of heteropentameric GlyRs". However, if the stoichiometry is unknown, the number of counted GlyRβ subunits cannot simply be reported as the number of GlyRs. This should be discussed in more detail, and more carefully reported throughout the manuscript. 

      The reviewer is right to point this out. There is still some debate about the stoichiometry of heteropentameric GlyRs. Configurations with 2a:3b, 3a:2b and 4a:1b subunits have been advanced (e.g. Grudzinska et al. 2005 Neuron; Durisic et al. 2012 J Neurosci; Patrizio et al. 2017 Sci Rep; Zhu & Gouaux 2021 Nature). We have therefore chosen a quantification that is independent of the underlying stoichiometry. Since our quantification is based on very sparse clusters of mEos4b detections that likely originate from a single receptor complex (irrespective of its stoichiometry), the reported values actually reflect the number of GlyRs (and not GlyRb subunits). We have clarified this in the results (p. 5) and throughout the manuscript (Table 1). 

      - The dual-color imaging provides insights into the subsynaptic distribution of GlyRβ molecules in hippocampal synapses. Why are similar studies not performed on synapses in the ventral striatum, where functionally relevant numbers of GlyRβ molecules are found? Here insights into the subsynaptic receptor distribution would be of much more interest, as they can be tied to function. 

      This is an interesting suggestion. However, the primary aim of our study was to identify the existence of GlyRs in hippocampal regions. At low copy numbers, the concept of sub-synaptic domains (SSDs, e.g. Yang et al. 2021 EMBO Rep) becomes irrelevant (see R1 point 13). It should be pointed out that the dSTORM pointillist images (Fig. 2A) represent individual GlyR detections rather than clusters of molecules. In the striatum, our specific purpose was to solve an open question about the presence of GlyRs in different subregions (putamen, nucleus accumbens).

      - It is unclear how the experiments in Figure 5 add to this study. These results are valid, but do not seem to directly test the hypothesis that "the expression of α subunits may be limiting factor controlling the number of synaptic GlyRs". These experiments simply test if overexpressed α subunits can be detected. If the α subunits are limiting, measuring the effect of α subunit overexpression on GlyRβ surface expression would be a more direct test. 

      Both R1 and R2 have also commented on the data in Fig. 5 and their interpretation. We have substantially revised this section as described before (see R1 point 3) including additional experiments and quantification of the data (new Fig. 5). The findings lend support to our earlier hypothesis that GlyR alpha subunits (in particular GlyRa1) are the limiting factor for the expression of heteropentameric GlyRa/b in hippocampal neurons (pp. 13-14). Since the GlyRa1 subunit itself does not bind to gephyrin (Patrizio et al. 2017 Sci Rep), the synaptic localisation of the recombinant mEos4b-GlyRa1 subunits is proof that they have formed heteropentamers with endogenous GlyRb subunits and driven their membrane trafficking, which the GlyRb subunits are incapable of doing on their own.

      Reviewer #4 (Significance): 

      These results are based on carefully performed single-molecule localization experiments, and are well-presented and described. The knockin mouse with endogenously tagged GlyRβ molecules is a very strong aspect of this study and provides confidence in the labeling, the combination with single-molecule localization microscopy is very strong as it provides high sensitivity and spatial resolution. 

      The conceptual innovation however seems relatively modest, these results confirm previous studies but do not seem to add novel insights. This study is entirely descriptive and does not bring new mechanistic insights. 

      This study could be of interest to a specialized audience interested in glycine receptor biology, inhibitory synapse biology and super-resolution microscopy. 

      My expertise is in super-resolution microscopy, synaptic transmission and plasticity 

      As we have stated before, the novelty of our study lies in the use of SMLM for the identification of very small numbers of molecules, which requires careful control experiments. This is something that has not been done before and that can be of interest to a wider readership, as it opens up SMLM for ultrasensitive detection of rare molecular events. Using this approach, we solve two open scientific questions: (1) the demonstration that low-copy GlyRs are present at inhibitory synapses in the hippocampus, (2) the sub-region specific expression and functional role of GlyRs in the ventral versus dorsal striatum.

      The following review was provided later under the name “Reviewer #4”. To avoid confusion with the last reviewer from above we will refer to this review as R4-2.

      Reviewer #4-2 (Evidence, reproducibility and clarity):  

      Summary:

      Provide a short summary of the findings and key conclusions (including methodology and model system(s) where appropriate).

      The authors investigate the presence of synaptic glycine receptors in the telencephalon, whose presence and function is poorly understood. 

      Using a transgenically labeled glycine receptor beta subunit (Glrb-mEos4b) mouse model together with super-resolution microscopy (SMLM, dSTORM), they demonstrate the presence of a low but detectable amount of synaptically localized GLRB in the hippocampus. While they do not perform a functional analysis of these receptors, they do demonstrate that these subunits are integrated into the inhibitory postsynaptic density (iPSD) as labeled by the scaffold protein gephyrin. These findings demonstrate that a low level of synaptically localized glycine receptor subunits exists in the hippocampal formation, although whether or not they have a functional relevance remains unknown.

      They then proceed to quantify synaptic glycine receptors in the striatum, demonstrating that the ventral striatum has a significantly higher amount of GLRB co-localized with gephyrin than the dorsal striatum or the hippocampus. They then recorded pharmacologically isolated glycinergic miniature inhibitory postsynaptic currents (mIPSCs) from striatal neurons. In line with their structural observations, these recordings confirmed the presence of synaptic glycinergic signaling in the ventral striatum, and an almost complete absence in the dorsal striatum. Together, these findings demonstrate that synaptic glycine receptors in the ventral striatum are present and functional, while an important contribution to dorsal striatal activity is less likely.

      Lastly, the authors use existing mRNA and protein datasets to show that the expression level of GLRA1 across the brain positively correlates with the presence of synaptic GLRB.

      The authors use lentiviral expression of mEos4b-tagged glycine receptor alpha1, alpha2, and beta subunits (GLRA1, GLRA2, GLRB) in cultured hippocampal neurons to investigate the ability of these subunits to cause the synaptic localization of glycine receptors. They suggest that the alpha1 subunit has a higher propensity to localize at the inhibitory postsynapse (labeled via gephyrin) than the alpha2 or beta subunits, and may therefore contribute to the distribution of functional synaptic glycine receptors across the brain.

      Major comments:

      - Are the key conclusions convincing?

      The authors are generally precise in the formulation of their conclusions.

      (1) They demonstrate a very low, but detectable, amount of a synaptically localized glycine receptor subunit in a transgenic (GlrB-mEos4b) mouse model. They demonstrate that the GLRB-mEos4b fusion protein is integrated into the iPSD as determined by gephyrin labelling. The authors do not perform functional tests of these receptors and do not state any such conclusions.

      (2) The authors show that GLRB-mEos4b is clearly detectable in the striatum and integrated into gephyrin clusters at a significantly higher rate in the ventral striatum compared to the dorsal striatum, which is in line with previous studies.

      (3) Adding to their quantification of GLRB-mEos4b in the striatum, the authors demonstrate the presence of glycinergic miniature IPSCs in the ventral striatum, and an almost complete absence of mIPSCs in the dorsal striatum. These currents support the observation that GLRB-mEos4b is more synaptically integrated in the ventral striatum compared to the dorsal striatum.

      (4) The authors show that lentiviral expression of GLRA1-mEos4b leads to a visually higher number of GLR clusters in cultured hippocampal neurons, and a co-localization of some clusters with gephyrin. The authors claim that this supports the idea that GLRA1 may be an important driver of synaptic glycine receptor localization. However, no quantification or statistical analysis of the number of puncta or their colocalization with gephyrin is provided for any of the expressed subunits. Such a claim should be supported by quantification and statistics 

      A thorough analysis and quantification of the data in Fig.5 has been carried out as requested by all the other reviewers (e.g. R1, point 3). The new data and results have been described in the revised manuscript (pp. 9-10, 13-14).

      - Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

      One unaddressed caveat is the fact that a GLRB-mEos4b fusion protein may behave differently in terms of localization and synaptic integration than wild-type GLRB. While unlikely, it is possible that mEos4b interacts either with itself or synaptic proteins in a way that changes the fused GLRB subunit’s localization. Such an effect would be unlikely to affect synaptic function in a measurable way, but might be detected at a structural level by highly sensitive methods such as SMLM and STORM in regions with very low molecule numbers (such as the hippocampus). Since reliable antibodies against GLRB in brain tissue sections are not available, this would be difficult to test. Considering that no functional measures of the hippocampal detections exist, we would suggest that this possible caveat be mentioned for this particular experiment.

      This question has also been raised before (R1, point 5). According to an earlier study the mEos4b-GlyRb knock-in does not cause any obvious phenotypes, with the possible exception of minor loss of glycine potency (Maynard et al. 2021 eLife). The fact that the synaptic levels in the spinal cord in heterozygous animals are precisely half of those of homozygous animals argues against differences in receptor expression, heteropentameric assembly, forward trafficking to the plasma membrane and integration into the synaptic membrane as confirmed using quantitative super-resolution CLEM (Maynard et al. 2021 eLife). Accordingly, we did not observe any behavioural deficits in these animals, making it a powerful experimental model. We have added this information in the revised manuscript (p. 4). 

      In addition, without any quantification or statistical analysis, the authors’ claims regarding the necessity of GLRA1 expression for the synaptic localization of glycine receptors in cultured hippocampal neurons should probably be described as preliminary (Fig. 5).

      As mentioned before, we have substantially revised this part (R1, point 3). The quantification and analysis in the new Fig. 5 support our earlier interpretation.

      - Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      The authors show that there is colocalization of gephyrin with the mEos4b-GlyRβ subunit using the Dual-colour SMLM. This is a powerful approach that allows for a claim to be made on the synaptic location of the glycine receptors. The images presented in Figure 1, together with the distance analysis in Figure 2, display the co-localization of the fluorophores. The co-localization images in all the selected regions, hippocampus and striatum, also show detections outside of the gephyrin clusters, which the authors refer to as extrasynaptic. These punctated small clusters seem to have the same size as the ones detected and assigned as part of the synapse. It would be informative if the authors analysed the distribution, density and size of these nonsynaptic clusters and presented the data in the manuscript and also compared it against the synaptic ones. Validating this extrasynaptic signal by staining for a dendritic marker, such as MAP-2 or maybe a somatic marker and assessing the co-localization with the non-synaptic clusters would also add even more credibility to them being extrasynaptic. 

      The existence of extrasynaptic GlyRs is well attested in spinal cord neurons (e.g. Specht et al. 2013 Neuron; this study see Fig. S2). The fact that these appear as small clusters of detections in SMLM recordings results from the fact that a single fluorophore can be detected several times in consecutive image frames and because of blinking. Therefore, small clusters of detections likely represent single GlyRs (that can be counted), and not assemblies of several receptor complexes. Due to their diffusion in the neuronal membrane, they are seen as diffuse signals throughout the somatodendritic compartment in epifluorescence images (e.g. Fig. 5A). SMLM recordings of the same cells resolve this diffuse signal into discrete nanoclusters representing individual receptors (Fig. 5B). It is not clear what information co-localisation experiments with specific markers could provide, especially in hippocampal neurons, in which the copy numbers (and density) of GlyRs are next to zero.

      In addition we would encourage the authors to quantify the clustering and co-localization of virally expressed GLRA1, GLRA2, and GLRB with gephyrin in order to support the associated claims (Fig. 5). Preferably, the density of GLR and gephyrin clusters (at least on the somatic surface, the proximal dendrites, or both) as well as their co-localization probability should be quantified if a causal claim about subunit-specific requirements for synaptic localization is to be made.

      Quantification of the data has been carried out (new Fig. 5C,D). The results have been described before (R1, point 3) and support our earlier interpretation of the data (pp. 13-14).

      Lastly, even though it may be outside of the scope of such a study analysing other parts of the hippocampal area could provide additional important information. If one looks at the Allen Institute’s ISH of the beta subunit the strongest signal comes from the stratum oriens in the CA1 for example, suggesting that interneurons residing there would more likely have a higher expression of the glycine receptors. This could also be assessed by looking more carefully at the single cell transcriptomics, to see which cell types in the hippocampus show the highest mRNA levels. If the authors think that this is too much additional work, then perhaps a mention of this in the discussion would be good. 

      We have added the requested information from the ISH database of the Allen Institute in the discussion as suggested by the reviewer (p. 12). However, in combination with the transcriptomic data (Fig. S1) our findings strongly suggest that the expression of synaptic GlyRs depends on the availability of alpha subunits rather than on the presence of the GlyRb transcript. This is obvious when one compares the mRNA levels in the hippocampus with those in the basal ganglia (striatum) and medulla. While the transcript concentrations of GlyRb are elevated in all three regions and essentially the same, our data show that the GlyRb copy numbers at synapses differ by more than 2 orders of magnitude (Fig. 1B, Table 1). 

      - Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.

      Since the labeling and some imaging has been performed already, the requested experiment would be a matter of deploying a method of quantification. In principle, it should not require any additional wet-lab experiments, although it may require additional imaging of existing samples.

      - Are the data and the methods presented in such a way that they can be reproduced?

      Yes, for the most part.

      - Are the experiments adequately replicated and statistical analysis adequate?

      Yes

      Minor comments:

      - Specific experimental issues that are easily addressable.

      N/A

      - Are prior studies referenced appropriately?

      Yes

      - Are the text and figures clear and accurate?

      Yes, although quantification in figure 5 is currently not present.

      A quantification has been added (see R1, point 3).

      - Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

      This paper presents a method that could be used to localize receptors and perhaps other proteins that are in low abundance or for which a detailed quantification is necessary. I would therefore suggest that Figure S4 is included into Figure 2 as the first panel, showcasing the demixing, followed by the results. 

      We agree in principle with this suggestion. However, the revised Fig. S4 is more complex and we think that it would distract from the data shown in Fig. 2. Given that Fig. S4 is mostly methodological and not essential to understand the text, we have kept it in the supplement for the time being. We leave the final decision on this point to the editor.

      Reviewer #4-2 (Significance): 

      [This review was supplied later]

      - Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.

      Using a novel and high-resolution method, the authors have provided strong evidence for the presence of glycine receptors in the murine hippocampus and in the dorsal striatum. The number of receptors calculated is small compared to the numbers found in the ventral striatum. This is the first study to quantify receptor numbers in these regions. In addition, it also lays out a roadmap for future studies addressing similar questions. 

      - Place the work in the context of the existing literature (provide references, where appropriate).

      This is done well by the authors in the curation of the literature. As stated above, the authors have filled a gap in the presence of glycine receptors in different brain regions, a subject of importance in understanding the role they play in brain activity and function. 

      - State what audience might be interested in and influenced by the reported findings.

      Neuroscientists working at the synaptic level, on inhibitory neurotransmission and on fundamental mechanisms of expression of genes at low levels and their relationship to the presence of the protein would be interested. Furthermore, researchers in neuroscience and cell biology may benefit from and be inspired by the approach used in this manuscript, to potentially apply it to address their own aims. 

      We thank the reviewer for the positive assessment of the technical and biological implications of our work, as well as the interest of our findings to a wide readership of neuroscientists and cell biologists. 

      - Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

      Synaptic transmission, inhibitory cells and GABAergic synapses functionally and structurally, cortex and cortical circuits. No strong expertise in super-resolution imaging methods.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This very thorough anatomical study addresses the innervation of the Drosophila male reproductive tract. Two distinct glutamatergic neuron types were classified: serotonergic (SGNs) and octopaminergic (OGNs). By expansion microscopy, it was established that glutamate and serotonin/octopamine are co-released. The expression of different receptors for 5-HT and OA in muscles and epithelial cells of the innervation target organs was characterized. The pattern of neurotransmitter receptor expression in the target organs suggests that seminal fluid and sperm transport and emission are subjected to complex regulation. While silencing of abdominal SGNs leads to male infertility and prevents sperm from entering the ejaculatory duct, silencing of OGNs does not render males infertile. 

      Strengths: 

      The studied neurons were analysed with different transgenes and methods, as well as antibodies against neurotransmitter synthesis enzymes, building a consistent picture of their neurotransmitter identity. The careful anatomical description of innervation patterns together with receptor expression patterns of the target organs provides a solid basis for advancing the understanding of how seminal fluid and sperm transport and emission are subjected to complex regulation. The functional data showing that SGNs are required for male fertility and for the release of sperm from the seminal vesicle into the ejaculatory duct is convincing. 

      Weaknesses: 

      The functional analysis of the characterized neurons is not as comprehensive as the anatomical description, and phenotypic characterization was limited to simple fertility assays. It is understandable that a full functional dissection is beyond the scope of the present work. The paper contains experiments showing neuron-independent peristaltic waves in the reproductive tract muscles, which are thematically not very well integrated into the paper. Although very interesting, one wonders if these experiments would not fit better into a future work that also explores these peristaltic waves and their interrelation with neuromodulation mechanistically. 

      Reviewer #2 (Public review): 

      Summary: 

      Cheverra et al. present a comprehensive anatomical and functional analysis of the motor neurons innervating the male reproductive tract in Drosophila melanogaster, addressing a gap in our understanding of the peripheral circuits underlying ejaculation and male fertility. They identify two classes of multi-transmitter motor neurons, OGNs (octopamine/glutamate) and SGNs (serotonin/glutamate), with distinct innervation patterns across reproductive organs. The authors further characterize the differential expression of glutamate, octopamine, and serotonin receptors in both epithelial and muscular tissues of these organs. Behavioral assays reveal that SGNs are essential for male fertility, whereas OGNs and glutamatergic transmission are dispensable. This work provides a high-resolution map linking neuromodulatory identity to organ-specific motor control, offering a valuable framework to explore the neural basis of male reproductive function. 

      Strengths: 

      Through the use of an extensive set of GAL4 drivers and antibodies, this work successfully and precisely defines the neurons that innervate the male reproductive tract, identifying the specific organs they target and the nature of the neurotransmitters they release. It also characterizes the expression patterns and localization of the corresponding neurotransmitter receptors across different tissues. The authors describe two distinct groups of dual-identity neurons innervating the male reproductive tract: OGNs, which co-express octopamine and glutamate, and SGNs, which co-express serotonin and glutamate. They further demonstrate that the various organs within the male reproductive system differentially express receptors for these neurotransmitters. Based on these findings, the authors propose that a single neuron capable of co-releasing a fast-acting neurotransmitter alongside a slower-acting one may more effectively synchronize and stagger events that require precise timing. This, together with the differential expression of ionotropic glutamate receptors and metabotropic aminergic receptors in postsynaptic muscle tissue, adds an additional layer of complexity to the coordinated regulation of fluid secretion, organ contractility, and directional sperm movement, all contributing to the optimization of male fertility. 

      Weaknesses: 

      The main weakness of the manuscript is the lack of detail in the presentation of the results. Specifically, all microscopy image figures are missing information about the number of samples (N), and in the case of colocalization experiments, quantitative analyses are not provided. Additionally, in the first behavioral section, it would be beneficial to complement the data table with figures similar to those presented later in the manuscript for consistency and clarity. 

      Wider context: 

      This study delivers the first detailed anatomical map connecting multi-transmitter motor neurons with specific male reproductive structures. It highlights a previously unrecognized functional specialization between serotonergic and octopaminergic pathways and lays the groundwork for exploring fundamental neural mechanisms that regulate ejaculation and fertility in males. The principles uncovered here may help explain how males of Drosophila and other organisms adjust reproductive behaviors in response to environmental changes. Furthermore, by shedding light on how multi-transmitter systems operate in reproductive control, this model could provide insights into therapeutic targets for conditions such as male infertility and prostate cancer, where similar neuronal populations are involved in humans. Ultimately, this genetically accessible system serves as a powerful tool for uncovering how multi-transmitter neurons orchestrate coordinated physiological actions necessary for the functioning of complex organs. 

      Reviewer #3 (Public review): 

      Summary: 

      This work provides an overview of the motor neuron landscape in the male reproductive system. Some work had been done to elucidate the circuits of ejaculation in the spine, as well as the cord, but this work fills a gap in knowledge at the level of the reproductive organs. Using complementary approaches, the authors show that there are two types of motor neurons that are mutually exclusive: neurons that co-express octopamine and glutamate and neurons that co-express serotonin and glutamate. They also show evidence that both types of neurons express large dense core vesicles, indicating that neuropeptides play a role in male fertility. This paper provides a thorough characterization of the expression of the different glutamate, octopamine, and serotonin receptors in the different organs and tissues of the male reproductive system. The differential expression in different tissues and organs allows building initial theories on the control of emission and expulsion. Additionally, the authors characterize the expression of synaptic proteins and the neuromuscular junction sites. On a mechanistic level, the authors show that neither octopamine/glutamate neuron transmission nor glutamate transmission in serotonin/glutamate neurons is required for male fertility. This final result is quite surprising and opens up many questions on how ejaculation is coordinated. 

      Strengths: 

      This work fills an important gap in the characterization of innervation of the male reproductive system by providing an extensive characterization of the motor neurons and the potential receptors of motor neuron release. The authors show convincing evidence of glutamate/monoamine co-release and of mutual exclusivity of serotonin/glutamate and octopamine/glutamate neurons. 

      Weaknesses: 

      (1) Often, it is mentioned that the expression is higher or lower or regional without quantification or an indication of the number of samples analysed. 

      (2) The experiment aimed at tracking sperm in the male reproductive system is difficult to interpret when it is not assessed whether ejaculation has occurred. 

      (3) The experiment looking at peristaltic waves in the male organs is missing labeling of the different regions and quantification of the observed waves. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) While the peripheral innervations are very carefully described, it is not clear to which SGNs and OGNs (i.e., cell bodies in the central nervous system) these innervations belong. Are SV, AG, and ED innervated by branches of one neuron or by separate neurons? Multi-color flip-out experiments could provide an answer to this. 

      We agree this is important and are planning these experiments for a follow-up study.

      (2) In contrast, for the analysis of the VT19028 split line (Figure 9), only vnc and cell body images are shown. How do the arborisations of these split combinations look in the periphery? Are the same reproductive organs innervated as shown in Figure 2?

      Figure 9S3 was inadvertently omitted from the initial submission. That figure is now included and shows that the VT019028 split broadly innervates the SV, AG, and ED.

      (3) In the discussion, I think it would be helpful to offer some potential explanations for the role of octopaminergic and glutamatergic signaling. If not required for basic fertility, they probably have some other role.

      Thank you. We have included speculation on this in the Discussion section "Potential for adaptation to environment".

      (4) Line 543: Figure 8S4 E, (not 8E). 

      Correction made.

      Reviewer #2 (Recommendations for the authors): 

      (1) Line 213-217 

      Comment:

      The use of "significantly less expression" may be misleading, as no quantification or statistical analysis is provided to support this comparison. 

      Suggestion:

      Consider using a more neutral term, such as "markedly less" or "noticeably less," unless quantitative data and statistical analysis are included to substantiate the claim.

      Good recommendation. This suggestion has been incorporated.

      (2) Line 264-267 

      Comment:

      The observation regarding the distinct morphology of SGNs and OGNs is interesting and could strengthen the argument regarding functional differences. 

      Suggestion: 

      Consider including a quantification of morphological complexity (e.g., branching) to support the claim. A method such as Sholl analysis (Sholl, 1953), as adapted in Fernández et al., 2008, could be applied. 

      This is a good suggestion, and we will consider it as part of a follow-up study.
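      For readers unfamiliar with the method, the core of a Sholl analysis can be sketched as counting how many traced neurite segments cross each of a series of concentric circles around a chosen center. The sketch below is illustrative only: the function name, the toy tracing, and the radii are hypothetical, and it simplifies by treating a segment as crossing a circle when its two endpoints lie on opposite sides of it (so a long segment that enters and leaves the same circle would be missed).

```python
import math


def sholl_counts(segments, center, radii):
    """Count how many traced segments cross each concentric circle.

    segments: list of ((x1, y1), (x2, y2)) line segments from a tracing.
    center:   (x, y) of the reference point (e.g., where the axon enters).
    radii:    iterable of circle radii, in the same units as the tracing.
    """
    cx, cy = center
    counts = []
    for r in radii:
        crossings = 0
        for (x1, y1), (x2, y2) in segments:
            d1 = math.hypot(x1 - cx, y1 - cy)
            d2 = math.hypot(x2 - cx, y2 - cy)
            # Simplification: a crossing means one endpoint is inside the
            # circle and the other is on or outside it.
            if (d1 < r) != (d2 < r):
                crossings += 1
        counts.append(crossings)
    return counts


# Hypothetical toy tracing: one trunk that splits into two branches.
toy_segments = [
    ((0, 0), (0, 2)),   # trunk
    ((0, 2), (-2, 4)),  # left branch
    ((0, 2), (2, 4)),   # right branch
]

print(sholl_counts(toy_segments, center=(0, 0), radii=[1, 3, 5]))
# prints [1, 2, 0]
```

      A more branched arbor yields higher counts at intermediate radii, which is the kind of number that could support a comparison of morphological complexity between two neuron classes.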

      (3) Line 269-271 

      Comment:

      The anatomical context of the observation is not explicitly stated. 

      Suggestion:

      Add "in the ED" for clarity: "With the TRH-GAL4 experiment in the ED, vGlut-40XMYC (Figure 5S1, A and E) and 6XV5-vMAT (Figure 5S1, B and F) were both present with a highly overlapping distribution (Figure 5S1, I)." 

      Suggestion has been incorporated.

      (4) Line 275-276 

      Comment:

      The claim about the reduced ability to distinguish SGNs and OGNs in the ED would benefit from quantitative support. 

      Suggestion:

      Include a morphological comparison or quantification between SGNs and OGNs in the ED and SV to reinforce this point.

      Some information on morphological differences can be inferred from the images themselves, and we will include quantification in a follow-up study.

      (5) Line 277-279 

      Comment:

      As with line 269, the anatomical site could be specified more clearly. 

      Suggestion: 

      Rephrase as: "With the Tdc2-GAL4 experiment in the ED, vGlut-40XMYC (Figure 5S1, M and Q) and 6XV5-vMAT (Figure 5S1, N and R) were both observed in a highly overlapping distribution (Figure 5S1, U)." 

      Suggestion has been incorporated.

      (6) Line 348-350 

      Comment:

      The phrase "significantly higher density" implies a statistical comparison that is not shown. 

      Suggestion:

      If no quantification is provided, replace with a qualitative term such as "visibly higher" or "notably more dense." Alternatively, add a quantitative analysis with statistical testing to justify the use of "significantly." 

      Suggestion has been incorporated.

      (7) Lines 415-458 (Section comment) 

      Comment:

      There appears to be differential localization of neurotransmitter receptor expression (glutamate in muscle vs. 5-HT in epithelium or neurons), which could have functional implications. 

      Suggestion:

      Expand this section to briefly discuss the differential localization patterns of these receptors and potential implications for signal transduction in male reproductive tissues. 

      (8) Lines 638-682 (Section comment) 

      Comment:

      The table summarizing fertility phenotypes would be more informative with additional detail on experimental outcomes. 

      Suggestion:

      Add a column showing the number of fertile males over the total tested (e.g., "n fertile / n total"). Also, clarify whether the fertility assays are identical to those reported in Figure 10S2, and whether similar analyses were conducted for females. Consider including a figure summarizing fertility results for all genotypes listed in the table, similar to Figure 10S2. 

      The fertility tests reported in Table 1 were separate from those reported in Figure 10S2. For these tests, the results were clear-cut: 100% of the males and females reported as infertile exhibited the infertile phenotype, and nearly 100% of those reported as fertile showed a high level of fertility. In subsequent figures we attempted to assess degrees of fertility.

      (9) Line 724-727 

      Comment:

      There seems to be a mistake in the identification of the driver lines used to silence OA neurons. Also, figure references might be incorrect. 

      Suggestion:

      The OA neuron driver line should be corrected to "Tdc2-GAL4-DBD ∩ AbdB-AD" instead of TRH-GAL4. Additionally, the figure references should be verified; specifically, the letter "B" (in "Figure 10B, D" and "10B, E") appears to be unnecessary or misplaced.

      Thanks for catching this; the corrections have been made.

      (10) Line 872-877 

      Comment:

      The discussion on the co-release of fast-acting glutamate and slower aminergic neurotransmitters is interesting and well-articulated. However, it remains somewhat disconnected from the behavioral findings. 

      Suggestion:

      Consider linking this proposed mechanism to the results observed in the mating duration assays. For instance, the sequential action of neurotransmitters described here could potentially underlie the prolonged mating observed when specific neuromodulators are active, helping to functionally integrate molecular and behavioral data. 

      (11) Line 926-928 

      Comment:

      The interpretation of 5-HT7 receptor expression in the sphincter is compelling, suggesting a role in regulating its function. However, this anatomical observation could be further contextualized with the functional data. 

      Suggestion:

      It may strengthen the interpretation to explicitly connect this finding with the fertility assays, where SGNs - presumably acting via serotonergic signaling - are shown to be necessary for male fertility. This would support a functional role for 5-HT7 in reproductive success via sphincter regulation.

      This has been added. 

      (12) Figure 1 

      Comment:

      The figure legend is generally clear, but could benefit from more consistency and precision in the color-coded labeling. Additionally, the naming of some structures could be more explicit. 

      Suggestion: 

      Revise the figure and the legend as follows:

      Figure 1. The Drosophila male reproductive system. A) Schematic diagram showing paired testes (colour), SVs (green), AGs (purple), Sph (red), ED (gray), and EB (colour). B) Actual male reproductive system. Te - testes, SV - seminal vesicle, AG - accessory gland, Sph - singular sphincter, ED - ejaculatory duct, EB - ejaculatory bulb. Scale bar: 200 µm.

      This suggestion has been incorporated.

      (13) Figure 3S2 

      Comment:

      There appears to be a typographical error in the description of the genotypes, which may lead to confusion. 

      Suggestion:

      Correct the legend to reflect the appropriate genotypes:

      Figure 3S2. Expression of vGlut-LexA and Tdc2-GAL4 in the Drosophila male reproductive system. A, D, G, J, M, P) vGlut-LexA, LexAop-6XmCherry; B, E, H, K, N, Q) Tdc2-GAL4, UAS-6XGFP; C, F, I, L, O, R) Overlay. Scale bars: O - 50 µm; R - 10 µm.

      The corrections have been made.

      (14) Figure 3S3

      Comment:

      The genotypes for panels D and E appear to be incomplete; the DBD component of the split-GAL4 drivers is missing. 

      Suggestion:

      Update the figure legend to: 

      Figure 3S3. Fruitless and Doublesex expression in the Drosophila male reproductive system. A) fru-GAL4, UAS-6XGFP; B) vGlut-LexA, LexAop-6XmCherry; C) Overlay; D) Tdc2-AD ∩ dsx-GAL4-DBD; E) TRH-AD ∩ dsx-GAL4-DBD. Scale bar: 200 µm.

      The corrections have been made.

      (15) Figure 4S4 

      Comment: 

      There is a repeated segment in the figure legend, which makes it unclear and redundant. 

      Suggestion:

      Edit the legend to remove the duplicated lines: 

      Figure 4S4. Expression of vGlut, TβH-GFP, and 5-HT at the junction of the SV and AGs with the ED of the Drosophila male reproductive system. A) vGlut-40XV5; B) TβH-GFP; C) 5-HT; D) vGlut-40XV5, TβH-GFP overlay; E) vGlut-40XV5, 5-HT overlay; F) TβH-GFP, 5-HT overlay. Scale bar: 50 µm.

      The correction has been made.

      (16) Figure 6S5 

      Comment:

      Within this figure, the orientation and/or scale of the tissue varies noticeably between individual panels, making it difficult to directly compare the different experimental conditions. 

      Suggestion:

      For improved clarity and interpretability, consider standardizing the orientation and size of the tissue shown across all panels within the figure. Consistent presentation will facilitate direct comparisons between treatments or genotypes. 

      There is often variation in the size of the male reproductive organs. All images were acquired at the same magnification. The only point of this figure is that there is no vGAT or vAChT at these NMJs, and the result is unambiguously negative.

      (17) Figure 10 

      Comment:

      Panel A appears redundant, as it shows the same information as the other panels but without indicating statistical significance. 

      Suggestion:

      Consider removing panel A and keeping only the remaining four graphs, which include relevant statistical comparisons and clearly show significant differences.

      We realize there is some redundancy of panel A with the other panels, but we feel there is value in having all the genotypes in a single panel for comparison.

      Reviewer #3 (Recommendations for the authors): 

      Here are some suggestions to improve the manuscript: 

      (1) Prot B GFP experiment: the authors should explain better the time chosen to look at the sperm content of the male reproductive system. At 10 minutes, it is expected that the male has already ejaculated, and therefore, a failure to ejaculate would result in more sperm in the reproductive system, not less. Since we are not certain when the male ejaculates, it would be important to do the analysis at different time points.

      In the Prot-GFP experiments, the 10-minute time point was chosen because we nearly always observe sperm in the ejaculatory duct of control males. In the experimental males, we never observed sperm in the ejaculatory duct at this time point. Also, no Prot-GFP sperm were observed in the reproductive tract of females mated to experimental males, even when mating was allowed to go to completion, while abundant sperm were found in females mated to Prot-GFP controls. Figure 10S1 has been updated to include images of these female reproductive systems. The absence of Prot-GFP sperm in the reproductive tract of females mated to experimental males indicates that sperm transfer in these males is not occurring earlier during copulation than in control males, and that we did not miss it by examining only the ejaculatory duct.

      (2) Discuss what may be the role of the octopamine/glutamate neurons and glutamate transmission in serotonin/glutamate neurons in the male reproductive system, given that they are not required for fertility (at least under the context in which it was tested). It is quite a striking result that deserves some attention. 

      We agree it is a surprising result and have included speculation on the role of glutamate and octopamine in male reproduction in the Discussion section "Potential for adaptation to environment".

      (3) Very important: 

      (a) Figure 3 is present in the Word document but not the PDF. 

      (b) Figure 9S3 is not present 

      (c) In Figure 5 X), the legend does not correspond to the panel.

      All of these corrections have been made. 

      (4) Other suggestions:

      (a) A summary schematic (or several) of the findings would make it an easier read.

      (b) Explain why the ejaculatory bulb was left out of the analysis.

      (c) Explain in the main text some of the tools, such as BONT-C and the conditional vGlut mutation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      In this paper, the authors developed a chemical labeling reagent for P2X7 receptors, called X7-uP. This labeling reagent selectively labels endogenous P2X7 receptors with biotin based on ligand-directed NASA chemistry (Ref. 41). After labeling the endogenous P2X7 receptor with biotin, the receptor can be fluorescently labeled with streptavidin-AlexaFluor647. The authors carefully examined the binding properties and labeling selectivity of X7-uP to P2X7, characterized the labeling site of P2X7 receptors, and demonstrated fluorescence imaging of P2X7 receptors. The data obtained by SDS-PAGE, Western blot, and fluorescence microscopy clearly show that X7-uP labels the P2X7 receptor. Finally, the authors fluorescently labeled the endogenous P2X7 in BV2 cells, which are a murine microglia model, and used dSTORM to reveal a nanoscale P2X7 redistribution mechanism under inflammatory conditions at high resolution. 

      Strengths: 

      X7-uP selectively labels endogenous P2X7 receptors with biotin. Streptavidin-AlexaFluor647 binds to the biotin labeled to the P2X7 receptor, allowing visualization of endogenous P2X7 receptors. 

      We thank the reviewer for their positive comment.

      Weaknesses: 

      Weaknesses & Comments 

      (1) The P2X7 receptor exists in a trimeric form. If it is not a monomer under the conditions of the pull-down assay in Figure 2C, the quantitative values may not be accurate. 

      We thank the reviewer for this comment. As shown in Figure 2C, the band observed on the denaturing SDS-PAGE corresponds to the monomeric form of the P2X7 receptor. While we cannot exclude the presence of non-monomeric species under native conditions, no such higher-order forms are visible in the gel. This observation supports the conclusion that the quantitative values presented are based on the monomeric form and are therefore reliable.

      (2) In Figure 3, GFP fluorescence was observed in the cell. Are all types of P2X receptors really expressed on the cell surface?

      We thank the reviewer for this excellent comment, which was also raised by reviewer 2. To address this concern, we performed a commercial cell-surface protein biotinylation assay to assess whether GFP-tagged P2X receptors reach the plasma membrane. As expected, all P2X subtypes except P2X6 were detected at the cell surface in HEK293T cells, thereby validating our confocal fluorescence microscopy assay. These new data are now included in Figure 3 — figure supplement 1.

      (3) The reviewer was not convinced of the advantages of the approach taken in this paper, because the endogenous receptor labeling in this study could also be done using conventional antibody-based labeling methods. 

      We thank the reviewer for raising this important point and would like to highlight several advantages of our approach compared to conventional antibody-based labeling.

      First, commercially available P2X7 antibodies often suffer from poor specificity and are generally not suitable for reliably detecting endogenous P2X7 receptors, as documented in previous studies (e.g., PMID: 16564580 and PMID: 15254086). While recent advances have been made using nanobodies with improved specificity for P2X7 (e.g., PMID: 30074479 and PMID: 38953020), our strategy is distinct and complementary to nanobody-based approaches.

      Second, antibodies rely on non-covalent interactions with the receptor, which can result in dissociation over time. In contrast, our X7-uP probe covalently biotinylates lysine residues on the P2X7 receptor through stable amide bond formation. This covalent labeling ensures that the biotin moiety remains permanently attached, an advantage not afforded by reversible binding strategies.

      Third, by selectively biotinylating P2X7 receptors, our method provides a versatile platform for the chemical attachment of a wide range of probes or functional moieties. Although we did not demonstrate this application in the current study, we believe this modularity represents an additional advantage of our approach.

      We have now revised the discussion to highlight these key advantages, allowing the reader to form their own opinion. We hope this addresses the reviewer’s concerns and clarifies the benefits of our approach.

      (4) Although P2X7 was successfully labeled in this paper, the chemistry itself is not new. A more compelling functional evaluation is needed, such as live trafficking analysis of endogenous P2X7.

      We agree with the reviewer that the underlying chemistry is not novel per se. However, to our knowledge, it has not previously been applied to the P2X7 receptor, and thus constitutes a novel application with specific relevance for studying native P2X7 biology.

      We also appreciate the reviewer’s suggestion regarding live trafficking analysis of endogenous P2X7. While this is indeed a valuable and interesting direction, we believe it lies beyond the scope of the present study, as it would first require demonstrating that the labeling itself does not affect P2X7 function (see below). This important step would necessitate additional experiments, which we consider more appropriate for a follow-up investigation.

      (5) The reviewer has concerns that the use of the large-size streptavidin to label the P2X7 receptor may perturb the dynamics of the receptor.

      We thank the reviewer for raising this important point. Although we did not directly measure receptor dynamics, it is indeed possible that tetrameric streptavidin (tStrept-A 647) could promote P2X7 clustering by cross-linking nearby receptors due to its tetravalency (see also point 7 raised by the reviewer). To address this concern, we performed additional dSTORM experiments using a monomeric form of streptavidin-Alexa 647 (mSA) (see PMID: 26979420). Owing to its reduced size and lack of tetravalency, mSA has been shown to minimize artificial crosslinking of synaptic receptors (PMID: 26979420). A drawback of using mSA, however, is that the monomeric form carries only two fluorophores (estimated degree of labeling, DOL ≈ 2, PMID: 26979420), whereas the tetrameric form, according to the manufacturer’s certificate of analysis (Invitrogen S21374), has an average DOL of three fluorophores per monomer, resulting in a total of ~12 fluorophores per streptavidin.

      We tested three conditions with mSA incubation: (i) control BV2 cells (without X7-uP), (ii) untreated X7-uP-labeled BV2 cells, and (iii) X7-uP-labeled BV2 cells treated with LPS and ATP (using the same concentrations and incubation times described in the manuscript). As shown in Author response image 1, only LPS+ATP treatment induced a clear increase in the mean cluster density compared to quiescent (untreated) BV2 cells. This effect closely matches the results obtained with tStrept-A 647, supporting the conclusion that tetrameric streptavidin does not artificially promote P2X7 clustering. It is also possible that the cellular environment of BV2 microglia differs from the confined architecture of synapses, which may further explain why cross-linking effects are less pronounced in our system.

      As expected, the overall fluorescence signal with mSA was about tenfold lower than with tStrept-A 647, consistent with the expected fluorophore stoichiometry. This lower signal may explain why the values for the untreated condition appeared slightly higher than for the control, although the difference was not statistically significant (P = 0.1455).
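      As a rough check of the stoichiometry quoted above, the expected fluorophore load per bound streptavidin can be worked out from the two degree-of-labeling (DOL) figures. This is a back-of-the-envelope sketch, not part of the reported analysis: the function name is hypothetical, and it assumes one streptavidin per biotin and equal brightness per fluorophore, neither of which is stated in the response.

```python
def fluorophores_per_streptavidin(dol_per_monomer, monomers):
    """Total fluorophores carried by one streptavidin molecule."""
    return dol_per_monomer * monomers


# Values quoted in the response: DOL of about 3 per monomer for the tetramer,
# DOL of about 2 for the monomeric form.
tetramer = fluorophores_per_streptavidin(dol_per_monomer=3, monomers=4)
monomer = fluorophores_per_streptavidin(dol_per_monomer=2, monomers=1)

# Assumption: signal scales linearly with fluorophore count per bound molecule.
expected_fold = tetramer / monomer

print(tetramer, monomer, expected_fold)
# prints 12 2 6.0
```

      This per-molecule ratio is only a rough expectation; the measured signal also depends on binding efficiency and fluorophore photophysics.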

      We hope these additional experiments adequately address the reviewer’s concerns.

      Author response image 1.

      BV2 labeling with monomeric streptavidin–Alexa 647 (mSA). (A) Bright-field and dSTORM images of BV2 cells labeled with mSA in the presence (untreated and LPS+ATP) or absence (control) of 1 µM X7-uP. Treatment: LPS (1 µg/mL for 24 hours) and ATP (1 mM for 30 minutes). Scale bars, 10 µm. Insets: Magnified dSTORM images. Scale bars, 1 µm. (B) Quantification of the number of localizations (n = 2 independent experiments). Bars represent mean ± s.e.m. One-way ANOVA with Tukey’s multiple comparisons (P values are indicated above the graph).

      (6) It is better to directly label Alexa647 to the P2X7 receptor to avoid functional perturbation of P2X7. 

      Directly labeling the P2X7 receptor with Alexa647 would require the design and synthesis of a novel probe, which is currently not available. Implementing such a strategy would involve substantial new experimental work that lies beyond the scope of the present study.

      (7) In all imaging experiments, the addition of streptavidin, which acts as a cross-linking agent, may induce P2X7 receptor clustering. This concern would be dispelled if the receptors were labeled with a fluorescent dye instead of biotin and observed. 

      We refer the reviewer to our response in point 5, where we addressed this concern by comparing tetrameric and monomeric streptavidin conjugates. As noted above (see also point 6), directly labeling the receptor with a fluorescent dye would require the development of a new probe, which is outside the scope of the present study.

      (8) There are several mentions of microglia in this paper, even though they are not used. This can lead to misunderstanding for the reader. The author conducted functional analysis of the P2X7 receptor in BV-2 cells, which are a model cell line but not microglia themselves. The text should be reviewed again and corrected to remove the misleading parts that could lead to misunderstanding. e.g. P8. lines 361-364

      First, it combines N-cyanomethyl NASA chemistry with the high-affinity AZ10606120 ligand, enabling rapid labeling in microglia (within 10 min)

      P8. lines 372-373 

      Our results not only confirm P2X7 expression in microglia, as previously reported (6, 26-33), but also reveal its nanoscale localization at the cell surface using dSTORM. 

      We agree with the reviewer’s comment. We have now modified the text, including the title.

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, Arnould et al. develop an unbiased, affinity-guided reagent to label the P2X7 receptor and use super-resolution imaging to monitor P2X7 redistribution in response to inflammatory signaling. 

      Strengths: 

      I think the X7-uP probe that they developed is very useful for visualizing localization of the P2X7 receptor. They convincingly show that under inflammatory conditions, there is a reorganization of P2X7 localization into receptor clusters. Moreover, I think they have shown a very clever way to specifically label any receptor of interest. This has broad appeal.

      We thank the reviewer for their positive comment.

      Weaknesses: 

      Overall, the manuscript is novel and interesting. However, I do have some suggestions for improvement. 

      (1) While the authors state that chemical modification of AZ10606120 to produce the X7-uP reagent has "minimal impact" on the inhibition of P2X7, we can see from Figure 2A and 2B that it does not antagonize P2X7 as effectively as the original antagonist. For the sake of completeness and quantitation, I think it would be great if the authors could determine the IC50 for X7-uP and compare it to the IC50 of AZ10606120. 

      We thank the reviewer for this insightful comment. Unfortunately, due to the limited availability of X7-uP, we were not able to establish a complete concentration–response curve to determine its IC<sub>50</sub>, which would require testing at concentrations >1 µM. Nevertheless, to estimate the effect of the modification, we assessed current inhibition at 300 µM X7-uP and compared it with the reported IC<sub>50</sub> of AZ10606120 (10 nM). Under these conditions, both compounds produced a similar level of inhibition, indicating that while the chemical modification reduces potency relative to AZ10606120, X7-uP still functions as an effective probe for P2X7. We have now included these data in Figure 2 and revised the text accordingly.

      (2) Do the authors know whether modification of the lysines with biotin affects the receptor's affinity for ATP (or ability to be activated by ATP)? What about P2X7 that has been modified with biotin and then labeled with Alexa 647? For the sake of completeness and quantitation, I think it would be great if the authors could determine the EC50 of biotinylated P2X7 for ATP as well as biotinylated and then Alexa 647 labeled P2X7 for ATP and compare these values to the affinity of unmodified WT P2X7 for ATP.

      We thank the reviewer for raising this important point. At present, we have not determined whether modification of lysine residues with biotin, or subsequent labeling with Alexa647, affects the ATP sensitivity or functional properties of P2X7. However, we believe this does not impact the conclusions of the current study, as all functional assays were conducted prior to X7-uP labeling. The labeling is used here as a terminal "snapshot" to visualize the endogenous receptor without interfering with the functional characterization.

      We fully agree that assessing the functional integrity of P2X7 following biotinylation and fluorophore labeling—such as by determining the EC<sub>50</sub> for ATP—would be essential for studies involving dynamic or post-labeling functional analyses, such as live trafficking. However, as noted earlier in our response to Reviewer 1 (point 4), these experiments lie beyond the scope of the current study.

      (3) It is a little misleading to color the fluorescence signal from mScarlet green (for example, in Figure 3 and Figure 4). The fluorescence is not at the same wavelength as GFP. In fact, the wavelength (570 nm - 610 nm) for emission is closer to orange/red than to green. I think this color should be changed to differentiate the signal of mScarlet from the GFP signal used for each of the other P2X receptor subtypes. 

      As suggested, we changed the mScarlet color to orange for all relevant figures.

      (4) It is my understanding that P2X6 does not form homotrimers. Thus, I was a little surprised to see that the density and distribution of P2X6-GFP in Figure 3 looks very similar to the density and distribution of the other P2X subtypes. Do the authors have an explanation for this? Are they looking at P2X6 protomers inserted into the plasma membrane? Does the cell line have endogenous P2X receptor subtypes? Is Figure 3 showing heterotrimers with P2X6 receptor? A little explanation might be helpful.

      We thank the reviewer for raising this important point. Indeed, it is well established that P2X6 does not form functional channels, which supports the conclusion that it does not form homotrimeric complexes. Although previous studies have shown that P2X6–GFP expression is generally lower, more diffuse, and not efficiently targeted to the cell surface compared with other P2X subtypes (see PMID: 12077178), the similar fluorescence distribution and density observed in our Figure 3 do not imply that P2X6 forms homotrimers.

      We did not directly assess the presence of endogenous P2X6 in our HEK293T cells; however, according to the Human Protein Atlas, there is no detectable P2X6 RNA expression in HEK293 cells (nTPM = 0), indicating that endogenous P2X6 is not expressed in this cell line. To further investigate surface expression (see also point 2 of reviewer 1), we performed a commercial cell-surface protein biotinylation assay to assess whether GFP-tagged P2X6 reaches the plasma membrane. As expected, P2X6 was not detected at the cell surface in HEK293T cells, whereas GFP-tagged P2X1 to P2X5 were readily detected. These results further support the conclusion that P2X6 does not insert into the plasma membrane as a homotrimer, thereby validating our confocal fluorescence microscopy assay. These new data are now included in Figure 3 — figure supplement 1.

      (5) It is easy to overlook the fact that the antagonist leaves the binding pocket once the biotin has been attached to the lysines. It might be helpful if the authors made this a little more apparent in Figure 1 or in the text describing the NASA chemistry reaction.

      We thank the reviewer for this insightful suggestion. To address this, we have modified Figure 1A and updated the legend.

      Reviewer #3 (Public review): 

      Summary: 

      This manuscript describes the development of a covalent labeling probe (X7-uP) that selectively targets and tags native P2X7 receptors at the plasma membrane of BV2 microglial cells. Using super-resolution imaging (dSTORM), the authors demonstrate that P2X7 receptors form nanoscale clusters upon microglial activation by lipopolysaccharide (LPS) and ATP, correlating with synergistic IL-1β release. These findings advance understanding of P2X7 reorganization during inflammation and provide a generalizable labeling strategy for monitoring endogenous P2X7 in immune cells. 

      Strengths: 

      (1) The authors designed X7-uP by coupling a high-affinity, P2X7-specific antagonist (AZ10606120) with N-cyanomethyl NASA chemistry to achieve site-directed biotinylation. This approach offers high specificity, minimal off-target reactivity, and a straightforward pull-down/imaging readout. 

      (2) The results connect P2X7's nanoscale clustering directly with IL-1β secretion in microglia, reinforcing the role of P2X7 in inflammation. By localizing endogenous P2X7 at single-molecule resolution, the authors reveal how LPS priming and ATP stimulation synergistically reorganize the receptor. 

      (3) The authors systematically validate their method in recombinant systems (HEK293 cells) and in BV2 cells, showing selective inhibition, mutational confirmation of the binding site, and Western blot pulldown experiments.

      We thank the reviewer for their positive comment.

      Weaknesses: 

      (1) While the data strongly indicate that P2X7 clustering contributes to IL-1β release, the manuscript would benefit from additional experiments (if feasible) or discussion on how receptor clustering interfaces with downstream inflammasome assembly. Clarification of whether the P2X7 clusters physically colocalize with known inflammasome proteins would solidify the mechanism. 

      We thank the reviewer for this valuable suggestion. Determining the physical colocalization of P2X7 clusters with known inflammasome components would provide important insight into the molecular partners involved in inflammasome activation. However, we believe that such an investigation would constitute a substantial study on its own and therefore lies beyond the scope of the present work.

      Nevertheless, in response to the reviewer’s suggestion, we have added a short paragraph at the end of the Discussion section addressing potential mechanisms by which P2X7 clustering may contribute to downstream inflammasome activation. We also revised the text to tone down the hypothesis of physical colocalization.

      (2) The authors might expand on the scope of X7-uP in other native cells that endogenously express P2X7 (e.g., macrophages, dendritic cells). Although they mention the possibility, demonstrating the probe's applicability in at least one other primary immune cell type would strengthen its general utility. 

      We thank the reviewer for this valuable suggestion. Again, we believe that such an investigation would constitute a substantial study on its own and therefore lies beyond the scope of the present work.

      (3) The authors do include appropriate negative controls, yet providing additional details (e.g., average single-molecule on-time or blinking characteristics) in supplementary materials could help readers assess cluster calculations. 

      As suggested, we have included additional data showing single-molecule blinking events in untreated and LPS+ATP-treated BV2 cells, along with the corresponding movies. The data are now presented in Figure 5—figure supplement 3A and B and Figure 5—Videos 1 and 2.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors): 

      (1) On line 96, the authors refer to the "ballast" domain of P2X7 receptor but do not cite the original article from which this nomenclature originated (McCarthy et al., 2019, Cell). This article should be cited to give appropriate credit. 

      Done.

      (2) On line 602, the authors state that they use models from PDB 1MK5 and 6U9W to generate the cartoons in Figure 6. The manuscripts from which these PDB files were generated need to be appropriately cited. 

      Done.

      (3) On line 319, the authors say "300 mM BzATP" but I think they mean 300 uM.

      Done. Thank you for catching the typo.

      Reviewer #3 (Recommendations for the authors): 

      Overall, excellent data quality. The paper would benefit from a discussion of the physiological implications of clustering. It would also be helpful to elaborate about the potential mechanisms for clustering: diffusion and/or insertion. Finally, the authors should comment on work by Mackinnon's (PMID: 39739811) and Santana lab (PMID: 31371391) on two distinct models for clustering of proteins. 

      As suggested by the reviewer, we have revised the discussion to incorporate their comments. First, we have added the following text:

      “Upon BV2 activation, we observed significant nanoscale reorganization of P2X7. Both LPS and ATP (or BzATP) trigger P2X7 upregulation and clustering, increasing the overall number of surface receptors and the number of receptors per cluster, from one to three (Figure 6). By labeling BV2 cells with X7-uP shortly after IL-1β release, we were able to correlate the nanoscale distribution of P2X7 with the functional state of BV2 cells, consistent with the two-signal, synergistic model for IL-1β secretion observed in microglia and other cell types (Ferrari et al, 1996; Perregaux et al, 2000; Ferrari et al, 2006; Di Virgilio et al, 2017; He et al, 2017; Swanson et al, 2019). In this model, LPS priming leads to intracellular accumulation of pro-IL-1β, while ATP stimulation activates P2X7, triggering NLRP3 inflammasome activation and the subsequent release of mature IL-1β.

      What is the mechanism underlying P2X7 upregulation that leads to an overall increase in surface receptors—does it result from the lateral diffusion of previously masked receptors already present at the plasma membrane, or from the insertion of newly synthesized receptors from intracellular pools in response to LPS and ATP? Although our current data do not distinguish between these possibilities, a recent study suggests that the α1 subunit of the Na<sup>+</sup>/K<sup>+</sup>-ATPase (NKAα1) forms a complex with P2X7 in microglia, including BV2 cells, and that LPS+ATP induces NKAα1 internalization (Huang et al, 2024). This internalization appears to release P2X7 from NKAα1, allowing P2X7 to exist in its free form. We speculate that the internalization of NKAα1 induced by both LPS and ATP exposes previously masked P2X7 sites, including the allosteric AZ10606120 sites, thus making them accessible for X7-uP labeling.”

      Second, we have added a short paragraph at the end of the Discussion section addressing potential mechanisms by which P2X7 clustering may contribute to downstream inflammasome activation:

      “What mechanisms underlie P2X7 clustering in response to inflammatory signals? Several models have been proposed to explain membrane protein clustering, including recruitment to structural scaffolds (Feng & Zhang, 2009), partitioning into membrane domains enriched in specific chemical components such as lipid rafts (Simons & Ikonen, 1997), and self-assembly mechanisms (Sieber et al, 2007). These self-assembly mechanisms include an irreversible stochastic model (Sato et al, 2019) and a more recent reversible self-oligomerization model which gives rise to higher-order transient structures (HOTS) (Zhang et al, 2025). Supported by cryogenic optical localization microscopy with very high resolution (~5 nm), the HOTS model has been observed in various membrane proteins, including ion channels and receptors (Zhang et al, 2025). Furthermore, HOTS are suggested to be dynamically modulated and to play a functional role in cell signaling, potentially influencing both physiological and pathological processes (Zhang & MacKinnon, 2025). While this hypothesis is compelling, our current dSTORM data lack sufficient spatial resolution to confirm whether P2X7 trimers form HOTS via self-oligomerization. Further biophysical and ultra-high-resolution imaging studies are required to test this model in the context of P2X7 clustering.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This manuscript by Pournejati et al investigates how BK (big potassium) channels and CaV1.3 (a subtype of voltage-gated calcium channels) become functionally coupled by exploring whether their ensembles form early, during synthesis and intracellular trafficking, rather than only after insertion into the plasma membrane. To this end, the authors use the PLA technique to assess the formation of ion channel associations in the different compartments (ER, Golgi or PM), single-molecule RNA in situ hybridization (RNAscope), and super-resolution microscopy.

      Strengths:

      The manuscript is well written and addresses an interesting question, combining a range of imaging techniques. The findings are generally well-presented and offer important insights into the spatial organization of ion channel complexes, both in heterologous and endogenous systems.

      Weaknesses:

      The authors have improved their manuscript after revisions, and some previous concerns have been addressed.

      Still, the main concern about this work is that the current experiments do not quantitatively or mechanistically link the ensembles observed intracellularly (in the endoplasmic reticulum (ER) or Golgi) to those found at the plasma membrane (PM). As a result, it is difficult to fully integrate the findings into a coherent model of trafficking. Specifically, the manuscript does not address what proportion of ensembles detected at the PM originated in the ER. Without data on the turnover or half-life of these ensembles at the PM, it remains unclear how many persist through trafficking versus forming de novo at the membrane. The authors report the percentage of PLA-positive ensembles localized to various compartments, but this only reflects the distribution of pre-formed ensembles. What remains unknown is the proportion of total BK and Ca<sub>V</sub>1.3 channels (not just those in ensembles) that are engaged in these complexes within each compartment. Without this, it is difficult to determine whether ensembles form in the ER and are then trafficked to the PM, or if independent ensemble formation also occurs at the membrane. To support the model of intracellular assembly followed by coordinated trafficking, it would be important to quantify the fraction of the total channel population that exists as ensembles in each compartment. A comparable ensemble-to-total ratio across ER and PM would strengthen the argument for directed trafficking of pre-assembled channel complexes.

      We appreciate the reviewer’s thoughtful comment and agree that quantitatively linking intracellular hetero-clusters to those at the plasma membrane is an important and unresolved question. Our current study does not determine what proportion of ensembles at the plasma membrane originated during trafficking. It also does not quantify the fraction of total BK and Ca<sub>V</sub>1.3 channels engaged in these complexes within each compartment. Addressing this requires simultaneous measurement of multiple parameters—total BK channels, total Ca<sub>V</sub>1.3 channels, hetero-cluster formation (via PLA), and compartment identity—in the same cell. This is technically challenging. The antibodies used for channel detection are also required for the proximity ligation assay, which makes these measurements incompatible within a single experiment.

      To overcome these limitations, we are developing new genetically encoded tools to enable real-time tracking of BK and Ca<sub>V</sub>1.3 dynamics in live cells. These approaches will enable us to monitor channel trafficking and the formation of hetero-clusters, as detected by colocalization. Such experiments will provide insight into the origin and turnover of these hetero-clusters. While these experiments are beyond the scope of the current study, the findings in our current manuscript provide the first direct evidence that BK and CaV channels can form hetero-clusters intracellularly prior to reaching the plasma membrane. This mechanistic insight reveals a previously unrecognized step in channel organization and lays the foundation for future work aimed at quantifying ensemble-to-total ratios and determining whether coordinated trafficking of pre-assembled complexes occurs.

      This limitation is acknowledged in the discussion section, page 23. It reads: “Our findings highlight the intracellular assembly of BK-Ca<sub>V</sub>1.3 hetero-clusters, though limitations in resolution and organelle-specific analysis prevent precise quantification of the proportion of intracellular complexes that ultimately persist on the cell surface.”

      Reviewer #2 (Public review):

      Summary:

      The co-localization of large conductance calcium- and voltage activated potassium (BK) channels with voltage-gated calcium channels (CaV) at the plasma membrane is important for the functional role of these channels in controlling cell excitability and physiology in a variety of systems.

      An important question in the field is where and how BK and CaV channels assemble as 'ensembles' to allow this coordinated regulation: through preassembly early in the biosynthetic pathway, during trafficking to the cell surface, or once channels are integrated into the plasma membrane. These questions also have broader implications for the assembly of other ion channel complexes.

      Using an imaging-based approach, this paper addresses the spatial distribution of BK-CaV ensembles using both overexpression strategies in tsa201 and INS-1 cells and analysis of endogenous channels in INS-1 cells using proximity ligation and super-resolution approaches. In addition, the authors analyse the spatial distribution of mRNAs encoding BK and CaV1.3.

      The key conclusion of the paper that BK and Ca<sub>V</sub>1.3 are co-localised as ensembles intracellularly in the ER and Golgi is well supported by the evidence. However, whether they are preferentially co-translated at the ER requires further work. Moreover, whether intracellular pre-assembly of BK-Ca<sub>V</sub>1.3 complexes is the major mechanism for functional complexes at the plasma membrane in these models requires more definitive evidence, including both refinement of analysis of current data as well as potentially additional experiments.

      The reviewer raises the question of whether BK and Ca<sub>V</sub>1.3 channels are preferentially co-translated. In fact, I would like to propose that co-translation has not yet been clearly defined for this type of interaction between ion channels. In our current work, we 1) observed the colocalization between BK and Ca<sub>V</sub>1.3 mRNAs and 2) determined that 70% of BK mRNA in active translation also colocalizes with Ca<sub>V</sub>1.3 mRNA. We think these results favor the idea of translational complexes that can underlie the process of co-translation. However, and in total agreement with the Reviewer, the conclusion that the mRNAs for the two ion channels are co-translated would require further experimentation. For instance, mRNA co-regulation is one aspect that could help to define co-translation.

      To avoid overinterpretation, we have revised the manuscript to remove references to “co-translation” in the Results section and included the word “potential” when referring to co-translation in the Discussion section. We also clarified the limitations of our evidence in the Discussion that can be found on page 25: “It is important to note that while our data suggest mRNA coordination, additional experiments are required to directly assess co-translation.”

      Strengths & Weaknesses

      (1) Using proximity ligation assays of overexpressed BK and CaV1.3 in tsa201 and INS-1 cells the authors provide strong evidence that BK and CaV can exist as ensembles (ie channels within 40 nm) at both the plasma membrane and intracellular membranes, including ER and Golgi. They also provide evidence for endogenous ensemble assembly at the Golgi in INS-1 cells and it would have been useful to determine if endogenous complexes are also observed in the ER of INS-1 cells. There are some useful controls but the specificity of ensemble formation would be better determined using other transmembrane proteins rather than peripheral proteins (eg Golgi 58K).

      We thank the reviewer for their thoughtful feedback and for recognizing the strength of our proximity ligation assay data supporting BK–Ca<sub>V</sub>1.3 hetero-cluster formation at both the plasma membrane and intracellular compartments. As for specificity controls, we appreciate the suggestion to use transmembrane markers. To strengthen our conclusion, we have performed an additional experiment comparing the number of PLA puncta formed by the interaction of Ca<sub>V</sub>1.3 and BK channels with the number of PLA puncta formed by the interaction of Ca<sub>V</sub>1.3 channels and ryanodine receptors in INS-1 cells. As shown in the figure below, the number of interactions between Ca<sub>V</sub>1.3 and BK channels is significantly higher than that between Ca<sub>V</sub>1.3 and RyR<sub>2</sub>. Of note, RyR<sub>2</sub> is an ER-resident protein. These results provide additional evidence of endogenous complex formation in INS-1 cells. We have added this figure as a supplement.

      (2) Ensemble assembly was also analysed using super-resolution (dSTORM) imaging in INS-1 cells. In these cells only 7.5% of BK and CaV particles (endogenous?) co-localise, which was only marginally above chance based on scrambled images. More detailed quantification and validation of potential 'ensembles' needs to be made, for example by exploring nearest neighbour characteristics (but see point 4 below) to define the proportion of ensembles versus clusters of BK or CaV1.3 channels alone etc. For example, it is mentioned that a distribution of distances between BK and CaV is seen but data are not shown.

      We thank the reviewer for this comment. To address the request for more detailed quantification and validation of ensembles, we performed additional analyses:

      Proportion of ensembles vs isolated clusters: We quantified clusters within 200 nm and found that 37 ± 3% of BK clusters are near one or more CaV1.3 clusters, whereas 15 ± 2% of CaV1.3 clusters are near BK clusters (Figure 8–Supplementary 1A).

      Distance distribution: As shown in Figure 8–Supplementary 1B, the nearest-neighbor distance distribution for BK-to-CaV1.3 in INS-1 cells (magenta) is shifted toward shorter distances compared to randomized controls (gray), supporting preferential localization of BK–CaV1.3 hetero-clusters.

      Together, these analyses confirm that BK–CaV1.3 ensembles occur more frequently than expected by chance and exhibit an asymmetric organization favoring BK proximity to CaV1.3 in INS-1 cells. We have included these data and figures in the revised manuscript, as well as description in the Results section. 
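      The two quantities described above (the fraction of clusters of one channel that lie within a fixed radius of a cluster of the other, and the nearest-neighbor distance compared against randomized positions) can be illustrated with a minimal sketch. This is not the analysis pipeline used for the manuscript: the function names, the toy centroid coordinates, and the choice of a uniform randomization over a square field are all assumptions made for illustration only.

```python
import math
import random

def nearest_neighbor_distances(sources, targets):
    """For each source centroid, distance (nm) to its closest target centroid."""
    return [min(math.dist(s, t) for t in targets) for s in sources]

def fraction_within(sources, targets, radius_nm):
    """Fraction of source clusters with at least one target within radius_nm."""
    distances = nearest_neighbor_distances(sources, targets)
    return sum(d <= radius_nm for d in distances) / len(distances)

def randomized_fraction(sources, targets, radius_nm, field_nm,
                        n_shuffles=200, seed=0):
    """Chance-level fraction: targets are re-drawn uniformly in a square field.

    Uniform placement is an assumption of this sketch; a real control might
    instead restrict positions to the cell footprint.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_shuffles):
        shuffled = [(rng.uniform(0, field_nm), rng.uniform(0, field_nm))
                    for _ in targets]
        total += fraction_within(sources, shuffled, radius_nm)
    return total / n_shuffles

# Hypothetical cluster centroids in nm (not data from the study).
bk = [(100, 100), (500, 500), (900, 100), (2500, 2500)]
cav = [(150, 120), (520, 480), (2900, 300)]

bk_near_cav = fraction_within(bk, cav, 200)    # BK clusters near a CaV1.3 cluster
cav_near_bk = fraction_within(cav, bk, 200)    # CaV1.3 clusters near a BK cluster
chance = randomized_fraction(bk, cav, 200, field_nm=3000)
print(bk_near_cav, cav_near_bk, chance)
```

      Because the fraction is computed per source cluster, the measure is directional: the BK-to-CaV1.3 and CaV1.3-to-BK values need not agree, which is how an asymmetric organization can show up in this kind of analysis.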

      (3) The evidence that the intracellular ensemble formation is in large part driven by co-translation, based on co-localisation of mRNAs using RNAscope, requires additional critical controls and analysis. The authors now include data of co-localised BK protein that is suggestive but does not show co-translation. Secondly, while they have improved the description of some controls, mRNA co-localisation needs to be measured in both directions (eg BK - SCN9A as well as SCN9A to BK) especially if the mRNAs are expressed at very different levels. The relative expression levels need to be clearly defined in the paper. Authors also use a randomized image of BK mRNA to show specificity of co-localisation with CaV1.3 mRNA, however the mRNA distribution would not be expected to be random across the cell but constrained by ER morphology if co-translated so using ER labelling as a mask would be useful?

      We thank the reviewer for these constructive suggestions. We measured mRNA colocalization in both directions as recommended. As shown in the figure below, colocalization between KCNMA1 and SCN9A transcripts was comparable in both directions, with no statistically significant difference, supporting the specificity of the observed associations. We decided not to add this to the original figure to keep the figure simple. 

      We agree that co-localization of BK protein with BK mRNA is not conclusive evidence of co-translation, and we do not intend to mislead readers in our conclusion. Consequently, we were careful to avoid the term co-translation in the Results section and added the word “potential” when referring to co-translation in the Discussion section. We added a sentence to the Discussion to qualify our interpretation: “It is important to note that while our data suggest mRNA coordination, additional experiments are required to directly assess co-translation.”

      Author response image 1.

      (4) The authors attempt to define if plasma membrane assemblies of BK and CaV occur soon after synthesis. However, because the expression of BK and CaV occur at different times after transient transfection of plasmids more definitive experiments are required. For example, using inducible constructs to allow precise and synchronised timing of transcription. This would also provide critical evidence that co-assembly occurs very early in synthesis pathways - ie detecting complexes at ER before any complexes 

      We appreciate the reviewer’s insightful suggestion regarding the use of inducible constructs to synchronize transcription timing. This is an excellent approach and would allow direct testing of whether co-assembly occurs early in the synthesis pathway, including detection of complexes at the ER prior to plasma membrane localization. These experiments are beyond the scope of the present work but represent an important direction for future studies.

      We have added the following sentence to the Discussion section (page 24) to highlight this idea. “Future experiments using inducible constructs to precisely control transcription timing will enable more precise quantification of heterocluster formation in the ER compartment prior to plasma membrane insertion and reduce the variability introduced by differences in expression timing after plasmid transfection.” 

      (5) While the authors have improved the definition of hetero-clusters etc, it is still not clear, in the super-resolution analysis, how they separate a BK tetramer from a cluster of BK tetramers with the monoclonal antibody employed, ie each BK channel will have 4 binding sites (4 subunits in tetramer) whereas CaV1.3 has one binding site per channel. Thus, how do the authors discriminate between a single BK tetramer (molecular cluster) with potentially 4 antibodies bound compared to a cluster of 4 independent BK channels?

      We appreciate the reviewer’s thoughtful comment regarding the interpretation of super-resolution data. We agree that distinguishing a single BK tetramer from a cluster of multiple BK channels is challenging when using an antibody that can bind up to four sites per channel. To clarify, our analysis does not attempt to resolve individual subunits within a tetramer; rather, it focuses on the nanoscale spatial proximity of BK and Ca<sub>V</sub>1.3 signals.

      We want to note that this limitation applies only to the super-resolution maps in Figures 8C and 9D and does not affect Airyscan-based analyses or measurements of BK–Ca<sub>V</sub>1.3 proximity.

      To address how we might distinguish between a single BK tetramer and a cluster of multiple BK channels, we considered two contrasting scenarios. In the first case, we assume that all four α-subunits within a tetramer are labeled. Based on cryo-EM structures, a BK tetramer measures approximately 13 nm × 13 nm (≈169 nm²). Adding two antibody layers (primary and secondary) would increase the footprint by ~14 nm in each direction, resulting in an estimated area of ~41 nm × 41 nm (≈1681 nm²). Under this assumption, particles smaller than ~1681 nm² would likely represent individual tetramers, whereas larger particles would correspond to clusters of multiple tetramers. 
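      The footprint estimate in this first scenario is simple arithmetic, sketched below for transparency. The helper names are hypothetical, and the square geometry with a fixed ~14 nm antibody extension on each side follows the assumption stated above rather than any measured value.

```python
def labeled_footprint(channel_side_nm, antibody_extension_nm):
    """Side length (nm) and area (nm^2) of a square channel plus antibody layers.

    The antibody complex is assumed to extend the footprint by the same
    amount on both sides of each axis.
    """
    side = channel_side_nm + 2 * antibody_extension_nm
    return side, side ** 2

def exceeds_single_tetramer(particle_area_nm2, threshold_nm2):
    """True if a particle is larger than the estimated single-tetramer footprint."""
    return particle_area_nm2 > threshold_nm2

bare_side, bare_area = labeled_footprint(13, 0)         # unlabeled tetramer
labeled_side, labeled_area = labeled_footprint(13, 14)  # with antibody layers
print(bare_area, labeled_side, labeled_area)
```

      Under the stated assumptions this reproduces the values quoted in the text: 169 nm² for the bare tetramer and a 41 nm side (1681 nm²) for the fully labeled one. A particle above that threshold would be read as more than one tetramer; as noted in the response, a particle below it remains ambiguous.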

      In the second scenario, we propose that steric constraints at the S9–S10 segment, where the antibody binds, limit labeling to a single antibody per tetramer. If true, the localization precision would approximate 14 nm × 14 nm—the combined size of the antibody complex and the channel—close to the resolution limit of the microscope. To test this, we performed a control experiment using two antibodies targeting the BK C-terminal domain, raised in different species and labeled with distinct fluorophores. Super-resolution imaging revealed that only ~12% of particles were colocalized, suggesting that most channels bind a single antibody.

      If multiple antibodies could bind each tetramer, we would expect much greater colocalization.

      Although these data are not included in the manuscript, we have added the following clarification to the Results section (page 19): “It is important to note that this technique does not allow us to distinguish between labeling of four BK α-subunits within a tetramer and labeling of multiple BK channel clusters. Hence, particles smaller than ~1680 nm² may represent either a single tetramer or a cluster. This limitation applies to Figures 8C and 9D and does not affect measurements of BK–Ca<sub>V</sub>1.3 proximity.”

      Author response image 2.

      (6) The post-hoc tests used for one way ANOVA and ANOVA statistics need to be defined throughout

      We thank the reviewer for highlighting the need for clarity regarding our statistical analyses. We have now specified the post-hoc tests used for all one-way ANOVA and ANOVA comparisons throughout the manuscript, and updated figure legends.

      Reviewer #3 (Public review):

      Summary:

      The authors present a clearly written and beautifully presented piece of work demonstrating clear evidence to support the idea that BK channels and CaV1.3 channels can co-assemble prior to their insertion in the plasma membrane.

      Strengths:

      The experimental records shown back up their hypotheses and the authors are to be congratulated for the large number of control experiments shown in the ms.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have sufficiently addressed the specific points previously raised and the manuscript has improved clarity in those aspects. My main concern, which still remains, is stated in the public review.

      Reviewer #3 (Recommendations for the authors):

      I am content that the authors have attempted to fully address my previous criticisms.

      I have only three suggestions

      (1) I think the word Homo-clusters at the bottom right of Figure 1 is erroneously included.

      We thank the reviewer for bringing this to our attention. The figure has been corrected accordingly.

      (2) The authors should, for completeness, refer to the beta, gamma and LINGO subunit families in the Introduction and include appropriate references:

      Knaus, H. G., Folander, K., Garcia-Calvo, M., Garcia, M. L., Kaczorowski, G. J., Smith, M., & Swanson, R. (1994). Primary sequence and immunological characterization of beta-subunit of high conductance Ca2+-activated K+ channel from smooth muscle. The Journal of Biological Chemistry, 269(25), 17274-17278.

      Brenner, R., Jegla, T. J., Wickenden, A., Liu, Y., & Aldrich, R. W. (2000a). Cloning and functional characterization of novel large conductance calcium-activated potassium channel beta subunits, hKCNMB3 and hKCNMB4. The Journal of Biological Chemistry, 275(9), 6453-6461.

      Yan, J & R.W. Aldrich. (2010) LRRC26 auxiliary protein allows BK channel activation at resting voltage without calcium. Nature. 466(7305):513-516

      Yan, J & R.W. Aldrich. (2012) BK potassium channel modulation by leucine-rich repeat-containing proteins. Proceedings of the National Academy of Sciences 109(20):7917-22

      Dudem, S, Large RJ, Kulkarni S, McClafferty H, Tikhonova IG, Sergeant, GP, Thornbury, KD, Shipston, MJ, Perrino BA & Hollywood MA (2020). LINGO1 is a novel regulatory subunit of large conductance, Ca2+-activated potassium channels. Proceedings of the National Academy of Sciences 117 (4) 2194-2200

      Dudem, S., Boon, P. X., Mullins, N., McClafferty, H., Shipston, M. J., Wilkinson, R. D. A., Lobb, I., Sergeant, G. P., Thornbury, K. D., Tikhonova, I. G., & Hollywood, M. A. (2023). Oxidation modulates LINGO2-induced inactivation of large conductance, Ca2+-activated potassium channels. The Journal of Biological Chemistry, 299 (3) 102975.

      We agree with the reviewer’s suggestion and have revised the Introduction to include references to the beta, gamma, and LINGO subunit families. Appropriate citations have been added to ensure completeness and contextual relevance.

      Additionally, BK channels are modulated by auxiliary subunits, which fine-tune BK channel gating properties to adapt to different physiological conditions. The β, γ, and LINGO1 subunits each contribute distinct structural and regulatory features: β-subunits modulate Ca²⁺ sensitivity and can induce inactivation; γ-subunits shift voltage-dependent activation to more negative potentials; and LINGO1 reduces surface expression and promotes rapid inactivation (18-24). These interactions ensure precise control over channel activity, allowing BK channels to integrate voltage and calcium signals dynamically in various cell types.

      (3) I think it may be more appropriate to include the sentence "The probes against the mRNAs of interest and tested in this work were designed by Advanced Cell Diagnostics." (P16, right hand column, L12-14) in the appropriate section of the Methods, rather than in Results.

      We thank the reviewer for this helpful suggestion. In response, we have relocated the sentence to the appropriate section of the Methods, where it now appears with relevant context.

    1. Note: This response was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary:

      The manuscript titled "Unravelling the Progression of the Zebrafish Primary Body Axis with Reconstructed Spatiotemporal Transcriptomics" presents a comprehensive analysis of the development of the primary body axis in zebrafish by integrating bulk RNA-seq, 3D images, and Stereo-Seq. The authors first clearly demonstrate the application of Palette for integrating RNA-seq and Stereo-Seq using published spatial transcriptomics data of Drosophila embryos. Subsequently, they produced serial bulk RNA-seq data for certain developmental stages of Danio rerio embryos and utilized published Stereo-Seq data. Through robust validation, the authors observe the molecular network involved in AP axis formation. While the authors show that integrating bulk RNA-seq data with Stereo-Seq improves spatial resolution, additional proof is required to demonstrate the extent of this improvement.

      Response: We thank the reviewer for the positive feedback on our Palette pipeline, zSTEP construction and analysis of primary body axis development. We appreciate the constructive suggestions, which we can implement to improve our manuscript. As pointed out by the reviewer, some analysis procedures were not described in sufficient detail. To address this, we have added more explanatory text and additional schematic diagrams to make the methods clearer and easier to follow. We also thank the reviewer for the meticulous reading and for reminding us to include parameters, references and essential text, which significantly improves the quality and rigour of the manuscript. Furthermore, as noted by the reviewer, the extent of the improvement in spatial resolution was not clearly demonstrated in the manuscript. Therefore, we have provided an additional figure comparing the original expression on the stacked Stereo-seq slices and on the 3D live image with the expression from zSTEP; the results indicate that zSTEP provides better, more continuous expression patterns. Two tasks remain, which we expect to complete within the next month. We hope our responses have addressed the concerns raised by the reviewer, and we are pleased to provide any additional proof as needed.

      Major Comments:

      1. Lines 66-68: Discuss the limitations of existing tools and explicitly state the advantages of using Palette.

      Response: We thank the reviewer for the valuable suggestion. We have added the following new text after line 68 to emphasize the features and advantages of Palette.

      "Newly developed tools are committed to integrating bulk and/or scRNA-seq data with ST data to enhance spatial resolution, focusing on expression at the spot level. However, gene expression patterns are closely correlated to the biological functions and are more critical for understanding biological processes. Therefore, a tool focusing on inferring spatial gene expression patterns would be desirable."

      1. Body Pattern Genes Analysis: For both Drosophila and Danio rerio, it would be valuable to examine body pattern genes in Stereo-Seq and apply Palette to determine if the resolution of the segments improves or merges. The resolution of the A-P axis is convincing, but further evidence for other segments would be beneficial.

      Response: We thank the reviewer for the suggestions. For the Drosophila data, we only used two adjacent slices for Palette performance assessment, and thus were only able to evaluate the expression patterns within the slice.

      For the zebrafish data, although we have constructed zSTEP as a 3D transcriptomic atlas, we have to admit that the left-right (LR) and dorsal-ventral (DV) patterning is not fully satisfactory. Here we show a section from the dorsal part of the 16 hpf zSTEP that displays a relatively well-defined left-right pattern (Fig. 2). Along the left-right axis, the notochord cells are centrally located, flanked by somite cells on either side, with the pronephros cells outermost.

      One reason for the limited LR and DV patterning is that the original annotation of the ST data does not clearly distinguish all the cell types. Another likely reason is that cell positions become disordered when ST slices are stacked. Thus, zSTEP is most suitable for investigating AP patterns, while its performance on LR and DV patterns may not reach the same level of accuracy.

      See response letter for the figure.

      1. Figure 2d: Include the A-P line for which the intensity profile was plotted in the main figure, rather than just in the supplementary material. Additionally, consider simplifying the plot by not combining three lines into one, as it complicates the interpretation of observations.

      Response: We thank the reviewer for the helpful suggestions. We have updated Figure 2d and Figure S1b by adding an A-P line to each subfigure (Fig. 3). Additionally, as the reviewer suggested, we have separated the intensity plots so that each subfigure now includes a dedicated intensity plot along the A-P axis.

      See response letter for the figure.

      1. Drosophila Data Analysis: While the alignment and validation of Danio rerio sections are clearly explained, the analysis and validation of Drosophila data are insufficiently detailed. Provide a more thorough explanation of how the intensity profiles between BDGP in situ data and Stereo-Seq data are adjusted.

      Response: We thank the reviewer for raising this issue. To make the analysis procedure clearer, we have updated Figure 2a (Fig. 4) and added explanatory text to the figure legend to describe the processing procedure for the Drosophila ST data.

      See response letter for the figure.

      Additionally, the following sentences have been added into the Methods section to describe the generation of the intensity profiles.

      "The intensity plot profiles along AP axis were generated through the following steps: The expression pattern plot images or in situ hybridization images were imported into ImageJ and converted to grayscale. The colour was then inverted, and a line of a certain width (here set as 10) was drawn across from the anterior part to the posterior part (Fig. S1a). The signal intensities along the width of the line were measured and imported into R for generating intensity plots."

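      The profile-extraction steps quoted above (grayscale conversion, inversion, a line of width 10 drawn from anterior to posterior) can be sketched outside ImageJ as follows. This is only an illustrative sketch: it assumes an 8-bit grayscale image stored as a list of rows, a horizontal line with anterior on the left, and that the intensity at each position is the mean across the line width. The function name `ap_intensity_profile` and its parameters are hypothetical and are not taken from the manuscript.

```python
def ap_intensity_profile(gray, center_row, line_width=10, max_value=255):
    """Mean inverted intensity at each position along a horizontal line.

    gray        -- 2D list of grayscale pixel values (rows x columns),
                   assumed to have anterior on the left (an assumption).
    center_row  -- row index on which the line is centred.
    line_width  -- number of rows averaged at each position (10 in the text).
    max_value   -- maximum pixel value used for inversion (255 for 8-bit).
    """
    top = max(center_row - line_width // 2, 0)
    band = gray[top:top + line_width]          # rows covered by the line
    n_cols = len(gray[0])
    profile = []
    for col in range(n_cols):
        # Invert so that dark staining becomes high signal, then average
        # across the width of the line.
        inverted = [max_value - row[col] for row in band]
        profile.append(sum(inverted) / len(inverted))
    return profile


# Toy image: white background with one dark stripe in column 2.
toy_image = [[255, 255, 0, 255] for _ in range(20)]
toy_profile = ap_intensity_profile(toy_image, center_row=10)
```

      The resulting list has one value per position along the line and could be exported for plotting in R, as the quoted Methods text describes.
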
      1. Figure 3d: Present a plot with the expected expression profiles of the three genes if the embryo is aligned as anticipated.

      Response: We thank the reviewer for this helpful suggestion, which improves the clarity of our manuscript. We have added the following subfigure as Figure 3d (Fig. 5) to show the expected expression profiles of the three midline genes along the left-right axis.

      See response letter for the figure.

      1. Analysis Without Palette: Between lines 277-438, the outcome of using Palette with bulk RNA-seq and Stereo-Seq is convincing. However, consider the following:

      o What would be the observations if the analysis were conducted solely with Stereo-Seq data, without incorporating bulk RNA-seq data and employing Palette?

      Response: We thank the reviewer for raising this important question. Here we compare ST expression on stacked Stereo-seq slices, ST expression projected onto 3D live images, and the Palette-inferred expression (Fig. 6). The stacked ST slices do not fully reflect the zebrafish morphology, and the gene expression appears sparse, making it look messy (the first row). After projecting ST expression onto the live image, the expression patterns can be observed on the zebrafish morphology, but the expression is still sparsely distributed in spots (the second row). In contrast, the expression patterns captured by Palette in zSTEP are more continuous (the third row) and more similar to those observed in in situ hybridization images (the fourth row). We are considering putting these analyses into a supplementary figure.

      See response letter for the figure.

      o This study uses only Stereo-Seq as the spatial transcriptomics reference. It would strengthen the argument to use at least one other spatial transcriptomics method, such as Visium or MERFISH, in conjunction with bulk RNA-seq and Palette, to demonstrate whether Palette consistently improves gene expression resolution.

      Response: We thank the reviewer for raising this question. To demonstrate that Palette is broadly applicable, it is necessary to test its performance using different types of ST references. We plan to perform additional analyses to evaluate Palette using Visium and MERFISH data, respectively, as ST references. Note that the Palette pipeline uses only the genes shared between the bulk and ST data for inference. As MERFISH detects only hundreds of genes, Palette can infer the expression patterns of only these genes. As reported by Liu et al. (2023), MERFISH can independently resolve distinct cell types and spatial structures, and we therefore expect Palette to perform well with MERFISH as the ST reference. We have already started the analyses and expect to complete them within the next month. We will add the analyses to the GitHub repository as separate tutorials.

      Reference:

      Liu, J. et al. Concordance of MERFISH spatial transcriptomics with bulk and single-cell RNA sequencing. Life Sci Alliance 6 (2023).

      1. PDAC Data Analysis: Provide a more detailed explanation of the PDAC data analysis and use appropriate colors in the tissue images to clearly distinguish cell types.

      Response: We thank the reviewer for the suggestions. We have updated the colours used in the tissue images to be consistent with the colours in the tissue clustering analysis. We have also added a subfigure to the supplementary figure (Fig. 7), with more explanatory text in the figure legend, to provide a more thorough explanation of the analysis.

      See response letter for the figure.

      1. Comparison with Other Methods: State the limitations of not using STitch3D and Spateo for alignment and explain why these methods were not employed.

      Response: We thank the reviewer for this constructive comment. We fully agree that published alignment algorithms would be helpful in our analysis. Currently, the slice alignment is adjusted manually, so the main limitation of not using these tools is that manual operation may introduce bias compared with an alignment generated by a computational algorithm. STitch3D and Spateo are not included in this study for two reasons. First, these two tools were posted only recently, after our analyses were largely completed; we therefore only mentioned them in the Discussion section. Second, we do not want to embed too many external tools in our analysis, which could make the pipeline harder for researchers to use. Specifically, STitch3D and Spateo are configured to run in a Python environment, while Palette is based on R packages. Moreover, our current manual alignment achieves the desired performance without these tools. Nevertheless, we value this suggestion and plan to compare the performance of manual alignment with that of the two alignment tools. At present, we have a preliminary comparison scheme and have collected the relevant datasets. We hope to complete this analysis within the next 1 to 2 weeks.

      Minor Comments:

      1. References: Add references to the statements in lines 51-53.

      Response: We thank the reviewer for reminding us of the missing references. We have added the works of Junker et al. (2014), Liu et al. (2022), Chen et al. (2022), Wang et al. (2022), Shi et al. (2023) and Satija et al. (2015) as references in line 53 as follows.

      "Thus, great efforts are ongoing to construct gene expression maps of these models with higher resolution, depth, and comprehensiveness1-6."

      References:

      1. Junker, J.P. et al. Genome-wide RNA Tomography in the zebrafish embryo. Cell 159, 662-675 (2014).
      2. Liu, C. et al. Spatiotemporal mapping of gene expression landscapes and developmental trajectories during zebrafish embryogenesis. Dev Cell 57, 1284-1298 e1285 (2022).
      3. Chen, A. et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185, 1777-1792 e1721 (2022).
      4. Wang, M. et al. High-resolution 3D spatiotemporal transcriptomic maps of developing Drosophila embryos and larvae. Dev Cell 57, 1271-1283 e1274 (2022).
      5. Shi, H. et al. Spatial atlas of the mouse central nervous system at molecular resolution. Nature 622, 552-561 (2023).
      6. Satija, R. et al. Spatial reconstruction of single-cell gene expression data. Nature biotechnology 33, 495-502 (2015).

      1. Scientific Name Consistency: Ensure consistency in using either "Danio rerio" or "zebrafish" throughout the manuscript.

      Response: We thank the reviewer for this suggestion. We have changed "Danio rerio" to "zebrafish" so that the term is used consistently throughout the manuscript.

      1. Related References: Include the following relevant references:

      o https://academic.oup.com/bib/article/25/4/bbae316/7705532

      o https://www.life-science-alliance.org/content/6/1/e202201701

      Response: We thank the reviewer for bringing these two relevant works to our attention. Baul et al. (2024) presented STGAT, which leverages Graph Attention Networks to integrate spatial transcriptomics and bulk RNA-seq, and Liu et al. (2023) demonstrated the concordance of MERFISH ST with bulk and single-cell RNA-seq. Both are excellent works and relevant to ours. We have added these two references in line 61 and line 68, respectively.

      References:

      Baul, S. et al. Integrating spatial transcriptomics and bulk RNA-seq: predicting gene expression with enhanced resolution through graph attention networks. Brief Bioinform 25 (2024).

      Liu, J. et al. Concordance of MERFISH spatial transcriptomics with bulk and single-cell RNA sequencing. Life Sci Alliance 6 (2023).

      1. Figure 1a: In the Venn diagram, include the number of genes in the bulk and Stereo-Seq datasets, as well as the number of overlapping genes.

      Response: We thank the reviewer for reminding us to include these important numbers. In the current manuscript, we have added the following sentences to the Methods section to provide the gene numbers (Fig. 8). Because the Venn diagram in Figure 1a serves as a schematic representation, we did not add the gene numbers to it, as these vary depending on the actual data.

      "Palette was performed on the aligned slices using the overlapped genes. For the 10 hpf embryo, there were 24,658 genes in the bulk data, 18,698 genes in the Stereo-seq data, and 16,601 overlapped genes. For the 12 hpf embryo, there were 23,018 genes in the bulk data, 18,948 genes in the Stereo-seq data, and 16,401 overlapped genes. For the 16 hpf embryo, there were 24,357 genes in the bulk data, 23,110 genes in the Stereo-seq data, and 19,539 overlapped genes."

      See response letter for the figure.
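
      The gene counts quoted above come from intersecting the genes detected in the bulk data with those detected in the Stereo-seq data. A minimal sketch of that bookkeeping is given below; the function name `overlap_summary` and the short gene lists are purely illustrative placeholders and do not reproduce the actual datasets.

```python
def overlap_summary(bulk_genes, st_genes):
    """Count genes in each dataset and in their intersection."""
    bulk, st = set(bulk_genes), set(st_genes)
    shared = bulk & st                     # genes usable for inference
    summary = {
        "bulk": len(bulk),
        "st": len(st),
        "overlap": len(shared),
        "fraction_of_bulk": len(shared) / len(bulk),
        "fraction_of_st": len(shared) / len(st),
    }
    return summary, sorted(shared)


# Illustrative placeholder gene lists (not the real data).
toy_bulk = ["sox2", "tbxta", "myod1", "shha"]
toy_st = ["tbxta", "shha", "krt4"]
toy_summary, toy_shared = overlap_summary(toy_bulk, toy_st)
```

      Applied to real gene lists, the same summary would yield the three numbers reported per stage (bulk genes, Stereo-seq genes and overlapped genes).
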

      1. Figure 1 Improvement: Enlarge Figure 1 and reduce repetitive elements, such as parts of the deconvolution and Figure 1b.

      Response: We thank the reviewer for the helpful suggestion. We agree with the reviewer that the deconvolution sections appear repetitive. We have updated Figure 1 (Fig. 9) by replacing these repetitive elements with a clearer and simpler diagram.

      See response letter for the figure.

      1. Figure 3f: Explain the black discontinuous line in the plot.

      Response: We thank the reviewer for the reminder, and we apologize for the missing explanation. We have added an explanation of the black discontinuous line to the legend of Figure 3 (Fig. 10).

      See response letter for the figure.

      1. Line 610: State the percentage of unpaired imaging spots.

      Response: We thank the reviewer for the reminder. We are sorry for not including the numbers of paired and unpaired spots. We have added the number of paired spots, with its percentage of the total spots, to the Methods section as follows.

      "The numbers of mapped spots for the 10 hpf, 12 hpf and 16 hpf embryos are 15,379 (69.4% of the total spots), 14,697 (70.5% of the total spots) and 21,605 (77.2% of the total spots), respectively."

      1. Lines 616-618: Specify the unit for the spot diameter.

      Response: We thank the reviewer for the reminder. Again, we are sorry for not including the spot diameter in the previous version of the manuscript. We have added the spot diameter to the Methods section as follows.

      "In the Stereo-seq data, each spot contained 15 × 15 DNA nanoball (DNB) spots (The diameter of each spot is near 10 μm)."

      Reviewer #1 (Significance):

      This algorithm will be useful not only for the field of developmental biology but also for wider applications in spatial omics. Although I have expertise in spatial omics technology development, my understanding of computational biology is limited, which restricts my ability to fully evaluate the Palette algorithm presented in this paper.

      Response: We thank the reviewer for recognizing our work, and we greatly appreciate the constructive suggestions. Although the reviewer acknowledged limited expertise in computational biology, the comments are highly professional and valuable. Following these suggestions, we have not only included more explanatory text and figures to make the analysis procedures clearer and easier to follow, but also supplied important parameters that were missing from the previous manuscript. We have also provided an extra figure to demonstrate the improvement zSTEP brings to gene expression patterns. We believe that our work is now more rigorous and easier to understand, and we will continue working on the remaining issues as planned. We thank the reviewer again.

      Reviewer #2 (Evidence, reproducibility and clarity):

      The authors of the study introduce the Palette method, a novel approach designed to infer spatial gene expression patterns from bulk RNA-sequencing (RNA-seq) data. This method is complemented by the development of the DreSTEP 3D spatial gene expression atlas of zebrafish embryos, establishing a comprehensive resource for visualizing gene expression and investigating spatial cell-cell interactions in developmental biology.

      Response: We sincerely appreciate the reviewer's positive feedback on our Palette pipeline and the zSTEP 3D spatial expression atlas of zebrafish embryos. We also thank the reviewer for the professional comments and constructive suggestions. The reviewer raised the concerns from the aspect of algorithm design and computational biology, which we did not address well in our previous manuscript. We agree with the reviewer that we did not clarify the selection criteria of the parameters in detail, and we are now working on the additional analyses to address this issue.

      We also agree with the reviewer that we did not provide enough discussion of the strategies used in the pipeline, the features of Palette, and the application scenarios of Palette and zSTEP. Stating these aspects clearly is important for wide use of our tools. In this revised version, we have added more paragraphs to the Discussion section to address this issue. Additionally, we acknowledge that we did not adequately demonstrate the computational efficiency and computational requirements, which are important for researchers. We are also working on additional analyses to address this issue.

      Finally, we thank the reviewer again for the professional and constructive suggestions. These suggestions are addressable, and by following them, we believe our manuscript will see a significant improvement, especially in the Palette pipeline part, making the pipeline more rigorous and easier to access. We are confident that we can complete the planned additional tasks within the next 1-2 months.

      1. The efficacy of the Palette method may be compromised by its dependency on the quality of the reference spatial transcriptomics data. As highlighted in the study, variations in data quality can lead to significant challenges in reconstructing accurate spatial expression patterns from bulk data. This underscores the necessity of evaluating quality parameters, such as the number of gene detections and spatial resolution, to ensure reliable outcomes. Additional studies should rigorously assess how these quality factors influence the accuracy and efficiency of the algorithm in various data contexts, particularly under diverse conditions of gene detection.

      Response: We thank the reviewer for this valuable suggestion. We agree that the quality of the reference ST data may greatly influence the performance and efficacy of Palette, and we have added paragraphs to the Discussion section to further discuss the impact of ST data quality on Palette performance. As mentioned by the reviewer, gene detection and spatial resolution are two important parameters that can influence Palette performance. Low gene detection may impair the clustering process, so that the cell types of spots are not well distinguished. To evaluate the performance of Palette when the ST data show low gene detection, we plan to apply Palette using MERFISH data, which capture only hundreds of genes, as the ST reference. Moreover, we will investigate the impact of spatial resolution on Palette performance by merging ST spots to simulate lower-resolution scenarios, and the impact of gene detection by randomly reducing the number of detected genes. By comparing the expression patterns inferred with ST data of different spatial resolutions or different numbers of detected genes, we can better assess the performance of Palette and provide guidance to researchers on the ST data requirements for optimal performance. Owing to the limited response time, these analyses will take another month to complete after this round of revision.
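
      The two planned degradations (merging ST spots to lower the spatial resolution, and randomly reducing the detected genes) could be simulated along the following lines. This is a sketch under stated assumptions, not the planned analysis itself: spots are assumed to sit on an integer grid and to be stored as a dictionary mapping `(x, y)` to per-gene counts, and the helper names `merge_spots` and `drop_genes` are hypothetical.

```python
import random
from collections import defaultdict


def merge_spots(spots, factor):
    """Sum the counts of all spots that fall in the same factor x factor block."""
    merged = defaultdict(lambda: defaultdict(int))
    for (x, y), counts in spots.items():
        block = (x // factor, y // factor)   # coarser grid coordinate
        for gene, value in counts.items():
            merged[block][gene] += value
    return {block: dict(counts) for block, counts in merged.items()}


def drop_genes(spots, keep_fraction, seed=0):
    """Keep a random subset of genes to mimic lower gene detection."""
    genes = sorted({gene for counts in spots.values() for gene in counts})
    n_keep = max(1, round(len(genes) * keep_fraction))
    kept = set(random.Random(seed).sample(genes, n_keep))
    return {
        xy: {gene: value for gene, value in counts.items() if gene in kept}
        for xy, counts in spots.items()
    }


# Toy data: three spots, two genes (placeholder values).
toy_spots = {(0, 0): {"a": 1, "b": 2}, (1, 0): {"a": 3}, (2, 3): {"b": 5}}
coarse = merge_spots(toy_spots, factor=2)
reduced = drop_genes(toy_spots, keep_fraction=0.5)
```

      Running the inference on `coarse` or `reduced` versions of a reference and comparing the outputs with those from the full reference would give one way to quantify sensitivity to each factor.
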

      1. The methodology raises pertinent questions regarding how the clustering results from different algorithms may affect the reconstructions by the Palette method. The authors would better provide a detailed discussion/comparison of clustering processes that optimize the reconstruction of spatial patterns, ensuring precision in the downstream analyses.

      Response: We thank the reviewer for the constructive comments. We agree that differences in clustering results would affect Palette's inference. In the Palette pipeline, rather than developing a new clustering methodology, we employ BayesSpace for spot clustering, which considers both the transcriptional similarity of spots and their neighbouring structure. Researchers may adjust the parameters of the BayesSpace package to achieve optimal clustering results. In most cases, spot identities have been assigned through UMAP analysis, which considers only transcriptional differences and ignores spatial information. This kind of clustering strategy can produce an intricate arrangement of spots belonging to different clusters and may result in sparse gene expression in the Palette output, unlike the patterns in bona fide tissues. Therefore, a suitable clustering strategy will help capture the local patterns.

      Moreover, the Palette pipeline can also use clustering results derived from tissue histomorphology. Using tissue histomorphology for clustering is a good choice, as it is closer to the real tissue structure. The following figure (Fig. 11) displays the Palette performance on PDAC datasets using both the spatial clustering and histomorphology clustering strategies. The result using histomorphology clustering captures the weak pattern (indicated by the red circle) that was missed when using spatial clustering (Fig. 11d).

      See response letter for the figure.

      1. The choice to utilize only highly expressed genes in the initial stages of the Palette algorithm also warrants further exploration. Addressing the criteria for determining which genes qualify as "highly expressed" and outlining robust cutoff will enhance the algorithm's rigor and applicability. Similarly, in the iterative estimation of gene expression across spatial spots, establishing optimal iteration conditions is crucial. Implementing a loss function may offer a systematic method for concluding iterations, thus refining computational efficiency.

      Response: We thank the reviewer for the professional suggestions. As pointed out by the reviewer, the selection of highly expressed genes and the number of iterations are two important parameters in our pipeline. The definition and the number of highly expressed genes are important for achieving satisfactory clustering performance. We tested the impact of different numbers of highly expressed genes on clustering performance in our preliminary analyses, but we did not summarize these tests or specify the parameters. Therefore, we plan to include a supplementary figure showing the clustering performance under different definitions and different numbers of highly expressed genes. For the iteration conditions, we tested different numbers of iterations to find one that achieves stable expression in each spot. The following figure (Fig. 1) shows the results of running Palette with different numbers of iterations. We randomly selected 20 cells and compared their expression across tests with varying numbers of iterations. The results indicate that, for an ST dataset with 819 spots, the expression in each spot becomes nearly stable after 5000 iterations. We previously did not consider computational efficiency, and the reviewer's suggestion to implement a loss function to determine the optimal number of iterations is valuable. We greatly appreciate this suggestion and plan to apply a loss function to determine the optimal number of iterations for ST datasets of different sizes. This will guide researchers in selecting the number of iterations and improve computational efficiency.

      See response letter for the figure.
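
      One generic way to turn the reviewer's loss-function suggestion into a stopping rule is to monitor how much the per-spot values change between consecutive iterations and stop once that change falls below a tolerance. The sketch below illustrates only this idea: the update step is a toy placeholder rather than Palette's actual update, and the names `iterate_until_stable`, `toy_update`, `tol` and `max_iter` are hypothetical.

```python
def iterate_until_stable(update, state, tol=1e-6, max_iter=5000):
    """Repeat `update` until the mean absolute change per value is below tol.

    Returns the final state, the number of iterations run, and the last loss.
    """
    loss = float("inf")
    for iteration in range(1, max_iter + 1):
        new_state = update(state)
        # Loss: mean absolute change between consecutive iterations.
        loss = sum(abs(a - b) for a, b in zip(new_state, state)) / len(state)
        state = new_state
        if loss < tol:
            return state, iteration, loss
    return state, max_iter, loss


def toy_update(values):
    # Placeholder update with a known fixed point at 2.0 for every value.
    return [0.5 * v + 1.0 for v in values]


final_state, n_iter, final_loss = iterate_until_stable(toy_update, [0.0, 10.0])
```

      With a rule of this kind, the number of iterations adapts to the dataset instead of being fixed in advance, and `max_iter` acts only as a safety cap.
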

      1. Performance metrics relating to processing speed and computational demands remain inadequately addressed in the current framework. Understanding how the Palette method scales across varying gene counts and bulk RNA-seq datasets will be essential for potential applications in larger biological contexts. Notably, the quantitative demands of analyzing 20,000 genes when processing 10, 100, or 1,000 bulk RNA profiles must be articulated to guide researchers in planning accordingly.

      Response: We thank the reviewer for this valuable and professional suggestion. In our previous analyses, we did not consider computational efficiency, processing speed or computational demands, which are important information for potential users. To address this issue, we will first list our computer configuration. Under this configuration, we plan to run Palette on datasets with different numbers of overlapped genes and on ST references with varying spot numbers, and then summarize the running times in a table. This will help researchers estimate the running time for their datasets and plan their analyses. We will begin the analyses soon and expect to complete them within the next 1 to 2 months.
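
      A running-time table of the kind described above could be collected with a small timing harness such as the one sketched here. The workload function is a stand-in for the real pipeline call, and the names `benchmark` and `placeholder_workload` are hypothetical; only the timing pattern is the point.

```python
import time


def benchmark(run, sizes, repeats=3):
    """Time `run(size)` for each size and keep the best of several repeats."""
    rows = []
    for size in sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(size)
            timings.append(time.perf_counter() - start)
        rows.append({"size": size, "best_seconds": min(timings)})
    return rows


def placeholder_workload(n_items):
    # Stand-in for the real analysis; replace with the actual pipeline call.
    return sum(i * i for i in range(n_items))


timing_rows = benchmark(placeholder_workload, sizes=[10, 100, 1000])
```

      Reporting the best of several repeats reduces the influence of background load on the machine; the hardware configuration should be reported alongside the table.
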

      Minor opinions:

      1. Despite the promising advances offered by the zebrafish 3D reconstruction, there is a lack of details regarding numbers of the spatial transcriptomics (ST) data utilized, and the number of bulk RNA-seq data employed in the analyses. These parameters need to be clarified.

      Response: We thank the reviewer for reminding us of these parameters. We are sorry for not including these parameters in our previous manuscript. We have now included the numbers of bulk, ST and overlap genes in the Methods section as follows (Fig. 12).

      "Palette was performed on the aligned slices using the overlapped genes. For the 10 hpf embryo, there were 24,658 genes in the bulk data, 18,698 genes in the Stereo-seq data, and 16,601 overlapped genes. For the 12 hpf embryo, there were 23,018 genes in the bulk data, 18,948 genes in the Stereo-seq data, and 16,401 overlapped genes. For the 16 hpf embryo, there were 24,357 genes in the bulk data, 23,110 genes in the Stereo-seq data, and 19,539 overlapped genes."

      See response letter for the figure.

      1. Issues regarding spatial cell-cell communication, especially concerning interactions over longer distances, necessitate careful consideration. Introducing spatial distance constraints could help formulate more realistic models of cellular interactions, a vital aspect of embryonic development.

      Response: We thank the reviewer for this essential comment. We agree that spatial distance is an essential factor when investigating in vivo cell-cell communication during embryonic development. Therefore, we employed CellChat for the spatial cell-cell communication analysis; it can infer and visualize spatial cell-cell communication networks for ST datasets and uses spatial distance as a constraint on the computed communication probability. However, during our analyses we observed interactions between cell types over longer distances, as mentioned by the reviewer. We then investigated how these long-distance interactions arose. As an example, we show the FGF interaction between tail bud and neural crest cells from our spatial cell-cell analysis, in which the distance between the two cell types appears quite large (Fig. 13). We labelled tail bud cells and neural crest cells on the selected midline section and observed that, although most neural crest cells are distributed anteriorly, a small number are located in the tail, close to the tail bud cells. Therefore, the observed interaction between tail bud and neural crest cells is likely due to their adjacent distribution in the tail region, while the anterior placement of the neural crest spot in the spatial cell-cell communication plot reflects the anterior positioning of most neural crest cells. As a result, the distances shown in the spatial cell-cell communication plot are not the real distances between the two cell types.

      In most cases in our spatial cell-cell communication analyses, the observed interactions over longer distances are likely influenced by this visualization strategy. Additionally, pre-processing the dataset may enhance the performance of the analyses. Here we performed systematic analyses of the entire embryo, which can make the interactions between cell types appear massive. To investigate specific biological questions, researchers can subset cell types of interest or categorize them into different subtypes based on their positions.

      See response letter for the figure.
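
      The point made above, that the distance between the plotted positions of two cell types can greatly exceed the distance between their closest cells, can be illustrated with a small calculation. The coordinates below are invented for illustration only (a tail-located group and a mostly anterior group with one member near the tail), and treating the plotted position as the centroid of each cell type is an assumption of this sketch.

```python
import math


def centroid(points):
    """Mean position of a set of 2D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def centroid_distance(group_a, group_b):
    """Distance between the mean positions of two groups."""
    return math.dist(centroid(group_a), centroid(group_b))


def nearest_pair_distance(group_a, group_b):
    """Smallest distance between any point in group_a and any in group_b."""
    return min(math.dist(p, q) for p in group_a for q in group_b)


# Invented coordinates: x runs from anterior (0) to posterior (100).
tail_bud = [(95, 10), (97, 12), (99, 11)]
neural_crest = [(10, 10), (12, 14), (15, 9), (20, 12), (94, 11)]

between_centroids = centroid_distance(tail_bud, neural_crest)
between_nearest = nearest_pair_distance(tail_bud, neural_crest)
```

      In this toy layout the centroids are far apart while the nearest cells are adjacent, which mirrors the situation described for the tail bud and neural crest cells.
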

      1. Evaluation metrics such as the Adjusted Rand Index (ARI) and Root Mean Square Error (RMSE) represent critical tools for systematically measuring the similarity of inferred spatial patterns, yet their specific application within this context should be elaborated.

      Response: We thank the reviewer for recommending these two metrics. We have applied them to evaluate the similarity between the expression patterns (Fig. 14). Including these statistics makes our comparisons of expression patterns more quantitative and convincing. We have also added the following text to the Methods section to describe how the two values are calculated.

      "The Adjusted Rand Index (ARI) and Root Mean Square Error (RMSE) were used to evaluate the similarity of the expression patterns. The expression patterns of in situ hybridization images were considered as the expected values, and the expression patterns of ST data and inferred expression patterns were compared to the expected values. Common positions along the AP axis within all three expression profiles were used, and the RMSE were calculated based on the scaled intensity of these positions. Values greater than the threshold were set to 1; otherwise, they were set to 0, and the ARI was then calculated based on the intensity category. Higher ARI and lower RMSE indicate greater similarity."

      See response letter for the figure.
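      The calculation described in the quoted Methods text can be sketched as follows. This is a minimal illustration and not the authors' code: the intensity values and the threshold of 0.5 are hypothetical (the response does not state the threshold), and the ARI is implemented directly from its standard pair-counting definition.

```python
import math
from collections import Counter

def rmse(expected, observed):
    """Root mean square error between two equally long intensity profiles."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(expected, observed)) / len(expected))

def binarize(values, threshold):
    """Values greater than the threshold become 1, all others 0."""
    return [1 if v > threshold else 0 for v in values]

def adjusted_rand_index(labels_a, labels_b):
    """ARI from pair counts of the contingency table of two labelings."""
    total_pairs = math.comb(len(labels_a), 2)
    same_both = sum(math.comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(math.comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(math.comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / total_pairs
    maximum = (same_a + same_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single category
        return 1.0
    return (same_both - expected) / (maximum - expected)

# Hypothetical scaled intensities at common positions along the AP axis.
ish = [0.0, 0.1, 0.8, 1.0, 0.9, 0.2]       # in situ hybridization (expected values)
inferred = [0.0, 0.2, 0.7, 1.0, 0.8, 0.1]  # inferred expression pattern
st = [0.3, 0.0, 0.6, 0.4, 1.0, 0.5]        # ST expression pattern

THRESHOLD = 0.5  # assumed value for illustration only
for name, profile in (("inferred", inferred), ("ST", st)):
    ari = adjusted_rand_index(binarize(ish, THRESHOLD), binarize(profile, THRESHOLD))
    print(f"{name}: RMSE={rmse(ish, profile):.3f}, ARI={ari:.3f}")
```

      With these toy values the inferred profile has the lower RMSE and the higher ARI, matching the stated reading that higher ARI and lower RMSE indicate greater similarity.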

      1. The study's limitations surrounding ST data quality cannot be overstated. Discussing scenarios where only limited or poor-quality ST data are available will be crucial for guiding future studies. Furthermore, a clear explanation of how enhanced specificity and accuracy translate into tangible biological insights is essential for demystifying the underlying mechanisms driving developmental processes.

      Response: We thank the reviewer for raising this essential suggestion. We have realized that in our previous manuscript, our discussion on the advantages and limitations of Palette and zSTEP was neither broad nor detailed enough.

      Therefore, in our revised manuscript, we have added the following paragraphs to further discuss the advantages and limitations of Palette and zSTEP, as well as the potential application of zSTEP in developmental biology.

      In this section, we have again emphasized the impact of ST data quality on the performance of Palette and zSTEP, and then compared Palette with the strategy that uses well-established marker genes to infer spatial information. We demonstrated that although Palette cannot achieve single-cell resolution, it captures the major expression patterns, which are closely correlated with biological functions and critical for embryonic development. Furthermore, we discussed that zSTEP is not only a valuable tool for investigating gene expression patterns, but also has potential for evaluating the reaction-diffusion model to investigate the complicated and well-choreographed pattern formation during embryonic development.

      As we have now provided a more comprehensive discussion of Palette and zSTEP, we think that prospective users will better understand the application scenarios of our inference pipeline and our datasets. We hope our study can assist and inspire further research in the field of spatial transcriptomics and developmental biology.

      "Thirdly, the performance of Palette and zSTEP heavily relied on the quality of ST data. If the quality of ST data is not of sufficient quality, the low-expression genes may not be detected or only appear in very few scattered spots, and the performance of spot clustering could also be affected. Moreover, in this study, for example, the Stereo-seq data of 12 hpf zebrafish embryo had fewer slices on the right side (Fig. S3b), resulting in more blank spots in the right part of zSTEP for the 12 hpf embryo. However, with the ongoing advancements in spatial resolution and data quality, the performance of Palette is expected to be enhanced and demonstrate even greater potential for analysing spatiotemporal gene expression.

      On the other hand, compared to the brilliant strategy that infers spatial information of scRNA-seq data from well-established genes, our Palette pipeline cannot achieve single cell resolution. However, our Palette pipeline is based on the ST reference, and thus preserves the real positional relationships between spots. Furthermore, the focus of our pipeline is to infer the gene expression patterns, which are closely correlated to biological functions and critical for embryonic development, rather than the sparse expression within individual spots. In this regard, our Palette pipeline can be advantageous, as it allows for reconstruction of the major expression profiles, which are often more relevant for understanding developmental processes. Additionally, our Palette can be applied to serial sections, enabling the construction of 3D ST atlas.

      Finally, while the current analyses demonstrated that zSTEP can serve as a valuable tool for identifying genes having specific patterns at certain developmental stages, the exploration of zSTEP is still limited. During animal development, pattern formation is always one of the most important developmental issues. As demonstrated by the reaction-diffusion (RD) model, morphogen molecules are produced at specific regions of the embryo, forming morphogen gradients to guide cell specification, while interactions between different morphogens instruct more complicated and well-choreographed pattern formation. Our Palette constructed zSTEP, as a comprehensive transcriptomic expression pattern during development, could be leveraged to evaluate and prove the RD model during development, including AP patterning. Moreover, the investigation of gene expression patterns should not be limited to morphogens and TFs, and further investigation of their roles in AP patterning is desirable. Additionally, here a random forest model may be sufficient for investigating the most essential morphogens and TFs for AP axis refinement, while more sophisticated machine learning models may be required for addressing more specific biological questions."

      Reviewer #2 (Significance):

      The Palette pipeline demonstrates a marked improvement in specificity and accuracy when predicting spatial gene expression patterns. Evaluative studies on Drosophila and zebrafish datasets affirm its enhanced performance compared to existing methodologies. By effectively reconstructing spatial information from bulk transcriptomic data, the Palette method innovatively merges the philosophy of leveraging single-cell transcriptomic data for deconvolution analyses. This integration is pivotal, advancing traditional bulk RNA-seq approaches while laying the groundwork for future research.

      One of the notable achievements in this work is the construction of the DreSTEP atlas, which integrates serial bulk RNA-seq data with advanced 3D imaging techniques. This resource grants researchers unprecedented access to the visualization of gene expression patterns across the zebrafish embryo, facilitating the investigation of spatial relationships and cell-cell interactions critical for developmental processes. Such capabilities are invaluable for understanding the intricate dynamics of embryogenesis and the distinct roles of individual cell types.

      Response: We thank the reviewer for the positive evaluation of our work, both the Palette pipeline and zSTEP. The reviewer has strong expertise in algorithm development and computational biology, and the reviewer's concerns and suggestions are very valuable to us. We did not have extensive experience in bioinformatics tool development, and thus we did not thoroughly address the selection criteria or clarify the parameters used in the pipeline, which may affect its use by other researchers. Therefore, we sincerely appreciate the reviewer's professional suggestions, which we can follow to address these issues, improve our manuscript and make our work more impactful for researchers. Additionally, we did not consider computational efficiency, processing speed and computational demands, which would be important factors for other researchers using Palette. We would like to add extra analyses to address these aspects.

      Currently, based on the reviewer's suggestions, we have added text discussing the clustering strategy in the Palette pipeline, the advantages and limitations of Palette, and the potential application of zSTEP in developmental biology. We believe that readers will now have a clearer understanding of the performance of Palette and the application scenarios of both Palette and zSTEP. We have not yet fully addressed the reviewer's comments; we are working on the planned additional analyses and expect to complete them within the next 1-2 months. We sincerely thank the reviewer for the professional and valuable suggestions, which have improved our work and will make it accessible to a wide range of researchers.

      Finally, through this review process, we have learned a lot about the important considerations and requirements when designing bioinformatics tools, and we benefit a lot from the thoughtful guidance. We express our thanks to the reviewer again for the guidance, and we will try our best to address the remaining issues to further improve our manuscript.

      Reviewer #3 (Evidence, reproducibility and clarity):

      Evidence, reproducibility and clarity

      In this study, Dong and colleagues developed a computational pipeline to use spatial transcriptomics (ST) datasets as a reference to infer the spatial patterns of gene expression from bulk RNA sequencing data. This approach aims to overcome the low read depth and limited gene detection capabilities in current ST datasets, while exploiting its ability to provide highly resolved spatial information. By combining bulk RNA-seq datasets from 3 developmental stages during early zebrafish development with previously available ST and imaging datasets, the authors build DreSTEP (Danio rerio spatiotemporal expression profiles). Using this approach, they go on to identify the morphogens and transcription factors involved in anteroposterior patterning.

      The paper is well written, and the pipeline presented in this study is likely to be useful beyond the case studies included in this study. There are a few questions that, in my view, would be important to clarify to increase the impact of this work:

      Response: We sincerely appreciate the reviewer's positive feedback on the Palette pipeline and the zebrafish spatiotemporal expression profiles, zSTEP. We thank the reviewer for the constructive suggestions, which have inspired us to think deeply about the applications and advantages of Palette and zSTEP for future studies.

      We fully agree with the reviewer that we did not sufficiently clarify the advantages and limitations of our inference pipeline in the original manuscript. The questions raised by the reviewer are very insightful. For example, while the inferred expression patterns may closely resemble the in situ hybridization observations, which we consider good performance, the reviewer pointed out that we should consider whether weak, yet real, expression may have been removed. These questions have motivated us to think more deeply about the underlying principles and assumptions of our inference pipeline. Following the reviewer's questions, we have expanded our discussion of the application of zSTEP in developmental biology and the features of Palette compared to existing strategies.

      We believe that, with these revisions, our manuscript now demonstrates the application scenarios of Palette more clearly and suggests how zSTEP can be applied to investigate biological questions in developmental biology. We are grateful for the reviewer's guidance, which has helped us increase the impact of our work.

      1. The authors mention that they used a variable factor to adjust expression differences between the ST and bulk RNA-seq datasets. It would be important for the authors to comment on how much overlap in gene expression is necessary between the datasets for an accurate calculation of this variable factor? Can this be directly tested, for instance, by testing how their conclusions vary if expression is adjusted by a variable factor calculated from only a smaller set of genes?

      Response: We thank the reviewer for the professional questions. We apologize for not including the gene numbers in our previous manuscript. We have now provided the numbers of genes in the bulk and ST data and the numbers of overlapped genes (Fig. 15).

      "Palette was performed on the aligned slices using the overlapped genes. For the 10 hpf embryo, there were 24,658 genes in the bulk data, 18,698 genes in the Stereo-seq data, and 16,601 overlapped genes. For the 12 hpf embryo, there were 23,018 genes in the bulk data, 18,948 genes in the Stereo-seq data, and 16,401 overlapped genes. For the 16 hpf embryo, there were 24,357 genes in the bulk data, 23,110 genes in the Stereo-seq data, and 19,539 overlapped genes."

      See response letter for the figure.

      For the Palette implementation, we used all the overlapped genes. To calculate the variable factor, we aggregated the expression of each gene across the ST data, and then took the ratio between the bulk expression and the aggregated ST expression of that gene. As a result, each overlapped gene was assigned a variable factor to adjust its expression, based on its difference between the bulk and ST data. The rationale behind this approach is that by considering the ST data as a whole, we can effectively reduce the variation among individual spots. This allows the variable factors to provide a reasonable adjustment to gene expression.

      In summary, the variable factors can be calculated directly. Currently, Palette can only infer the expression patterns of overlapped genes. This means that when the number of overlapped genes is small, as with MERFISH, which detects only hundreds of genes, Palette can only infer the expression patterns of those genes. However, if the MERFISH data are of good quality and can resolve distinct cell types, we believe Palette will also perform well when using MERFISH as the ST reference. Additionally, we plan to run Palette using MERFISH as the ST reference to further demonstrate its broad applicability across different ST references.
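      The per-gene variable factor described above can be sketched as follows. This is a minimal illustration and not the Palette implementation. It assumes the factor is the bulk expression divided by the ST expression summed over all spots, which is one reading of the description. The function name `variable_factors`, the gene names and the values are hypothetical.

```python
def variable_factors(bulk, st_spots):
    """Return one factor per overlapped gene.

    bulk:     dict mapping gene -> bulk expression
    st_spots: list of dicts, one per spot, mapping gene -> expression
    Assumption: factor = bulk expression / ST expression summed over all spots.
    """
    st_total = {}
    for spot in st_spots:
        for gene, value in spot.items():
            st_total[gene] = st_total.get(gene, 0.0) + value
    # Only genes present in both datasets can be adjusted.
    overlapped = sorted(set(bulk) & set(st_total))
    # Skip genes with zero aggregated ST expression to avoid division by zero.
    return {g: bulk[g] / st_total[g] for g in overlapped if st_total[g] > 0}

# Hypothetical toy data: geneC is bulk-only and geneD is ST-only.
bulk = {"geneA": 100.0, "geneB": 30.0, "geneC": 5.0}
st_spots = [
    {"geneA": 4.0, "geneB": 1.0},
    {"geneA": 6.0, "geneB": 2.0, "geneD": 3.0},
]

factors = variable_factors(bulk, st_spots)
print(factors)
```

      Because the factor is computed from the ST data as a whole, a single noisy spot has little influence on it, which is the rationale given above. Genes found in only one dataset receive no factor, consistent with Palette inferring patterns only for overlapped genes.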

      1. Palette gives rise to highly spatially precise patterns, which closely match those found in ISH. However, the smoothening of the expression can also remove weak, yet real, local expression patterns, as shown for idgf6 in Fig. 2a. Can the authors test this more extensively for other genes?

      Response: We thank the reviewer for this essential question. We agree with the reviewer that weak, yet real, expression might be removed by our Palette inference pipeline. Weak, sparse expression may be due to the ST technique itself or to variation between samples. However, such sparse gene expression may not have biological meaning, and the focus of our pipeline is to capture the expression patterns, which are closely correlated with function and crucial for embryonic development. Therefore, our algorithm considers spot characteristics and emphasizes cluster-specific expression, resulting in spatially specific expression patterns. In most cases, the main gene expression patterns can be captured, which can help in understanding gene functions and roles in embryonic development. We have updated Supplementary Figure S1a (Fig. 16) to include more gene patterns to demonstrate this point.

      See response letter for the figure.

      1. Using adjacent slices for ST and "bulk RNA-seq" may provide better results than those obtained when comparing two independent datasets. Could the authors also extend the analysis of Palette's functionalities by using separate, previously available but independent datasets, for ST and bulk RNA-seq in Drosophila as well?

      Response: We thank the reviewer for the valuable question. We agree with the reviewer that using adjacent slices may provide better results. The idea here is that the spatial expression patterns inferred from pseudo-bulk RNA-seq can be compared with the real ST expression to evaluate Palette's performance. We have updated Figure 2a (Fig. 17) to illustrate the analysis more clearly.

      See response letter for the figure.

      To demonstrate Palette's functionality, we used Palette to infer expression patterns for a zebrafish bulk RNA-seq slice (Junker et al., 2014) using a Stereo-seq slice (Liu et al., 2022) as the ST reference; these two datasets are separate and independent. We agree with the reviewer that it would be good to test separate datasets in Drosophila to further demonstrate Palette's functionality. Unfortunately, we did not find Drosophila serial bulk RNA-seq data along the left-right axis for the corresponding stages, and thus we may be unable to perform the extra analyses using independent Drosophila datasets.

      References:

      Junker, J.P. et al. Genome-wide RNA Tomography in the zebrafish embryo. Cell 159, 662-675 (2014).

      Liu, C. et al. Spatiotemporal mapping of gene expression landscapes and developmental trajectories during zebrafish embryogenesis. Dev Cell 57, 1284-1298 e1285 (2022).

      1. The DreSTEP analysis in zebrafish embryos is interesting and validates well-established observations in the field. Can the authors also discuss whether and how their dataset allows them to refine our understanding of the spatial or temporal pattern of the morphogens and TFs involved in AP patterning? This would further validate their approach.

      Response: We thank the reviewer for recognizing zSTEP and raising this valuable question, which has inspired us to think more deeply about the potential application of zSTEP in developmental biology. As the reviewer noted, our zSTEP analyses have validated well-established observations in the field. Rather than focusing on the sparse expression detected in ST data, zSTEP emphasizes the gene expression patterns that are closely correlated with biological functions and critical for embryonic development. Therefore, zSTEP can serve as a valuable tool for identifying genes with specific patterns at certain developmental stages.

      Pattern formation is one of the most important developmental issues for all animals. The reaction-diffusion (RD) model is a widely recognized theoretical framework used to explain self-regulated pattern formation in developing animal embryos (Kondo & Miura, 2010). Morphogen molecules are produced at specific regions of the embryo, forming morphogen gradients to guide cell specification. Most importantly, interactions between different morphogens instruct more complicated and well-choreographed pattern formation. Our Palette-constructed zSTEP provides a comprehensive transcriptomic expression pattern, including all morphogens and TFs, across the whole embryo during development. These valuable resources, in our opinion, could be leveraged to evaluate and test the RD model during development, including AP patterning. In our current zSTEP analyses, we have already identified genes that exhibit specific expression patterns along the AP axis, some of which have not been fully characterized. These genes could be potential targets for further investigation into their roles in AP patterning, although they are not the primary focus of this study. Additionally, our analyses focused only on morphogens and TFs, but zSTEP can be used to investigate the expression patterns of other genes as well. Moreover, we employed a random forest model to investigate the most essential morphogens and TFs for AP axis refinement, which is one of the basic applications of zSTEP. To investigate specific biological questions of interest, it would be worth exploring the use of more sophisticated machine learning models.

      We have added the following paragraph in the Discussion section to discuss the potential application of zSTEP in future studies.

      "Finally, while the current analyses demonstrated that zSTEP can serve as a valuable tool for identifying genes having specific patterns at certain developmental stages, the exploration of zSTEP is still limited. During animal development, pattern formation is always one of the most important developmental issues. As demonstrated by the reaction-diffusion (RD) model, morphogen molecules are produced at specific regions of the embryo, forming morphogen gradients to guide cell specification, while interactions between different morphogens instruct more complicated and well-choreographed pattern formation. Our Palette constructed zSTEP, as a comprehensive transcriptomic expression pattern during development, could be leveraged to evaluate and prove the RD model during development, including AP patterning. Moreover, the investigation of gene expression patterns should not be limited to morphogens and TFs, and further investigation of their roles in AP patterning is desirable. Additionally, here a random forest model may be sufficient for investigating the most essential morphogens and TFs for AP axis refinement, while more sophisticated machine learning models may be required for addressing more specific biological questions."

      Reference

      Kondo, S. & Miura, T. Reaction-Diffusion model as a framework for understanding biological pattern formation. Science 329, 1616-1620 (2010).

      1. Can the authors comment on the limits of this inference pipeline? And how it performs as compared to single-cell RNA sequencing datasets where spatial information is inferred from well-established marker genes?

      Response: We thank the reviewer for this insightful question, which has inspired us to further explore the advantages and limitations of the Palette pipeline in comparison with other inference strategies. As mentioned in the Discussion section, a key limitation of the inference pipeline is its heavy reliance on the quality of the ST data. If the ST data are not of sufficient quality, low-expression genes may not be detected or may appear in only a few scattered spots. We think this is a common issue for any inference tool that uses ST data as the reference. However, with ongoing advances in spatial resolution and data quality, the performance of Palette is expected to improve.

      By comparison, single-cell RNA sequencing datasets in which spatial information is inferred from well-established marker genes do not face this limitation. The ground-breaking work by Satija et al. (2015) used such a strategy, combining scRNA-seq with in situ hybridizations of well-established marker genes to infer spatial location. This enables single-cell resolution, as it maintains high read depth and gene detection. One advantage of this scRNA-seq-based strategy is that it provides the transcriptomes of individual cells, rather than a combination of cells within an ST spot, although the positional relationships between cells are not real.

      However, compared to inference from ST data, the positional relationships between cells are not directly captured. In addition, as embryonic development progresses, more cell types are specified and body patterning becomes more complex. In this scenario, using well-established marker genes to infer spatial information would be much more challenging. Moreover, there are few scRNA-seq datasets of serial sections, and thus this strategy may not be suitable for constructing a 3D ST atlas.

      In contrast, our Palette inference pipeline is based on the ST data, which preserves the real positional relationships between spots. Although our inference pipeline cannot achieve single-cell resolution, it focuses on gene expression patterns rather than the sparse expression within individual spots. By applying Palette to paired serial sections, we were able to generate a 3D spatial expression atlas of zebrafish embryos, which has shown promising performance for investigating gene expression patterns and their involvement in AP patterning.

      Reference

      Satija, R. et al. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33, 495-502 (2015).

      We have updated the following paragraphs in the Discussion section to describe the limitations of the inference pipeline in more detail.

      "Thirdly, the performance of Palette and zSTEP heavily relied on the quality of ST data. If the quality of ST data is not of sufficient quality, the low-expression genes may not be detected or only appear in very few scattered spots, and the performance of spot clustering could also be affected. Moreover, in this study, for example, the Stereo-seq data of 12 hpf zebrafish embryo had fewer slices on the right side (Fig. S3b), resulting in more blank spots in the right part of zSTEP for the 12 hpf embryo. However, with the ongoing advancements in spatial resolution and data quality, the performance of Palette is expected to be enhanced and demonstrate even greater potential for analysing spatiotemporal gene expression.

      On the other hand, compared to the brilliant strategy that infers spatial information of scRNA-seq data from well-established genes, our Palette pipeline cannot achieve single cell resolution. However, our Palette pipeline is based on the ST reference, and thus preserves the real positional relationships between spots. Furthermore, the focus of our pipeline is to infer the gene expression patterns, which are closely correlated to biological functions and critical for embryonic development, rather than the sparse expression within individual spots. In this regard, our Palette pipeline can be advantageous, as it allows for reconstruction of the major expression profiles, which are often more relevant for understanding developmental processes. Additionally, our Palette can be applied to serial sections, enabling the construction of 3D ST atlas."

      Reviewer #3 (Significance):

      This study tackles an important challenge in biology - the difficulty of resolving gene expression patterns with high spatial precision and in a high-throughput manner. By integrating sequencing datasets from previously published studies, as well as newly-generated datasets, the authors provide evidence that their novel inference pipeline enables them to obtain high-quality spatial information simply from bulk RNA-seq datasets, using ST as a reference. The development of this pipeline - Palette - is a major part of this manuscript and its applicability is validated using datasets from Drosophila and zebrafish embryos. This is an important advance for the field, but it would be nice for the authors to further comment on i) the validity of some of their approaches and how they may influence the quality of their inference, as well as, ii) potential pitfalls/limitations of this approach as compared to others available in the field. This would synthesize both previous and current findings into a conceptual and technological framework that would have a strong impact well beyond cell and developmental biology.

      Audience: This study would be relevant for a broad audience of biologists, interested in morphogen signaling, gene regulatory networks and cell fate specification.

      Expertise in zebrafish development, gastrulation, morphogen signaling and morphogenesis.

      Response: We thank the reviewer for the positive feedback and for raising these valuable questions, which have motivated us to consider more deeply the design concept and further application of Palette and zSTEP. Based on the reviewer's insightful questions, we have added two extra paragraphs to the Discussion section to further discuss the potential application of zSTEP in developmental biology and the application scenarios of the Palette pipeline. Specifically, we have shown that the performance of the inference pipeline relies on the spatial resolution and data quality of the ST data. We have then compared the advantages and limitations of Palette with the existing spatial inference strategy, which infers spatial information for scRNA-seq data from well-established marker genes. Although our inference pipeline cannot achieve single-cell resolution, it can capture the major expression patterns, which are closely correlated with function and critical for embryonic development. We believe this will help readers gain a clearer understanding of the advantages and limitations of our pipeline compared to other tools, as well as the tasks for which Palette and our constructed zSTEP can be used. We thank the reviewer again for the valuable comments.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We would like to thank the reviewers for their efforts and feedback on our preprint. We have elected to rework the manuscript for publication in a different journal. In this process we will alter many of the approaches and re-evaluate the conclusions. As a result, many of the points raised by the reviewers will no longer be relevant and therefore do not require a response. Again, we thank the reviewers for their time and helpful feedback.


      The following is the authors’ response to the original reviews.

      eLife Assessment:

      The authors present a potentially useful approach of broad interest arguing that anterior cingulate cortex (ACC) tracks option values in decisions involving delayed rewards. The authors introduce the idea of a resource-based cognitive effort signal in ACC ensembles and link ACC theta oscillations to a resistance-based strategy. The evidence supporting these new ideas is incomplete and would benefit from additional detail and more rigorous analyses and computational methods.

      We are extremely grateful for the reviewers' many excellent comments. To address these concerns, we have completely reworked the manuscript, adding more rigorous approaches in each phase of the analysis and computational model. We realize that the revision has taken some time to prepare. However, given the comments of the reviewers, we felt it necessary to thoroughly rework the paper based on their input. Here is a (non-exhaustive) overview of the major changes we made:

      We have developed a way to more adequately capture the heterogeneity in the behavior

      We have completely reworked the RL model

      We have added additional approaches and rigor to the analysis of the value-tracking signal. 

      Reviewer #1 (Public Review):

      Summary:

      Young (2.5 mo [adolescent]) rats were tasked to either press one lever for immediate reward or another for delayed reward. 

      Please note that at the time of testing and training the rats were > 4 months old.

      The task had a complex structure in which (1) the number of pellets provided on the immediate reward lever changed as a function of the decisions made, (2) rats were prevented from pressing the same lever three times in a row. Importantly, this task is very different from most intertemporal choice tasks which adjust delay (to the delayed lever), whereas this task held the delay constant and adjusted the number of 20 mg sucrose pellets provided on the immediate value lever.

      Several studies parametrically vary the immediate lever (PMID: 39119916, 31654652, 28000083, 26779747, 12270518, 19389183). While most versions of the task will yield qualitatively similar estimates of discounting, the adjusting-amount procedure is preferred as it provides the most consistent estimates (PMID: 22445576). More specifically, this version of the task avoids the contrast effects that result from changing the delay during the session (PMID: 23963529, 24780379, 19730365, 35661751), which complicate value estimates.

      Analyses are based on separating sessions into groups, but group membership includes arbitrary requirements and many sessions have been dropped from the analyses. 

      We have updated this approach and now provide a more comprehensive assessment of the behavior. The updated approach applies a hierarchical clustering model to the behavior in each session. This was applied at each delay to separate animals that prefer the immediate option more/less. This results in 4 statistically dissociable groups (4LO, 4HI, 8LO, 8HI) and includes all sessions. Please see Figure 1. 
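
      The grouping step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the manuscript: the per-session feature (fraction of free choices made on the immediate lever), the average-linkage rule, the two-cluster cut, and all names and numbers are assumptions made for the example.

```python
def agglomerate(values, n_clusters=2):
    """Average-linkage agglomerative (hierarchical) clustering of 1-D values.

    Returns a list of clusters, each a list of indices into `values`.
    """
    clusters = [[i] for i in range(len(values))]

    def dist(a, b):
        # Average pairwise distance between the members of two clusters
        return sum(abs(values[i] - values[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters
        pairs = [(dist(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def label_sessions(immediate_fraction):
    """Label each session 'LO' or 'HI' by its preference for the immediate lever."""
    clusters = agglomerate(immediate_fraction, n_clusters=2)
    means = [sum(immediate_fraction[i] for i in c) / len(c) for c in clusters]
    hi = means.index(max(means))
    labels = [None] * len(immediate_fraction)
    for idx, members in enumerate(clusters):
        for i in members:
            labels[i] = "HI" if idx == hi else "LO"
    return labels


# Made-up fractions of immediate choices for six sessions at one delay
sessions_4s = [0.10, 0.15, 0.12, 0.55, 0.60, 0.58]
labels_4s = label_sessions(sessions_4s)
```

      Running the same function separately on the sessions at each delay would give four labels in total (for example 4LO, 4HI, 8LO, 8HI) while keeping every session in the analysis.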

      Computational modeling is based on an overly simple reinforcement learning model, as evidenced by fit parameters pegging to the extremes. 

      We have completely reworked the simulations in the revision. In the updated RL model we carefully add parameters to determine which are necessary to explain the experimental data. We feel that it is simplified yet more descriptive. Please see Figure 2 and associated text. 

      The neural analysis is overly complex and does not contain the necessary statistics to assess the validity of their claims.

      We have dramatically streamlined the spike train analysis approach and added several statistical tests to ensure the rigor of our results. Please see Figures 4,5,6 and associated text. 

      Strengths:

      The task is interesting.

      Thank you for the positive comment.

      Weaknesses:

      Behavior:

      The basic behavioral results from this task are not presented. For example, "each recording session consisted of 40 choice trials or 45 minutes". What was the distribution of choices over sessions? Did that change between rats? Did that change between delays? Were there any sequence effects? (I recommend looking at reaction times.) Were there any effects of pressing a lever twice vs after a forced trial? 

      Please see the updated statistics and panels in Figures 1 and 2. We believe these address this valid concern.  

      This task has a very complicated sequential structure that I think I would be hard pressed to follow if I were performing this task. 

      Human tasks implement a similar task structure (PMID: 26779747). Please note the response above that outlines the benefits of using this task.

      Before diving into the complex analyses assuming reinforcement learning paradigms or cognitive control, I would have liked to have understood the basic behaviors the rats were taking. For example, what was the typical rate of lever pressing? If the rats are pressing 40 times in 45 minutes, does waiting 8s make a large difference?

      Thank you for this suggestion. Our additions to Figure 1 are intended to better explain and quantify the behavior of the animals. Note that this task is designed to hold the rate of reinforcement constant no matter the choices of the animals. Our analysis supports the long-held view in the literature that rats do not like waiting for rewards, even at small delays. Going from the 4 → 8 sec delay results in significantly more immediate choices, indicating that the rats will forgo waiting 8 sec for a larger reinforcer and take a smaller reinforcer at 4 sec.
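
      The shift toward immediate choices at the longer delay is what standard discounting functions predict. The sketch below uses the textbook hyperbolic and exponential forms; the delayed reward size (6 pellets) and the discount rate k are hypothetical values chosen only to make the arithmetic easy to follow, and neither is taken from the manuscript.

```python
import math


def hyperbolic_value(amount, delay_s, k):
    """Hyperbolic discounting: V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay_s)


def exponential_value(amount, delay_s, k):
    """Exponential discounting: V = A * exp(-k * D)."""
    return amount * math.exp(-k * delay_s)


# Hypothetical numbers: 6 pellets on the delayed lever, k = 0.25 per second
DELAYED_PELLETS = 6
K = 0.25

for delay in (4, 8):
    v = hyperbolic_value(DELAYED_PELLETS, delay, K)
    print(f"{delay} s delay: delayed option worth about {v:.1f} immediate pellets")
```

      Under these assumed numbers the immediate amount at which the two levers are equally valued falls from 3 pellets to 2 pellets when the delay doubles, so a wider range of immediate amounts beats the delayed option at 8 sec.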

      For that matter, the reaction time from lever appearance to lever pressing would be very interesting (and important). Are they making a choice as soon as the levers appear? Are they leaning towards the delay side, but then give in and choose the immediate lever? What are the reaction time hazard distributions?

      This is an excellent suggestion, and we have added a brief analysis of reaction times (please see the section entitled “4 behavioral groups are observed across all sessions” in the Results). Please note that an analysis of the reaction times has been presented in a prior analysis of this data set (White et al., 2024). In addition, an analysis of reaction times in this task was performed in Linsenbardt et al. (2017). In short, animals tend to choose within 1 second of the lever appearing. In addition, our prior work shows that responses on the immediate lever tend to be slower, which we viewed as evidence of increased deliberation requirements (possibly required to integrate value signals).

      It is not clear that the animals on this task were actually using cognitive control strategies on this task. One cannot assume from the task that cognitive control is key. The authors only consider a very limited number of potential behaviors (an overly simple RL model). On this task, there are a lot of potential behavioral strategies: "win-stay/lose-shift", "perseveration", "alternation", even "random choices" should be considered.

      The strategies the Reviewer mentioned are descriptors of the actual choices the rats made. For example, perseveration means the rat is choosing one of the levers at an excessively high rate whereas alternation means it is choosing the two levers more or less equally, independent of payouts. But the question we are interested in is why? We are arguing that the type of cognitive control determines the choice behavior, but cognitive control is an internal variable that guides behavior, rather than simply a descriptor of the behavior. For example, the animal opts to perseverate on the delayed lever because the cognitive control required to track ival is too high. We then searched the neural data for signatures of the two types of cognitive control.

      The delay lever was assigned to the "non-preferred side". How did side bias affect the decisions made?

      The side bias clearly does not impact performance as the animals prefer the delay lever at shorter delays, which works against this bias.  

      The analyses based on "group" are unjustified. The authors compare the proportion of delayed to immediate lever press choices on the non-forced trials and then did k-means clustering on this distribution. But the distribution itself was not shown, so it is unclear whether the "groups" were actually different. They used k=3, but do not describe how this arbitrary number was chosen. (Is 3 the optimal number of clusters to describe this distribution?) Moreover, they removed three group 1 sessions with an 8s delay and two group 2 sessions with a 4s delay, making all the group 1 sessions 4s delay sessions and all group 2 sessions 8s delay sessions. They then ignore group 3 completely. These analyses seem arbitrary and unnecessarily complex. I think they need to analyze the data by delay. (How do rats handle 4s delay sessions? How do rats handle 6s delay sessions? How do rats handle 8s delay sessions?). If they decide to analyze the data by strategy, then they should identify specific strategies, model those strategies, and do model comparison to identify the best explanatory strategy. Importantly, the groups were session-based, not rat based, suggesting that rats used different strategies based on the delay to the delayed lever.

      We have completely reworked our approach for capturing the heterogeneity in behavior. We have taken care to show more of the behavioral statistics that have gone into identifying each of the groups. All sessions are included in this analysis. As the reviewer suggests, we used the statistics from each of the behavioral groups to inform the RL model that explores neural signals that underlie decisions in this task. We strongly disagree that groups should be rat-based rather than session-based, as the behavior of the animal can, and does, change from day to day. This is important to consider when analyzing the neural data, as rat-based groupings would ignore this potential source of variance.

      The reinforcement learning model used was overly simple. In particular, the RL model assumes that the subjects understand the task structure, but we know that even humans have trouble following complex task structures. Moreover, we know that rodent decision-making depends on much more complex strategies (model-based decisions, multi-state decisions, rate-based decisions, etc). There are lots of other ways to encode these decision variables, such as softmax with an inverse temperature rather than epsilon-greedy. The RL model was stated as a given and not justified. As one critical example, the RL model fit to the data assumed a constant exponential discounting function, but it is well-established that all animals, including rodents, use hyperbolic discounting in intertemporal choice tasks. Presumably this changes dramatically the effect of 4s and 8s. As evidence that the RL model is incomplete, the parameters found for the two groups were extreme. (Alpha=1 implies no history and only reacting to the most recent event. Epsilon=0.4 in an epsilon-greedy algorithm is a 40% chance of responding randomly.)

      While we agree that the approach was not fully justified, we do not agree that it was invalid. Simply stated, a softmax approach gives the best fit to the choice behavior, whereas our epsilon-greedy approach attempted to reproduce the choice behavior using a naïve agent that progressively learns the values of the two levers on a choice-by-choice basis. Nevertheless, we certainly appreciate that important insights can be gained by fitting a model to the data as suggested. We feel that the new modeling approach we have now implemented is optimal for the present purposes and it replaces the one used in the original manuscript.
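
      For readers less familiar with the two action-selection rules contrasted in this exchange, the sketch below gives their generic textbook forms together with a simple delta-rule value update. It is not the model used in either version of the manuscript; the function names and parameter values are assumptions for illustration only.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon choose uniformly at random, otherwise the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, beta):
    """Choice probabilities under softmax; beta is the inverse temperature."""
    m = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def delta_update(q, reward, alpha):
    """Move the value estimate toward the latest reward by a fraction alpha."""
    return q + alpha * (reward - q)


# With alpha = 1 the estimate is replaced by the most recent outcome (no history)
q_after = delta_update(2.0, 6.0, alpha=1.0)

# Softmax grades choice probability by the value difference between the levers
p_immediate, p_delayed = softmax_probs([2.0, 3.0], beta=1.0)
```

      The contrast relevant to the reviewer's point is that epsilon-greedy spreads its exploration evenly regardless of how different the two values are, whereas softmax makes choices more random only when the values are close.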

      The authors do add a "dbias" (which is a preference for the delayed lever) term to the RL model, but note that it has to be maximal in the 4s condition to reproduce group 2 behavior, which means they are not doing reinforcement learning anymore, just choosing the delayed lever.

      The dbias term was dropped in the new model implementation.

      Neurophysiology:

      The neurophysiology figures are unclear and mostly uninterpretable; they do not show variability, statistics or conclusive results.

      While the reviewer is justified in criticizing the clarity of the figures, the statement that “they do not show variability, statistics or conclusive results” is not correct. Each of the figures presented in the first draft of the manuscript, except Figure 3, is accompanied by statistics and measures of variability. Nonetheless, we have updated each of the neurophysiology analyses. We hope that the reviewer will find our updates more rigorous and thorough.

      As with the behavior, I would have liked to have seen more traditional neurophysiological analyses first. What do the cells respond to? How do the manifolds change aligned to the lever presses? Are those different between lever presses?

      We have added several figures that plot the mean +/- SEM of the neural activity (see Figures 4 and 5). Hopefully this provides a more intuitive picture of the changes in neural activity throughout the task.  

      Are there changes in cellular information (both at the individual and ensemble level) over time in the session? 

      We provide several analyses of how firing rate changes in relation to ival over time and across trials in the session. In addition, we describe how these signals change in each of the behavioral groups.

      How do cellular responses differ during that delay while both levers are out, but the rats are not choosing the immediate lever?

      We were somewhat unclear about this suggestion, as the delay follows the lever press. In addition, there is no delay after immediate presses.

      Figure 3, for example, claims that some of the principal components tracked the number of pellets on the immediate lever ("ival"), but they are just two curves. No statistics, controls, or justification for this is shown. BTW, on Figure 3, what is the event at 200s?

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      I'm confused. On Figure 4, the number of trials seems to go up to 50, but in the methods, they say that rats received 40 trials or 45 minutes of experience.

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      At the end of page 14, the authors state that the strength of the correlation did not differ by group and that this was "predicted" by the RL modeling, but this statement is nonsensical, given that the RL modeling did not fit the data well, depended on extreme values. Moreover, this claim is dependent on "not statistically detectable", which is, of course, not interpretable as "not different".

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      There is an interesting result on page 16 that the increases in theta power were observed before a delayed lever press but not an immediate lever press, and then that the theta power declined after an immediate lever press. 

      Thank you for the positive comment. 

      These data are separated by session group (again group 1 is a subset of the 4s sessions, group 2 is a subset of the 8s sessions, and group 3 is ignored). I would much rather see these data analyzed by delay itself or by some sort of strategy fit across delays.

      Thank you for the excellent suggestion. Our new group assignments take delay into account. 

      That being said, I don't see how this description shows up in Figure 6. What does Figure 6 look like if you just separate the sessions by delay?

      We are unclear what the reviewer means by “this description”.  

      Discussion:

      Finally, it is unclear to what extent this task actually gets at the questions originally laid out in the goals and returned to in the discussion. The idea of cognitive effort is interesting, but there is no data presented that this task is cognitive at all. The idea of a resourced cognitive effort and a resistance cognitive effort is interesting, but presumably the way one overcomes resistance is through resource-limited components, so it is unclear that these two cognitive effort strategies are different.

      The basis for the reviewer's assertion that “the way one overcomes resistance is through resource-limited components” is not clear. In the revised version, we have taken greater care to outline how each type of effort signal facilitates performance of the task and articulate these possibilities in our stochastic and RL models. We view the strong evidence for ival tracking presented herein as a critical component of resource-based cognitive effort.

      The authors state that "ival-tracking" (neurons and ensembles that presumably track the number of pellets being delivered on the immediate lever - a fancy name for "expectations") "taps into a resourced-based form of cognitive effort", but no evidence is actually provided that keeping track of the expectation of reward on the immediate lever depends on attention or mnemonic resources. They also state that a "dLP-biased strategy" (waiting out the delay) is a "resistance-based form of cognitive effort" but no evidence is made that going to the delayed side takes effort.

      We challenge the reviewer's assertion that ival tracking is a “fancy name for expectations”. We make no claim about the prospective or retrospective nature of the signal. Clearly, expectations should be prospective and therefore different from ival tracking. Regarding the resistance signal: first, animals avoid the delay lever more often at the 8 sec delay (Figure 1). We have shown that increasing the delay systematically biases responses AWAY from the delay (Linsenbardt et al., 2017). This is consistent with a well-developed literature showing that rats and mice do not like waiting for delayed reinforcers. We contend that enduring something you don’t like takes effort.

      The authors talk about theta synchrony, but never actually measure theta synchrony, particularly across structures such as amygdala or ventral hippocampus. The authors try to connect this to "the unpleasantness of the delay", but provide no measures of pleasantness or unpleasantness. They have no evidence that waiting out an 8s delay is unpleasant.

      We have added spike-field coherence to better connect with the literature on synchrony. Note that we never refer to our results as “synchrony”. However, we would be remiss not to address the growing literature on theta synchrony in effort allocation. There is a well-developed literature showing that rats and mice do not like waiting for delayed reinforcers. If waiting out the delay was not unpleasant, then why do the animals forgo larger rewards to avoid it?

      The authors hypothesize that the "ival-tracking signal" (the expectation of number of pellets on the immediate lever) "could simply reflect the emotional or autonomic response". Aside from the fact that no evidence for this is provided, if this were to be true, then, in what sense would any of these signals be related to cognitive control?

      This is proposed as an alternative explanation of the ival signal in the discussion. It was added as our due diligence. Emotional state could provide feedback to the currently implemented control mechanism. If waiting for reinforcement is too unpleasant, this could drive the animals to ival tracking and to choosing the immediate option more frequently. We provide this option only as a possibility, not a conclusion. We have clarified this in the revised text. Nevertheless, based on our review of the literature, autonomic tracking in some form seems to be the most likely function of ACC (Seamans & Floresco 2022). While the reviewer may disagree with this, we feel it is at least as valid as all the complex, cognitively-based interpretations that commonly appear in the literature.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript explores the neuronal signals that underlie resistance vs resource-based models of cognitive effort. The authors use a delayed discounting task and computational models to explore these ideas. The authors find that the ACC strongly tracks value and time, which is consistent with prior work. Novel contributions include quantification of a resource-based control signal among ACC ensembles, and linking ACC theta oscillations to a resistance-based strategy.

      Strengths:

      The experiments and analyses are well done and have the potential to generate an elegant explanatory framework for ACC neuronal activity. The inclusion of local-field potential / spike-field analyses is particularly important because these can be measured in humans.

      Thank you for the endorsement of our work.

      Weaknesses:

      I had questions that might help me understand the task and details of neuronal analyses.

      (1) The abstract, discussion, and introduction set up an opposition between resource and resistance-based forms of cognitive effort. It's clear that the authors find evidence for each (ACC ensembles = resource, theta=resistance?) but I'm not sure where the data fall on this dichotomy.

      (a) An overall very simple schematic early in the paper (prior to the MCML model? or even the behavior) may help illustrate the main point.

      (b) In the intro, results, and discussion, it may help to relate each point to this dichotomy.

      (c) What would resource-based signals look like? What would resistance based signals look like? Is the main point that resistance-based strategies dominate when delays are short, but resource-based strategies dominate when delays are long?

      (d) I wonder if these strategies can be illustrated? Could these two measures (dLP vs ival tracking) be plotted on separate axes or extremes, and behavior, neuronal data, LFP, and spectral relationships be shown on these axes? I think Figure 2 is working towards this. Could these be shown for each delay length? This way, as the evidence from behavior, model, single neurons, ensembles, and theta is presented, it can be related to this framework, and the reader can organize the findings.

      These are excellent suggestions, and we have implemented them, where possible. 

      (2) The task is not clear to me.

      (a) I wonder if a task schematic and a flow chart of training would help readers.

      Yes, excellent idea, we have now included this in Figure 1. 

      (b) This task appears to be relatively new. Has it been used before in rats (Oberlin and Grahame is a mouse study)? Some history / context might help orient readers.

      Indeed, this task has been used in rats in several prior studies. Please see the following references (PMID: 39119916, 31654652, 28000083, 26779747, 12270518, 19389183).

      (c) How many total sessions were completed with ascending delays? Was there criteria for surgeries? How many total recording sessions per animal (of the 54?)

      Please note that the delay does not change within a session. There were no criteria for surgery. 

      (d) How many trials completed per session (40 trials OR 45 minutes)? Where are there errors? These details are important for interpreting Figure 1.

      Every animal in this data set completed 40 trials, and we have updated the task description to clarify this issue. There are no errors in this task; rather, the task is designed to measure the tendency to make an impulsive choice (smaller reward now).

      (3) Figure 1 is unclear to me.

      (a) Delayed vs immediate lever presses are being plotted - but I am not sure what is red, and what is blue. I might suggest plotting each animal.

      We have updated Figure 1 considerably for clarity. 

      (b) How many animals and sessions go into each data point?

      We hope this is clarified now with our new group assignments as all sessions were included in the analysis. 

      (c) Table 1 (which might be better referenced in the paper) refers to rats by session. Is it true that some rats (2 and 8) were not analyzed for the bulk of the paper? Some rats appear to switch strategies, and some stay in one strategy. How many neurons come from each rat?

      We have updated Table 1 based on our new groupings. The rats that contribute the most sessions also tend to be represented across the behavioral groups; it is therefore unlikely that effort allocation strategies across groupings are an idiosyncratic feature of an animal.

      (d) Task basics - RT, choice, accuracy, video stills - might help readers understand what is going into these plots

      (e) Does the animal move differently (i.e., RTs) in G1 vs. G2?

      Excellent suggestion. We have added more analysis of the task variables in the revision (e.g. RT, choice comparisons across delays, etc.).

      (4) I wasn't sure how clustered G1 vs. G2 vs G3 are. To make this argument, the raw data (or some axis of it) might help.

      (a) This is particularly important because G3 appears to be a mix of G1 and G2, although upon inspection, I'm not sure how different they really are

      (b) Was there some objective clustering criteria that defined the clusters?

      (c) Why discuss G3 at all? Can these sessions be removed from analysis?

      Based on our updates to the behavioral analysis these comments are no longer relevant. 

      (5) The same applies to neuronal analyses in Fig 3 and 4

      (a) What does a single neuron peri-event raster look like? I would include several of these.

      (b) What does PC1, 2 and 3 look like for G1, G2, and G3?

      (c) Certain PCs are selected, but I'm not sure how they were selected - was there a criteria used? How was the correlation between PCA and ival selected? What about PCs that don't correlate with ival?

      (d) If the authors are using PCA, then scree plots and PETHs might be useful, as well as comparisons to PCs from time-shuffled / randomized data.

      We hope that our reworking of the neural data analysis has clarified these issues. We now include several firing rate examples and aggregate data.   

      (6) I had questions about the spectral analysis

      (a) Theta has many definitions - why did the authors use 6-12 Hz? Does it come from the hippocampal literature, and is this the best definition of theta? What about other bands: delta (1-4 Hz), theta (4-7 Hz), and beta (13-30 Hz)? These bands are of particular importance because they have been associated with errors, dopamine, and are abnormal in schizophrenia and Parkinson's disease.

      This designation comes mainly from the hippocampal and ACC literature in rodents. In addition, this range best captured the peak in the power spectrum in our data. Note that we focus our analysis on theta given the literature regarding theta in the ACC as a correlate of cognitive control (references in manuscript). We did interrogate other bands as a sanity check, and the results were mostly limited to theta. Given the scope of our manuscript and the concerns raised regarding complexity, we are concerned that adding frequency analyses beyond theta would obscure the take-home message.

      However, the spectrograms in Figure 3 show a range of frequencies and highlight the ones in the theta band as the most dynamic prior to the choice. 

      (b) Power spectra and time-frequency analyses may justify the authors' focus. I would show these (y-axis - frequency, x-axis - time, z-axis - power).

      Thank you for the suggestion. We have added this to Figure 3.    

      (7) PC3 as an autocorrelation doesn't seem to be the right way to infer theta entrainment or spike-field relationships, as PCA can be vulnerable to phantom oscillations, and coherence can be transient. It is also difficult to compare to traditional measures of phase-locking. Why not simply use spike-field coherence? This is particularly important with reference to the human literature, which the authors invoke.

      Excellent suggestion. Note that PCA provided a way to classify neurons that exhibited peaks in the autocorrelation at theta frequencies. We have added spike-field coherence, and this analysis confirms the differences in theta entrainment of the spike trains across the behavioral groups. Please see Figure 6D.   
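
      As a point of reference for the "traditional measures of phase-locking" raised in this comment, the sketch below computes the mean resultant length of spike phases. Note that this is a simpler, related quantity and not spike-field coherence itself, and it is not the analysis reported in Figure 6D. The idealized sinusoidal LFP, the 8 Hz frequency, and the spike times are assumptions for illustration; with real data the phases would come from the band-passed LFP.

```python
import cmath
import math


def spike_phases(spike_times_s, freq_hz):
    """Phase (radians) of an idealized sinusoidal LFP at each spike time."""
    return [(2 * math.pi * freq_hz * t) % (2 * math.pi) for t in spike_times_s]


def mean_resultant_length(phases):
    """Length of the mean unit vector of the phases: 0 = no locking, 1 = perfect locking."""
    if not phases:
        raise ValueError("need at least one spike phase")
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))


# Spikes fired once per cycle of a hypothetical 8 Hz rhythm, always at the same phase
locked_spikes = [k / 8 for k in range(1, 9)]
r_locked = mean_resultant_length(spike_phases(locked_spikes, freq_hz=8))

# Phases spread evenly around the cycle carry no preferred phase
r_uniform = mean_resultant_length([0.0, math.pi / 2, math.pi, 3 * math.pi / 2])
```

      Because the mean resultant length is biased upward for small spike counts, comparisons across groups would normally equate the number of spikes or use a shuffled baseline.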

      Reviewer #3 (Public Review):

      Summary:

      The study investigated decision making in rats choosing between small immediate rewards and larger delayed rewards, in a task design where the size of the immediate rewards decreased when this option was chosen and increased when it was not chosen. The authors conceptualise this task as involving two different types of cognitive effort; 'resistance-based' effort putatively needed to resist the smaller immediate reward, and 'resource-based' effort needed to track the changing value of the immediate reward option. They argue based on analyses of the behaviour, and computational modelling, that rats use different strategies in different sessions, with one strategy in which they consistently choose the delayed reward option irrespective of the current immediate reward size, and another strategy in which they preferentially choose the immediate reward option when the immediate reward size is large, and the delayed reward option when the immediate reward size is small. The authors recorded neural activity in anterior cingulate cortex (ACC) and argue that ACC neurons track the value of the immediate reward option irrespective of the strategy the rats are using. They further argue that the strategy the rats are using modulates their estimated value of the immediate reward option, and that oscillatory activity in the 6-12 Hz theta band occurs when subjects use the 'resistance-based' strategy of choosing the delayed option irrespective of the current value of the immediate reward option. If solid, these findings will be of interest to researchers working on cognitive control and ACC's involvement in decision making. However, there are some issues with the experiment design, reporting, modelling and analysis which currently preclude high confidence in the validity of the conclusions.

      Strengths:

      The behavioural task used is interesting and the recording methods should enable the collection of good quality single unit and LFP electrophysiology data. The authors recorded from a sizable sample of subjects for this type of study. The approach of splitting the data into sessions where subjects used different strategies and then examining the neural correlates of each is in principle interesting, though I have some reservations about the strength of evidence for the existence of multiple strategies.

      Thank you for the positive comments. 

      Weaknesses:

      The dataset is very unbalanced in terms of both the number of sessions contributed by each subject, and their distribution across the different putative behavioural strategies (see table 1), with some subjects contributing 9 or 10 sessions and others only one session, and it is not clear from the text why this is the case. Further, only 3 subjects contribute any sessions to one of the behavioural strategies, while 7 contribute data to the other such that apparent differences in brain activity between the two strategies could in fact reflect differences between subjects, which could arise due to e.g. differences in electrode placement. To firm up the conclusion that neural activity is different in sessions where different strategies are thought to be employed, it would be important to account for potential cross-subject variation in the data. The current statistical methods don't do this as they all assume fixed effects (e.g. using trials or neurons as the experimental unit and ignoring which subject the neuron/trial came from).

      In the revised manuscript we have updated the group assignments. We have improved our description of the logic and methods for employing these groupings as well. With this new approach, all sessions are now included in the analysis. The group assignments are made purely on the behavioral statistics of an animal in each session. We feel this approach is preferable to eliminating neurons or sessions with the goal of balancing them, which may introduce bias. Further, the rats that contribute the most sessions also tend to be represented across the behavioral groups; it is therefore unlikely that effort allocation strategies across groupings are an idiosyncratic feature of an animal. As neurons are randomly sampled from each animal on a given session, we feel that we’re justified in treating these as fixed effects.

      It is not obvious that the differences in behaviour between the sessions characterised as using the 'G1' and 'G2' strategies actually imply the use of different strategies, because the behavioural task was different in these sessions, with a shorter wait (4 seconds vs 8 seconds) for the delayed reward in the G1 strategy sessions where the subjects consistently preferred the delayed reward irrespective of the current immediate reward size. Therefore the differences in behaviour could be driven by difference in the task (i.e. external world) rather than a difference in strategy (internal to the subject). It seems plausible that the higher value of the delayed reward option when the delay is shorter could account for the high probability of choosing this option irrespective of the current value of the immediate reward option, without appealing to the subjects using a different strategy.

      Further, even if the differences in behaviour do reflect different behavioural strategies, it is not obvious that these correspond to allocation of different types of cognitive effort. For example, subjects' failure to modify their choice probabilities to track the changing value of the immediate reward option might be due simply to valuing the delayed reward option higher, rather than not allocating cognitive effort to tracking immediate option value (indeed this is suggested by the neural data). Conversely, if the rats assign higher value to the delayed reward option in the G1 sessions, it is not obvious that choosing it requires overcoming 'resistance' through cognitive effort.

      The RL modelling used to characterise the subject's behavioural strategies made some unusual and arguably implausible assumptions:

      Thank you for the feedback. Based on these comments (and those above), we have completely reworked the RL model. In addition, we’ve taken care to separate out the variables that correspond to a resistance- versus a resource-based signal.

      There were also some issues with the analyses of neural data which preclude strong confidence in their conclusions:

      Figure 4I makes the striking claim that ACC neurons track the value of the immediately rewarding option equally accurately in sessions where two putative behavioural strategies were used, despite the behaviour being insensitive to this variable in the G1 strategy sessions. The analysis quantifies the strength of correlation between a component of the activity extracted using a decoding analysis and the value of the immediate reward option. However, as far as I could see this analysis was not done in a cross-validated manner (i.e. evaluating the correlation strength on test data that was not used for either training the MCML model or selecting which component to use for the correlation). As such, the chance level correlation will certainly be greater than 0, and it is not clear whether the observed correlations are greater than expected by chance.

      We have added more rigorous methods to assess the ival tracking signal (Figure 4 and 5). In addition, we’ve dropped the claim that ival tracking is the same across the behavioral groups. We suspect that this was an artifact of a suboptimal group assignment approach in the previous version. 

      An additional caveat with the claim that ACC is tracking the value of the immediate reward option is that this value likely correlates with other behavioural variables, notably the current choice and recent choice history, that may be encoded in ACC. Encoding analyses (e.g. using linear regression to predict neural activity from behavioural variables) could allow quantification of the variance in ACC activity uniquely explained by option values after controlling for possible influence of other variables such as choice history (e.g. using a coefficient of partial determination).
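      The encoding analysis suggested above can be sketched as follows. This is a minimal illustration on synthetic data, not code from the manuscript or the review: the variable names (`ival`, `choice_history`, `rate`) and the generative model are assumptions chosen only to show how a coefficient of partial determination separates the variance uniquely explained by one predictor from that shared with a correlated one.

```python
import numpy as np

def residual_ss(X, y):
    """Residual sum of squares of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_r2(X_full, y, drop_col):
    """Coefficient of partial determination for column `drop_col`:
    (SSE_reduced - SSE_full) / SSE_reduced."""
    sse_full = residual_ss(X_full, y)
    X_reduced = np.delete(X_full, drop_col, axis=1)
    sse_reduced = residual_ss(X_reduced, y)
    return (sse_reduced - sse_full) / sse_reduced

# Synthetic example (assumption): firing rate depends only on choice history,
# but the immediate-option value is correlated with choice history.
rng = np.random.default_rng(0)
n_trials = 500
choice_history = rng.normal(size=n_trials)
ival = 0.8 * choice_history + 0.6 * rng.normal(size=n_trials)
rate = 2.0 * choice_history + rng.normal(size=n_trials)

X = np.column_stack([ival, choice_history])
pr2_ival = partial_r2(X, rate, drop_col=0)  # unique contribution of ival
pr2_hist = partial_r2(X, rate, drop_col=1)  # unique contribution of choice history
```

      In this toy case a simple correlation between `rate` and `ival` would be substantial, while `pr2_ival` stays near zero because the apparent value signal is carried entirely by choice history.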

      We agree that the ival tracking signal may be influenced by other variables, especially ones that are not cognitive but rather generated by the autonomic system. We have included a discussion of this possibility in the Discussion section. Our previous work has explored the role of choice history on neural activity; please see White et al. (2024).

      Figure 5 argues that there are systematic differences in how ACC neurons represent the value of the immediate option (ival) in the G1 and G2 strategy sessions. This is interesting if true, but it appears possible that the effect is an artefact of the different distribution of option values between the two session types. Specifically, due to the way that ival is updated based on the subjects' choices, in G1 sessions where the subjects are mostly choosing the delayed option, ival will on average be higher than in G2 sessions where they are choosing the immediate option more often. The relative number of high, medium and low ival trials in the G1 and G2 sessions will therefore be different, which could drive systematic differences in the regression fit in the absence of real differences in the activity-value relationship. I have created an ipython notebook illustrating this, available at: https://notebooksharing.space/view/a3c4504aebe7ad3f075aafaabaf93102f2a28f8c189ab9176d4807cf1565f4e3. To verify that this is not driving the effect it would be important to balance the number of trials at each ival level across sessions (e.g. by subsampling trials) before running the regression.
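      The balancing step proposed at the end of this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the reviewer's notebook: the function name `balanced_indices`, the six value levels, and the trial counts and probabilities are all hypothetical.

```python
import numpy as np

def balanced_indices(levels_a, levels_b, rng):
    """Return index arrays into two sessions such that every value level
    appears equally often in both (subsampling without replacement)."""
    idx_a, idx_b = [], []
    for level in np.intersect1d(levels_a, levels_b):
        in_a = np.flatnonzero(levels_a == level)
        in_b = np.flatnonzero(levels_b == level)
        k = min(len(in_a), len(in_b))  # keep as many trials as the sparser session has
        idx_a.append(rng.choice(in_a, size=k, replace=False))
        idx_b.append(rng.choice(in_b, size=k, replace=False))
    return np.concatenate(idx_a), np.concatenate(idx_b)

rng = np.random.default_rng(0)
levels = [1, 2, 3, 4, 5, 6]  # hypothetical immediate-option values
# Hypothetical sessions: one skewed towards high values, one towards low values.
g1_ival = rng.choice(levels, size=300, p=[0.02, 0.03, 0.05, 0.2, 0.3, 0.4])
g2_ival = rng.choice(levels, size=300, p=[0.3, 0.3, 0.2, 0.1, 0.05, 0.05])

keep_g1, keep_g2 = balanced_indices(g1_ival, g2_ival, rng)
counts_g1 = np.bincount(g1_ival[keep_g1], minlength=7)
counts_g2 = np.bincount(g2_ival[keep_g2], minlength=7)
```

      After subsampling, `counts_g1` and `counts_g2` are identical, so any remaining difference in a regression fit cannot be attributed to unequal sampling of value levels. Repeating the subsampling with different seeds would give a distribution of fits rather than a single estimate.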

      This is an excellent point and led us to abandon the linear regression-based approach to quantify differences in ival coding across behavioral groups.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      This paper was extremely hard to read. In addition to the issues raised in the public review (overly complex and incomplete analyses), one of the hardest things to deal with was the writing.

      Thank you for the feedback. Hopefully we have addressed this with our thorough rewrite. 

      The presentation was extremely hard to follow. I had to read through it several times to figure out what the task was. It wasn't until I got to the RL model Figure 2A that I realized what was really going on with the task. I strongly recommend having an initial figure that lays out the actual task (without any RL or modeling assumptions) and identifies the multiple different kinds of sessions. What is the actual data you have to start with? That was very unclear.

      Excellent idea. We have implemented this in Figure 1.  

      Labeling session by "group" is very confusing. I think most readers take "group" as the group of subjects, but that's not what you mean at all. You mean some sessions were one way and some were another. (And, as I noted in the public review, you ignore many of the sessions, which I think is not OK.) I think a major rewrite would help a lot. Also, I don't think the group analysis is necessary at all. In the public review, I recommend doing the analyses very differently and more classically.

      We have updated the group assignments in a manner that is more intuitive, reflects the delays, and includes all sessions.  

      The paper is full of arbitrary abbreviations that are completely unnecessary. Every time I came to "ival", I had to translate that into "number of pellets delivered on the immediate lever" and every time I came to dLP, I had to translate that into "delayed lever press". Making the text shorter does not make the text easier to read. In general, I was taught that unless the abbreviation is the common term (such as "DNA" not "deoxyribonucleic acid"), you should never use an abbreviation. While there are some edge cases (ACC probably over "anterior cingulate cortex"), dLP, iLP, dLPs, iLPs, ival, are definitely way over the "don't do that" line.

      We completely agree here and apologize for the excessive use of abbreviations. We have removed nearly all of them.

      The figures were incomplete, poorly labeled, and hard to read. A lot of figures were missing, for example

      Basic task structure

      Basic behavior on the task

      Scatter plot of the measures that you are clustering (lever press choice X number of pellets on the immediate lever, you can use color or multiple panels to indicate the delay to the delayed lever) Figure 3 is just a couple of examples. That isn't convincing at all.

      Figure 4 is missing labels. In Figure 4, I don't understand what you are trying to say.

      I don't see how the results on page 16 arise from Figure 6. I strongly recommend starting from the actual data and working your way to what it means rather than forcing this into this unreasonable "session group" analysis.

      We have completely reworked the Figures for clarity and content. 

      The statement that "no prior study has explored the cellular correlates of cognitive effort" is ludicrous and insulting. There are dozens of experiments looking at ACC in cognitive effort tasks, in humans, other primates, and rodents. There are many dozens of experiments looking at cellular correlates in intertemporal choice tasks, some with neural manipulations, some with ensemble recordings. There are many dozens of experiments looking at cellular relationships to waiting out a delay.

      We agree that our statement was extremely imprecise. We have updated this to say: “Further, a role for theta oscillations in allocating physical effort has been identified. However, the cellular mechanisms within the ACC that control and deploy types of cognitive effort have not been identified.”

      Reviewer #2 (Recommendations For The Authors):

      In Figure 2, the panels below E and F are referred to as 'right' - but they are below? I would give them letters.

      I would make sure that animal #s, neuron #s, and LFP#s are clearly presented in the results and in each figure legend. This is important to follow the results throughout the manuscript.

      Some additional proofreading ('Fronotmedial') might help with clarity.

      Based on our updates, this is no longer relevant.  

      Reviewer #3 (Recommendations For The Authors):

      In addition to the suggestions above to address specific issues, it would be useful to report some additional information about aspects of the experiments and analyses:

      Specify how spike sorting was performed and what metrics were used to select well isolated single units.

      Done.

      Provide histology showing the recording locations for each subject.

      Histological assessments of electrode placements are provided in White et al. 2024, but we provide an example placement. This has been added to the text.

      Indicate the sequence of recording sessions that occurred for each subject, including for each session what delay duration was used and which dataset the session contributed to, and indicate when the neural probes were advanced between sessions.

      We feel that this adds complexity unnecessarily, as we make no claims about holding units across sessions or differences in coding in the dorsoventral gradient of ACC.

      Indicate the experimental unit when reporting uncertainty measures in figure legends (e.g. mean +/- SEM across sessions).

      Done.

    1. Author response:

      Before providing a brief provisional response to the two reviews, it is important to reiterate a few key points about our work. First, our paper is largely a computational biophysics paper, augmented by experimental results. Generally speaking, computational biophysics work intends to achieve one of two things (or both). One is to provide more molecular level insight into various behaviors of biomolecular systems that have not been (or cannot be) provided by qualitative experimental results alone. The second general goal of computational biophysics is to formulate new hypotheses to be tested subsequently by experiment. In our paper, we have achieved both of these goals and then confirmed the key computational results by experiment.

      The first reviewer has some valuable points, which can be addressed as follows (and will be emphasized in the revised version of the paper): (1) Yes, the simulations of capsid rupture in the NPC and capsid-only are directly comparable as both have approximately the same number of bound LEN, as determined by following the LEN-capsid interaction protocol described in the main text (around Fig 6) and in the SI section S3; (2) While we have stressed this point in several places in the manuscript, here again we stress that coarse-grained (CG) MD time is not the same as real time. The point of CG simulations is to accelerate the timescale of the MD and the associated sampling, so the CG “time” from the MD integrator needs to be rescaled to associate a real time to it. As such, our CG simulation is not representing a microsecond of real time but rather something much longer. We will emphasize this again in the revised text. (3) Actually, we think that the parameterization of the LEN model and the LEN-capsid interactions is well described in the text associated with Fig 6 and in SI section S3. It is true that this one part of the CG model was parameterized “top-down” given the good experimental structures of bound LEN to capsid and other data, but the rest of the CG model is “bottom-up” (meaning developed from well-defined coarse-graining statistical mechanics as applied to molecular level structures and interactions, see also below).

      As for the second reviewer, this review is quite problematic in our view as the reviewer seems to think that quoting a number of qualitative experimental results is sufficient to undermine the impact of our paper (it is not) and, furthermore, the reviewer appears to have a very minimal understanding of “bottom-up” CG modeling, which we have utilized. This modeling does not in fact rely on the “assumptions” this reviewer alleges we have relied on. (As an aside, it could be helpful for this reviewer to study the review by Jin et al, https://doi.org/10.1021/acs.jctc.2c00643, in order to become more familiar with the field and our approach before criticizing it.) We also note that our main HIV capsid-NPC docking model is already published in PNAS (https://doi.org/10.1073/pnas.2313737121), where it underwent rigorous peer review. In our forthcoming full response to the reviews and in the revised paper we will attempt to address a number of this reviewer’s comments, but the number, extent, and tone of this collection of criticisms, for us, calls into question the objectivity of this reviewer, not to mention the reviewer’s rather weak understanding of what we have done and how we have done it.

      Finally, while we certainly appreciate the overall positive eLife assessment, we are disappointed by the statement “some mechanistic interpretations rely on assumptions embedded in the simulations, leaving parts of the evidence incomplete”. Of course, all simulations (and experiments) rely on certain assumptions, but we have gone to great lengths to provide a “bottom-up” approach to our modeling, based on underlying molecular level structures and interactions, and we have provided experimental validation of the main simulation predictions. It seems that the comments of the second reviewer may have influenced this point of view, but we do not feel it is justified.

    1. Reviewer #2 (Public review):

      Summary:

      Previous studies by some of the authors of the current manuscript showed that healthy human newborns memorize recently learned nonsense words. They exposed neonates to a familiarization period (several minutes) when multiple repetitions of a bisyllabic word were presented, uttered by the same speaker. Then they exposed neonates to an "interference period" when newborns listened to music or the same speaker uttering a different pseudoword. Finally, neonates were exposed to a test period when infants heard the familiarized word again. Interestingly, when the interference was music, the recognition of the word remained. Word recognition was measured using the NIRS technique, which estimates the regional brain oxygenation at the scalp level. Specifically, the brain response to the word in the test was reduced, unveiling a familiarity effect, while an increase in regional brain oxygenation corresponds to the detection of a "new word" due to a novelty effect. In previous studies, music does not erase the memory traces for a word (familiarity effect), while a different word uttered by the same speaker does.

      The current study aims at exploring whether and how word memory is interfered with by other speech properties, specifically the changes in the speaker, while young children can distinguish speakers by processing the speech. The authors' main hypothesis anticipates that new speaker recognition would produce less interference in the familiarized word because somehow neonates "separate" the processing of both words (familiarized uttered by one speaker, and interfering word, uttered by a different speaker), memorizing both words as different auditory events.

      From my point of view, this hypothesis is interesting, since the results would contribute to estimating the role of the speaker in word learning and speech processing early in life.

      Strengths:

      (1) New data from neonates. Exploring neonates' cognitive abilities is a big challenge, and we need more data to enrich the knowledge of the early steps of language acquisition.

      (2) The study contributes new data showing the role of speaker (recognition) on word learning (word memory), a quite unexplored factor. The idea that neonates include speakers in speech processing is not new, but its role in word memory has not been evaluated before. The possible interpretation is that neonates integrate the process of the linguistic and communicative aspects of speech at this early age.

      (3) The study proposes a quite novel analytic approach. The new mixed models allow exploring the brain response considering an unbalanced design. More than the loss of data, which is frequent in infants' studies, the familiarization, interference and learning processes may take place at different moments of the experiment (e.g. related to changes in behavioural states along the experiment) or expressed in different regions (e.g. related to individual variations in optodes' locations and brain anatomy).

      Weaknesses:

      I did not find major weaknesses. However, I would like to have more discussion or explanation on the following points.

      (1) It would be fine to report the contribution of each infant to the analysis, i.e. how many good blocks, 1 to 5 in sequence 1 and 2, were provided by each infant.

      (2) Why did the factor "blocknumber" range from 0 to 4? The authors should explain what block zero means and why not 1 to 5.

      (3) I would suggest attempting to integrate the changes in brain activity across the 3 phases, that is, to examine whether changes in familiarization relate to changes in the test and interference phases. For instance, in Figure 2, the brain response distinguishes between same and novel words over IFG and STG in both hemispheres. However, in the right STG there was no initial increase in the brain response, and the response for same words was higher than that for novel words in the 5th block.

      (4) Similarly, it is quite amazing that the brain did not increase the activity with respect to the familiarization during the interference phase, mainly over the left hemisphere, even if both the word and speaker changed. Although the discussion considers these findings, the treatment of the detection of novel words and of a novel speaker over time would benefit from greater integration of the results.

      Appraisal:

      The authors achieved their aims because the design and analytic approaches showed significant differences. The conclusions are based on these results. Specifically, the hypothesis that neonates would memorize words after interference, when interfered speech is pronounced by a different speaker, was supported by the data in blocks 2 and 5, and the potential mechanisms underlying these findings were discussed, such as separate processing for different speakers, likely related to the recognition of speaker identity.

      I think the discussion is well-structured, although I would suggest integrating the changes across the three phases of the study, perhaps by comparing with other regions not related to speech processing.

      Evaluating neonates is a challenge because physiology is constantly changing. For instance, in 9 minutes, newborns may transition between different behavioral states and experience different physiological needs.

      This study offers the opportunity to inspire looking for commonalities and individual differences when investigating early memory capacities of newborns.

    1. The rhythm of today, like every day we have lived here on Turtle Island, is made possible through the historic and ongoing processes and ideologies of colonialism. Importantly, it is also made possible through ongoing and persistent resistance to colonialism.

      This made me think about how, in my own household, which I manage mostly on my own, the things I rely on, such as housing and education systems, exist because of colonial systems, even though they may appear to me as just "regular" life.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this paper, the authors develop a biologically plausible recurrent neural network model to explain how the hippocampus generates and uses barcode-like activity to support episodic memory. They address key questions raised by recent experimental findings: how barcodes are generated, how they interact with memory content (such as place and seed-related activity), and how the hippocampus balances memory specificity with flexible recall. The authors demonstrate that chaotic dynamics in a recurrent neural network can produce barcodes that reduce memory interference, complement place tuning, and enable context-dependent memory retrieval, while aligning their model with observed hippocampal activity during caching and retrieval in chickadees.

      Strengths:

      (1) The manuscript is well-written and structured.

      (2) The paper provides a detailed and biologically plausible mechanism for generating and utilizing barcode activity through chaotic dynamics in a recurrent neural network. This mechanism effectively explains how barcodes reduce memory interference, complement place tuning, and enable flexible, context-dependent recall.

      (3) The authors successfully reproduce key experimental findings on hippocampal barcode activity from chickadee studies, including the distinct correlations observed during caching, retrieval, and visits.

      (4) Overall, the study addresses a somewhat puzzling question about how memory indices and content signals coexist and interact in the same hippocampal population. By proposing a unified model, it provides significant conceptual clarity.

      Weaknesses:

      The recurrent neural network model incorporates assumptions and mechanisms, such as the modulation of recurrent input strength, whose biological underpinnings remain unclear. The authors acknowledge some of these limitations thoughtfully, offering plausible mechanisms and discussing their implications in depth.

      One thread of questions that authors may want to further explore is related to the chaotic nature of activity that generates barcodes when recurrence is strong. Chaos inherently implies sensitivity to initial conditions and noise, which raises questions about its reliability as a mechanism for producing robust and repeatable barcode signals. How sensitive are the results to noise in both the dynamics and the input signals? Does this sensitivity affect the stability of the generated barcodes and place fields, potentially disrupting their functional roles? Moreover, does the implemented plasticity mitigate some of this chaos, or might it amplify it under certain conditions? Clarifying these aspects could strengthen the argument for the robustness of the proposed mechanism.

      In our model, chaos is used to produce a random barcode when forming memories, but memory retrieval depends on attractor dynamics. Specifically, the plasticity update at the end of the cache creates an attractor state, and then afterwards for successful memory retrieval the network activity must settle into this attractor rather than remaining chaotic. This attractor state is a conjunction of memory content (place and seed activity) and memory index (barcode activity). Thus a barcode is ‘reactivated’ when network dynamics during retrieval settle into this cache attractor; in other words, chaotic dynamics do not need to generate the same barcode twice.
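      The logic of this paragraph can be illustrated with a deliberately simplified toy. The sketch below is not the model from the manuscript: it swaps in a Hopfield-style binary network with a single Hebbian outer-product update, and the vectors `place`, `seed` and `barcode` are random stand-ins (in particular, the barcode is drawn at random here rather than generated by chaotic recurrent dynamics). It shows only that, once the conjunction has been stored, a cue carrying the place component alone is enough for the dynamics to settle into the full stored state, barcode included.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 1000

# Hypothetical components of activity during a cache.
place = rng.normal(size=n_units)    # place input-driven activity
seed = rng.normal(size=n_units)     # seed input-driven activity
barcode = rng.normal(size=n_units)  # random stand-in for the recurrently generated barcode

# Cache: binarize the combined activity and store it with one Hebbian update.
attractor = np.sign(place + seed + barcode)
weights = np.outer(attractor, attractor) / n_units
np.fill_diagonal(weights, 0.0)

# Retrieval: cue the network with the place component only.
state = np.sign(place)
overlap_before = float(state @ attractor) / n_units  # partial correlation with the memory

for _ in range(5):
    state = np.sign(weights @ state)  # recurrent dynamics pull activity to the stored state

overlap_after = float(state @ attractor) / n_units
```

      The cue starts only partially aligned with the stored state (`overlap_before` well below 1), and the recurrent update completes it (`overlap_after` reaches 1). Nothing in the retrieval loop regenerates the barcode; it is recovered because it is part of the stored conjunction.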

      The reviewer raises an important point, which is how sensitivity to initial conditions and noise would affect the reliability of our proposed mechanism. The key question here is how noise will affect the network’s dynamics during retrieval. Would adding noise to the dynamics make memory retrieval more difficult? We thank the reviewer for suggesting we investigate this further, and below describe our experiments and changes to the manuscript to better address this topic.

      We first experimented with adding independent Gaussian distributed noise into each unit, drawn independently at each timestep. We analyzed recall accuracy using the same task and methods as Fig. 4F while varying the magnitude of noise. Memory recall was quite robust to this form of noise, even as the magnitude of noise approached half of the signal amplitude. This first experiment added noise into the temporal dynamics of the network. We subsequently examined adding static noise into the network inputs, which can also be thought of as introducing noise into initial conditions. Specifically, we added independent Gaussian distributed noise into each unit, with the random value held constant for the extent of temporal dynamics. This perturbation decreased the likelihood of memory recall in a graded manner with noise magnitude, without dramatically changing the spatial profile. Examination of dynamics on individual trials revealed that the network failed to converge onto a cache attractor on some random fraction of trials, with other trials appearing nearly identical to noiseless results. We now include these results in the text and as a new supplementary figure, Figure S4AB.
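      The difference between the two perturbations described above is purely in when the noise is drawn, which a short sketch makes concrete. The code below is an assumption-laden toy and does not attempt to reproduce Figure S4: the update rule `x <- tanh(W x + input + noise)`, the rank-one connectivity, the gain of 1.5 and the noise magnitudes are all illustrative choices, and `run_dynamics` is a hypothetical helper.

```python
import numpy as np

def run_dynamics(weights, inputs, n_steps, rng, dynamic_sd=0.0, static_sd=0.0):
    """Iterate x <- tanh(W x + input + noise) and return the final state.

    dynamic_sd: noise redrawn independently for every unit at every timestep.
    static_sd:  noise drawn once per unit and held constant for the whole run.
    """
    n_units = len(inputs)
    static_noise = static_sd * rng.normal(size=n_units)  # fixed for this run
    x = np.zeros(n_units)
    for _ in range(n_steps):
        dynamic_noise = dynamic_sd * rng.normal(size=n_units)  # fresh each step
        x = np.tanh(weights @ x + inputs + static_noise + dynamic_noise)
    return x

rng = np.random.default_rng(0)
n_units = 200
pattern = rng.choice([-1.0, 1.0], size=n_units)       # hypothetical stored state
weights = 1.5 * np.outer(pattern, pattern) / n_units  # toy rank-one connectivity
cue = 0.2 * pattern                                   # weak input aligned with the stored state

def overlap(x):
    """Normalized projection of the network state onto the stored pattern."""
    return float(x @ pattern) / n_units

clean = overlap(run_dynamics(weights, cue, 50, rng))
noisy_dynamic = overlap(run_dynamics(weights, cue, 50, rng, dynamic_sd=0.5))
noisy_static = overlap(run_dynamics(weights, cue, 50, rng, static_sd=0.5))
```

      Sweeping `dynamic_sd` and `static_sd` over a range of magnitudes, and repeating over many random draws, would give recall curves of the kind the response describes, although the quantitative behavior of this toy should not be expected to match the full model.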

      To clarify the network dynamics and the purpose of chaos in our model, we make the following modifications in text:

      Section 2.3, paragraph 2 (starting at “To store memories…”):

      “…place inputs arrive into the RNN, recurrent dynamics generate an essentially random barcode, seed inputs are activated, and then Hebbian learning binds a particular pattern of barcode activity to place- and seed-related activity.”

      Section 2.3, paragraph 3 (starting at “Memory recall in our network…”): As an example, consider a scenario in which an animal has already formed a memory at some location $l$, resulting in the storage of an attractor $\vec{a}$ into the RNN. The attractor $\vec{a}$ can be thought of as a linear combination of place input-driven activity $p(l)$, seed input-driven activity $s$, and a recurrent-driven barcode component $b$. Later, the animal returns to the same location and attempts recall (i.e. sets $r = 1$, Figure 3B). Place inputs for location $l$ drive RNN activity towards $p(l)$, which is partially correlated with attractor $\vec{a}$, and the recurrent dynamics cause network activity to converge onto attractor $\vec{a}$. In this way, barcode activity $b$ is reactivated, along with the place and seed components stored in the attractor state, $p(l)$ and $s$. The seed input can also affect recall, as discussed in the following section.

      Section 2.4, final paragraph (starting “We further examined how model hyperparameters affected performance on these tasks”), added the following describing new results on adding noise: We found that adding noise to the network's temporal dynamics had little effect on memory recall performance (Figure S4A). However, large static noise vectors added to the network's input and initial state decreased the overall probability of memory recall, but not its spatial profile (Figure S4B).

      It may also be worth exploring the robustness of the results to certain modeling assumptions.  For instance, the choice to run the network for a fixed amount of time and then use the activity  at the end for plasticity could be relaxed.

      As described above, chaotic dynamics are necessary to generate a barcode during a cache, but not to reactivate that barcode during retrieval. During a successful memory retrieval, network activity settles into an attractor state and thus does not depend on the duration of simulated dynamics. The choice of duration to run dynamics during caching is important, but only insofar as activity significantly decorrelates from the initial state. We show in Figure S1B that decorrelation saturates ~t=25, and thus any random time point t > 25 would be similarly effective. We used a fixed duration runtime for caches only to avoid introducing unnecessary complication into our model.

      Reviewer #2 (Public review):

      Summary:

      Striking experimental results by Chettih et al 2024 have identified high-dimensional, sparse patterns of activity in the chickadee hippocampus when birds store or retrieve food at a given site. These barcode-like patterns were interpreted as "indexes" allowing the birds to retrieve from memory the locations of stored food.

      The present manuscript proposes a recurrent network model that generates such barcode activity and uses it to form attractor-like memories that bind information about location and food. The manuscript then examines the computational role of barcode activity in the model by simulating two behavioral tasks, and by comparing the model with an alternate model in which barcode activity is ablated.

      Strengths of the study:

      Proposes a potential neural implementation for the indexing theory of episodic memory

      Provides a mechanistic model of striking experimental findings: barcode-like, sparse patterns of activity when birds store a grain at a specific location

      A particularly interesting aspect of the model is that it proposes a mechanism for binding discrete events to a continuous spatial map, and demonstrates the computational advantages of this mechanism.

      Weaknesses:

      The relation between the model and experimentally recorded activity needs some clarification

      The relation with indexing theory could be made more clear

      The importance of different modeling ingredients and dynamical mechanisms could be made more clear

      The paper would be strengthened by focusing on the most essential aspects

      Comments:

      The model distinguishes between "barcode activity" and "attractors". Which of the two corresponds to experimentally-recorded barcodes? I would presume the attractors. A potential issue is that the attractors are, as explained in the text (l.137), conjunctions of place activity, barcode activity and "seed" inputs. The fact that the seed activity is shared across attractors seems to imply that they have a non-zero correlation independent of distance. Is that the case in the model? If I understand correctly, Fig 3D shows correlations between an attractor and barcodes at different locations, but correlations between attractors at different locations are not shown. Fig 1 F instead shows that correlations between recorded retrieval activities decay to zero with distance.

      More generally, the fact that the expression "barcode" is apparently used with different meanings in the model and in the experiments is potentially confusing (in the model they correspond to activity generated during caching, and this activity is distinct from the memories; my understanding is that in the experiments barcodes correspond to both caching and retrieval, but perhaps I am mistaken?).

      Our intent is to use the expression “barcode” as similarly as possible between model and experimental work. The reviewer points out that the connection between barcodes in experimental and modeling work is unclear, as well as the relation of “attractors” in our model to previous experimental results. The meaning of ‘barcode’ is absolutely critical—we clarify below our intended meaning, and then describe changes to the manuscript to highlight this.

      In experiments, we observed that activity during caching looked different than ordinary hippocampal activity (i.e. typical “place activity” observed during visits). Empirically there were two major differences. First, there was a pattern of neural activity which was present during every cache. This pattern was also present when birds visually inspected sites containing a cached seed, but not when visually inspecting an empty site. This is what we refer to as “seed activity”. Second, there was a pattern of neural activity which was unique to each cache. This pattern re-occurred during retrieval, and was orthogonal to place activity (see Fig. 1E-F). This is what we refer to as “barcode activity”. In summary, activity during a cache (or retrieval) contains a combination of three components: place activity, seed activity, and barcode activity.

      These experimental findings are recapitulated in our model, as activity during a cache contains a combination of three components: place activity driven by place inputs, seed activity driven by seed inputs, and barcode activity generated by recurrent dynamics. Cache activity in the model corresponds to cache activity in experiments, and barcodes in the model correspond to barcodes in experiments. Our model additionally has “attractors”, meaning that network connectivity changes so that the activity generated during a simulated cache becomes an attractor state of network dynamics. “Attractors” refers to a feature of network dynamics, not a distinct activity state, and we do not yet know if these attractors exist in experimental data.

      Figure 3D, as described in the figure legend, is a correlation of activity during cache and retrieval (in purple), for cache-retrieval pairs at the same or at different sites. We believe this is what the reviewer asks to see: the correlation between attractor states for different cache locations. The reviewer makes an important point: seed activity is shared across all attractors, so then why are correlations not high for all locations? This is because attractors also have a place component, which is anti-correlated for distant locations. This is evident in Fig. 3D by noticing that visit-visit correlations (black line, corresponding to place activity only) are negative for distant locations, and the correlation between attractors (purple line, cache-retrieval pairs) is subtly shifted up relative to the black line (place code only) for these distant locations. The size of this shift is due to the relative magnitude of place and seed inputs. For example, if we increase the strength of the seed input during caching (blue line), we can further increase the correlation between attractors even for quite distant sites:

      Author response image 1.
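
      The argument above (a shared seed component raises the correlation between attractors, while the place component lowers it for distant sites) can be illustrated with a small toy calculation. This is only a sketch under stated assumptions, not the model from the manuscript: the cosine place tuning, the Gaussian seed and barcode vectors, and the names `attractor`, `distant_site_correlation`, `seed_strength` and `barcode_sd` are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def attractor(site_angle, seed_pattern, seed_strength, rng, barcode_sd=0.5):
    """Toy attractor: place + seed_strength * seed + a fresh random barcode."""
    n = len(seed_pattern)
    # Hypothetical place tuning: cosine bumps tiled around a circular arena,
    # so that place activity is anti-correlated for opposite sites.
    place = [math.cos(2 * math.pi * i / n - site_angle) for i in range(n)]
    barcode = [rng.gauss(0.0, barcode_sd) for _ in range(n)]
    return [p + seed_strength * s + b
            for p, s, b in zip(place, seed_pattern, barcode)]


def distant_site_correlation(seed_strength, n=500, rng_seed=0):
    """Correlation between attractors stored at opposite sides of the arena."""
    rng = random.Random(rng_seed)
    seed_pattern = [rng.gauss(0.0, 1.0) for _ in range(n)]  # shared by all caches
    a_near = attractor(0.0, seed_pattern, seed_strength, rng)
    a_far = attractor(math.pi, seed_pattern, seed_strength, rng)
    return pearson(a_near, a_far)


weak_seed = distant_site_correlation(seed_strength=0.5)
strong_seed = distant_site_correlation(seed_strength=2.0)
print(f"weak seed input:   {weak_seed:+.2f}")
print(f"strong seed input: {strong_seed:+.2f}")
```

      In this toy setting the correlation between distant attractors is negative when the seed component is weak relative to the place component and becomes positive when the seed component dominates, which is the qualitative effect described for the blue line.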

      To clarify the manuscript, we made the following modifications:

      Section 2.2, first paragraph: We model the hippocampus as a recurrent neural network (RNN) (Alvarez and Squire, 1994; Tsodyks, 1999; Hopfield, 1982) and propose that recurrent dynamics can generate barcodes from place inputs. As in experiments, the model’s population activity during a cache should exhibit both place and barcode activity components.

      Section 2.3, paragraph 3 (starting at “Memory recall in our network…”): As an example, consider a scenario in which an animal has already formed a memory at some location $l$, resulting in the storage of an attractor $\vec{a}$ into the RNN. The attractor $\vec{a}$ can be thought of as a linear combination of place input-driven activity $p(l)$, seed input-driven activity $s$, and a recurrent-driven barcode component $b$. Later, the animal returns to the same location and attempts recall (i.e. sets $r = 1$, Figure 3B). Place inputs for $l$ drive RNN activity towards $p(l)$, which is partially correlated with attractor $\vec{a}$, and the recurrent dynamics cause network activity to converge onto attractor $\vec{a}$. In this way, barcode activity $b$ is reactivated as part of attractor $\vec{a}$, along with the place and seed components stored in the attractor state, $p(l)$ and $s$. The seed input can also affect recall, as discussed in the following section.

      The insights obtained from the network model for the computational role of barcode activity could be explained more clearly. The introduction starts by laying out the indexing theory, which proposes that the hippocampus links an index with each memory so that the memory is reactivated when the index is presented. The experimental paper suggests that the barcode activations play the role of indexes. Yet, in the model, reactivations of memories are driven not by presenting barcode activity, but by presenting place activity (Cache Presence task) or seed activity (Cache Location task). So it seems that either place activity or seed activity plays the role of indexes. Section 2.5 nicely shows that ultimately the role of barcode activity is to decorrelate attractors, which seems different from playing the role of indexes. I feel it would be useful for the Discussion to reassess more critically the relationship between barcodes, indexing theory, and key-value architectures.

      The reviewer highlights a failure on our part to clearly identify the connection between our findings on barcodes, indexing theory, and key-value architectures. This is another major component of the paper, and below we propose changes to the manuscript to clarify these concepts and their relationships. First, we will summarize the key points that were unclear in our original manuscript.

      The reviewer equates the concept of an ‘index’ with that of a ‘query’: the signal that drives memory reactivation. This may be intuitive, but it is not how a memory index was defined in indexing theory (e.g. Teyler & DiScenna 1986). In indexing theory, the index is a pattern of hippocampal activity that is (a) generated during memory formation, (b) separate from the activity encoding memory content, and (c) linked to memory content via associative plasticity. After memory formation, a memory might be queried by activating a partial set of the memory contents, which would then drive reactivation of the hippocampal index, leading to pattern completion of memory contents. See, for example, figure 1 of Teyler and DiScenna 1986. The ‘index’ is thus not the same as the ‘query’ that drives recall.

      We propose in this work that barcode activity is such an index. Indexing theory originally posited that memory content was encoded by neocortex, and memory index was encoded by hippocampus. However the experiments of Chettih et al. 2024 revealed that the hippocampus contained both memory content and memory index signals, and furthermore there was no division of cells into ‘content’ and ‘index’ subtypes. Thus our model drops the assumption of earlier work that index and content signals correspond to different neurons in different brain areas—a significant advance of our work. Otherwise, the experimentally observed barcodes and the barcodes generated by our computational model play the role of indices as originally defined.

      Our original manuscript was unclear on the relationship of indexing theory and key-value systems. Our work connects diverse areas of memory models, including attractor dynamics, key-value memory systems, and memory indexing. A full account of these literatures and their relationships may be beyond the scope of this manuscript, and we note that a recent review article (Gershman, Fiete, and Irie, 2025) further clarifies the relationship between key-value memory, indexing theory, and the hippocampus. We will cite this work in our discussion as a source for the interested reader.

      Briefly, a key-value memory system distinguishes between the address where a memory is stored, the ‘key’, and the content of that memory, the ‘value’. An advantage of such systems is that keys can be optimized for purposes independent of the value of each memory. The use of barcodes in our model to decorrelate memories is related to this optimization of keys in key-value memory systems. By generating barcodes and adding this to the attractor state corresponding to a cache memory, the ‘address’ of the memory in population activity is differentiated from other memories. Our work is thus consistent with the idea that hippocampus generates keys and implements a key storage system. However it is not so straightforward to equate barcodes with keys, as they are defined in key-value memory. As the reviewer points out, memory recall can be driven by location and seed inputs, i.e. it is content-addressable. We think of the barcode as modifying the memory address to better separate similar memories, without changing memory content, and the resulting memory can be recalled by querying with either content or barcode. Given the complex and speculative nature of these relationships, we prefer to note the salient connection of our work with ongoing efforts applying the key-value framework to biological memory, and leave the precise details of this connection to future work.
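
      The idea that a barcode modifies the memory address while recall stays content-addressable can be sketched in a few lines. The following is a deliberately simplified illustration and not the attractor network of the manuscript: the class `BarcodeMemory`, its nearest-match `recall` rule, and the noise levels are hypothetical, and stored addresses are simply taken to be content plus a random barcode.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


class BarcodeMemory:
    """Toy store whose addresses are content plus a random barcode."""

    def __init__(self, rng, barcode_sd=1.0):
        self.rng = rng
        self.barcode_sd = barcode_sd
        self.addresses = {}  # label -> stored address vector

    def store(self, label, content):
        """Store content under a fresh barcode and return that barcode."""
        barcode = [self.rng.gauss(0.0, self.barcode_sd) for _ in content]
        self.addresses[label] = [c + b for c, b in zip(content, barcode)]
        return barcode

    def recall(self, query):
        """Return the label whose address correlates best with the query."""
        return max(self.addresses,
                   key=lambda label: pearson(query, self.addresses[label]))


rng = random.Random(1)
n = 2000
# Two memories with heavily overlapping content (assumed for illustration).
shared = [rng.gauss(0.0, 1.0) for _ in range(n)]
content_a = [s + rng.gauss(0.0, 0.6) for s in shared]
content_b = [s + rng.gauss(0.0, 0.6) for s in shared]

memory = BarcodeMemory(rng)
barcode_a = memory.store("cache A", content_a)
barcode_b = memory.store("cache B", content_b)

content_overlap = pearson(content_a, content_b)
address_overlap = pearson(memory.addresses["cache A"],
                          memory.addresses["cache B"])
print(f"content overlap: {content_overlap:.2f}")
print(f"address overlap: {address_overlap:.2f}")
print("query by content A ->", memory.recall(content_a))
print("query by barcode B ->", memory.recall(barcode_b))
```

      In this sketch the stored addresses overlap less than the raw contents do, yet a query made from either the content or the barcode still selects the matching memory, which is the sense in which the barcode changes the address without changing what can be used to query it.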

      We make the following changes in the manuscript to clarify these ideas:

      Introduction, first paragraph: In this scheme, during memory formation the hippocampus generates an index of population activity, and the neurons representing this index are linked with the neurons representing memory content by associative plasticity. Later, re-experience of partial memory contents may reactivate the index, and reactivation of the index drives complete recall of the memory contents.

      Discussion, 4th paragraph on key-value: Interestingly, prior theoretical work has suggested neural implementations for both key-value memory and attention mechanisms, arguing for their usefulness in neural systems such as long term memory (Kanerva, 1988; Tyulmankov et al., 2021; Bricken and Pehlevan, 2021; Whittington et al., 2021; Kozachkov et al., 2023; Krotov and Hopfield, 2020; Gershman et al., 2025). In this framework, the address where a memory is stored (the key) may be optimized independently of the value or content of the memory. In our model, barcodes improve memory performance by providing a content-independent scaffold that binds to memory content, preventing memories with overlapping content from blurring together. Thus barcodes can be considered as a change in memory address, and our model suggests important connections between recurrent neural activity and key generation mechanisms. However, we note that barcodes should not be literally equated with keys in key-value systems, as our model’s memory is ‘content-addressable’: it can be queried by place and seed inputs.

      The model includes a number of non-standard ingredients. It would be useful to explain which of these ingredients and which of the described mechanisms are essential for the studied phenomenon. In particular:

      - the dynamics in Eq.2 include a shunting inhibition term. Is it essential and why?

      The shunting inhibition is important as it acts to normalize the network activity to prevent runaway excitation. We hope to clarify this further by amending the following sentence in section 2.2: “g (·) is a leak rate that depends on the average activity of the full network, representing a form of global shunting inhibition that normalizes network activity to prevent runaway excitation from recurrent dynamics.”

      - same question for the global inhibition included in the random connectivity;

      The distribution from which connectivity strengths are drawn has a negative mean (global inhibition). This causes activity during caching (i.e. r = 1) to be sparser than activity during visits (i.e. r = 0), and was chosen to match experimental findings. In figures 2B and S2B we show that our model can transition between a mode with place code only, barcode only, or a mode containing both, by changing the variance of the weight distribution while holding the mean constant. We suggest clarifying this by editing the following in section 2.2, paragraph 2: “We initialize the recurrent weights from a random Gaussian distribution, . where $N_X$ is the number of RNN neurons and μ < 0, reflecting global subtractive inhibition that encourages sparse network activity to match experimental findings (Chettih et al. 2024).”

      - the model is fully rate-based, but for certain figures, spikes are randomly generated. This seems superfluous.

      Spikes are simulated for one analysis and one visualization, where it is important to consider noise or variability in neural responses across trials. First, for Fig. 2H,J, we generated spikes to allow a visual comparison to figures that can be easily generated from experimental data. Second, and more significantly, for the analysis underlying Fig. 3D, it is essential to simulate variability in neural responses. Because our rate-based models are noiseless, the RNN’s rate vector at site distance = 0 will always be the same and result in a correlation of 1 for both visit-visit and cache-retrieval. However, we show that, if one interprets the rate as a noisy Poisson spiking process, the correlation at site distance = 0 between a cache-retrieval pair is higher than that of two visits. This is because under a Poisson spiking model, the signal-to-noise ratio is higher for cache-retrieval activity, where rates are higher in magnitude. The greater correlation for a cache-retrieval pair at the same site, relative to visits at the same site, is an experimental finding that was critical for our model to reproduce. We detail clarifications to the manuscript below in response to the reviewer’s following and related question.
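
      The signal-to-noise argument above can be checked with a short simulation. This is a minimal sketch, not the analysis code behind Fig. 3D: the uniform rate distribution, the factor of 5 between "visit" and "cache" rates, and the helper names `poisson_sample` and `same_site_correlation` are assumptions chosen to isolate the effect of rate magnitude. For independent Poisson counts drawn from the same rate vector, the expected correlation across neurons is roughly Var(rates) / (Var(rates) + mean(rates)), which is below 1 and grows when all rates are scaled up.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def poisson_sample(rate, rng):
    """Draw one Poisson count (Knuth's method; adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def same_site_correlation(rates, rng):
    """Correlate two independent spike-count vectors drawn from the same rates."""
    trial_1 = [poisson_sample(r, rng) for r in rates]
    trial_2 = [poisson_sample(r, rng) for r in rates]
    return pearson(trial_1, trial_2)


rng = random.Random(2)
# Assumed rate vectors: the same pattern at low ("visit") and high ("cache") rates.
visit_rates = [rng.uniform(0.0, 2.0) for _ in range(2000)]
cache_rates = [5.0 * r for r in visit_rates]

noiseless_corr = pearson(visit_rates, visit_rates)  # rates alone: exactly repeatable
visit_corr = same_site_correlation(visit_rates, rng)
cache_corr = same_site_correlation(cache_rates, rng)
print(f"noiseless rates:      {noiseless_corr:.2f}")
print(f"visit-visit spikes:   {visit_corr:.2f}")
print(f"cache-retrieval-like: {cache_corr:.2f}")
```

      With noiseless rates the same-site correlation is 1 by construction, whereas with Poisson sampling the higher-rate vector yields the higher same-site correlation, matching the ordering described in the response.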

      How are the correlations determined in the model (e.g., Fig 2 B)? The methods explain that they are computed from Poisson-generated spikes, but over which time period? Presumably during steady-state responses, but are these responses time-averaged?

      The reviewer points out a lack of clarity in our original manuscript. Correlations for events (caches, retrievals and visits) at different sites are calculated in two sections of the paper (2B, 3D), for different purposes and with slight differences in methods:

      - For Figure 2B, no spikes are simulated. Note that the methods mentioning Poisson spike generation specify only Fig. 2H,J and Fig. 3D. We simply take the network’s rate vector at timestep t=100 (when the decorrelating effect of chaotic dynamics has saturated, S1A-B) and correlate this vector when generated at different locations. We now clarify this in the legend for Figure 2B: “We show correlation of place inputs (gray) and correlation of the RNN's rate vector at t = 100 (black).”

      - For Figure 3D, we want to compare the model to empirical results from Chettih et al. 2024, which are reproduced in this paper in Fig. 1E-F. These empirical results are derived from correlating vectors of spiking activity on pairs of single trials, and are thus affected by noise or variability in neural responses as described in our response to the reviewer’s previous question. We thus took the RNN’s rate vector at t=100 and simulated spiking data by drawing samples from a Poisson distribution to get spike counts. Our original manuscript was unclear about this, and we suggest the following changes:

      - Legend for Figure 3D: D. Correlation of Poisson-generated spikes simulated from RNN rate vectors at two sites, plotted as a function of the distance between the two sites.

      - Section 2.3, last paragraph: Population activity during retrieval closely matches activity during caching, and is substantially decorrelated from activity during visits (Figure 3C). To compare our model with the empirical results reproduced in Figure 1E,F, we ran in silico experiments with caches and retrievals at varying sites in the circular arena. We simulated Poisson-generated spikes drawn from our network's underlying rates to match the intrinsic variability in empirical data (see Methods).

      - Methods, subsection Spatial correlation of RNN activity for cache-retrieval pairs at different sites: To calculate correlation values as in Figure 3D, we simulated experiments where 5 sites were randomly chosen for caching and retrieval. To compare model results to the empirical data in Fig. 1E,F, which includes intrinsic neural variability, we sampled Poisson-generated spike counts from the rates output by our model. Specifically, for RNN activity $\vec{r_i}$ at location $i$, using the rates at t=100 as elsewhere, we first generate a sample vector of spikes…

      I was confused by early and late responses in Fig 2 C. The text says that the activity is initialized at zero, so the response at t=0 should be flat (and zero). More generally, I am not sure I understand why the dynamics matter for the phenomenon at all, presumably the decorrelation shown in Fig 2B depends only on steady state activity (cf previous question).

      Thanks for catching this mistake. The legend has been updated to indicate that the ‘early’ response is actually at t=1, when network activity reflects place inputs without the effects of dynamics. The reviewer is correct that we are primarily interested in the ‘late’ response of the network. All other results in the paper use this late response at t=100. As shown in Fig. S2A,B, this timepoint is not truly a steady state, as activity in the network continues to change, but the decorrelation of network activity with place-driven activity has saturated.

      We include the early response in Fig. 2C for visual comparison of the purely place-driven early activity with the eventual network response. It is also relevant since, as the reviewer points out above, there is a shunting inhibition term in the dynamics that is present during both low and high recurrent strength simulations.

      Related to the previous point, the discussion of decorrelation (l.79 - 97) is somewhat confusing. That paragraph focuses on chaotic activity, but chaos decorrelates responses across different time points. Here the main phenomenon is the decorrelation of responses across different spatial inputs (Fig 2B). This decorrelation is presumably due to the fact that different inputs lead to different non-trivial steady-state responses, but this requires some clarification. If that is correct, the temporal chaos adds fluctuations around these non-trivial steady-state responses, but that alone would not lead to the decorrelation shown in Fig 2B.

      We agree with the reviewer that chaotic activity produces a decorrelation across time points. Because of chaotic dynamics, network activity does not settle into a trivial steady-state, and instead evolves from the initial state in an unpredictable way. The network does not settle into a steady-state pattern, but both the decorrelation of network state with initial state and the rate of change in the network state saturate after ~t=25 timesteps, as shown in Fig. S2A-B.

      The initial activity for nearby states is similar, due to them receiving similar place inputs.

      Because network activity is chaotically decorrelated from this initial state by temporal dynamics, ‘late stage’ network activity between nearby spatial states is less correlated than ‘early stage’ activity. Thus the temporal decorrelation produces a spatial decorrelation. We believe that the changes we have introduced to the manuscript in revision will make this point clearer in our resubmission.
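
      The claim that temporal decorrelation produces spatial decorrelation can be illustrated with a generic chaotic rate network. The sketch below is not the model from the manuscript: it uses a standard discrete-time map x <- tanh(gain * W x) with zero-mean Gaussian weights, omits place inputs and shunting inhibition, and stands in for "nearby sites" with two similar initial states. The gain of 3.0, the network size, and the function names are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def random_weights(n, rng):
    """Gaussian recurrent weights with variance 1/n (zero mean in this sketch)."""
    sd = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, sd) for _ in range(n)] for _ in range(n)]


def step(weights, state, gain):
    """One update of the toy map x <- tanh(gain * W x)."""
    return [math.tanh(gain * sum(w * x for w, x in zip(row, state)))
            for row in weights]


def correlation_over_time(gain, n=200, steps=30, rng_seed=3):
    """Track the correlation between two trajectories that start out similar."""
    rng = random.Random(rng_seed)
    weights = random_weights(n, rng)
    state_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # A "nearby site": the same initial state plus a small perturbation.
    state_b = [x + rng.gauss(0.0, 0.3) for x in state_a]
    correlations = [pearson(state_a, state_b)]
    for _ in range(steps):
        state_a = step(weights, state_a, gain)
        state_b = step(weights, state_b, gain)
        correlations.append(pearson(state_a, state_b))
    return correlations


trace = correlation_over_time(gain=3.0)
print(f"early (t=0):  {trace[0]:.2f}")
print(f"late  (t=30): {trace[-1]:.2f}")
```

      In this toy network the two states begin highly correlated and end up far less correlated after the recurrent dynamics have run, which is the same qualitative relationship between ‘early stage’ and ‘late stage’ activity described above.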

      A key ingredient of the model is that the recurrent interactions are switched on and off between "caching" and "visits". The discussion argues that a possible mechanism for this is recurrent inhibition (l.320), which would need to be added. However two forms of inhibition are already included in the model. The text also says that it is unclear how units in the model should be mapped onto E and I neurons. However the model makes explicit assumptions about this, in particular by generating spikes from individual neurons. Altogether, I did not find that part of the Discussion convincing.

      We agree with the reviewer that this section is a limitation of our current work, and in fact it is an ongoing area of future research. However, we think the advances in this current work warrant publication despite this topic requiring further research. We attempted to discuss this limitation explicitly, and note that the other reviewer pointed this section out as particularly helpful. We do not think it is problematic for a realistic model of the brain to ultimately include three, or even more, forms of inhibition. We do not think that Poisson-generated spikes commit us to interpreting network units as single neurons. Spikes are not a core part of our model’s mechanism, and were used only as a means of introducing variability on top of deterministic rates for specific analyses. Furthermore, one could still view network units as pools of both E and I spiking neurons. We would welcome further recommendations the reviewer believes are important to note in this section on our model’s limitations.

      On lines 117-120 the text briefly mentions an alternate feed-forward model and promptly discards it. The discussion instead says that a "separate possibility is that barcodes are generated in a circuit upstream of where memories are stored, and supplied as inputs to the hippocampal population", and that this possibility would lead to identical conclusions. The two statements seem a bit contradictory. It seems that the alternative possibility would replace the need for switching on and off recurrent interactions, with a mechanism where barcode inputs are switched on and off. This alternate scenario is perhaps more plausible, so it would be useful to discuss it more explicitly.

      We apologize for the confusion here, which seems to be due to our phrasing in the discussion section. We do reject the idea that a simple feed-forward model could generate the spatial correlation profile observed in data, as mentioned in the text and included as Fig. S2. Our statement in the discussion may have seemed contradictory because here we intended to discuss the possibility that an upstream area generates barcodes, for example by the chaotic recurrent dynamics proposed in our work, while a downstream network receives these barcodes as inputs and undergoes plasticity to store memories as attractors. We did not intend to suggest any connection to the feedforward model of barcode generation, and apologize for the confusion. We claim that this ‘2 network’ solution would lead to similar conclusions because the upstream network would need an efficient means of barcode generation, the downstream network would need an efficient means of storing memory attractors, and separating these functions into different networks is not likely to affect, for example, the advantage of partially decorrelating memory attractors. Moreover, the downstream network would still require some form of recurrent gating, so that during visits it exhibits place activity without activating stored memory attractors.

      We thus chose a 1 network instead of a 2 network solution because it was simpler and, we believe, more interesting. It is challenging in the absence of more data to say which is more plausible, thus we wanted to mention the possibility of a 2 network solution. We suggest the following changes to the manuscript:

      - Discussion, 3rd paragraph: “Alternatively, other mechanisms may be involved in generating barcodes. We demonstrated that conventional feed-forward sparsification (Babadi and Sompolinsky, 2014; Xie et al., 2023) was highly inefficient, but more specialized computations may improve this (Földiak, 1990; Olshausen and Field, 1996; Sacouto and Wichert, 2023; Muscinelli et al., 2023). Another possibility is that barcodes are generated in a separate recurrent network upstream of the recurrent network where memories are stored. In this 2-network scenario, the downstream network receives both spatial tuning and barcodes as inputs. This would not obviate the need for modulating recurrent strength in the downstream network to switch between input-driven modes and attractor dynamics. We suspect separating barcode generation and memory storage in separate networks would not fundamentally affect our conclusions.”

      As a minor note, the beginning of the discussion states that the presented model is similar to previous recurrent network models of the hippocampus. It would be worth noting that several of the cited works assign a very different role to recurrent interactions: they generate place cell activity, while the present model assumes it is inherited from upstream inputs.

      We are not sure how best to modify the paper to address this suggestion. As far as we know, all of the cited models which deal with spatial encoding do assume that the hippocampus receives a spatially-modulated or spatially-tuned input. For example, the Tsodyks 1999 paper cited in this paragraph uses exponentially-decaying place inputs to each neuron, highly similar to our model. Furthermore, we explore how our model would perform if we change the format of spatial inputs in Fig. S4, and find key results are unchanged. It is unclear how hippocampal place fields could emerge without inputs that differentiate between spatial locations. We think it is appropriate to highlight the similarity of our model to well-known Hopfield-type recurrent models, where memories are stored as attractor states of the network dynamics.

      On the other hand, we agree that a common line of hippocampal modeling proposes that recurrent interactions reshape spatial inputs to produce place fields. This often arises in the context of hippocampus generating a predictive map, where inputs may be one-hot for a single spatial state, in a grid cell-like format, or a random projection of sensory features. We attempted to address this in section 2.6, using a model which superimposes the random connectivity needed for barcode generation with the structured connectivity needed for predictive map formation. We found that such a model was able to perform both predictive and barcode functions, suggesting a path forward to connecting different lines of hippocampal modeling in future work.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, Xiong and colleagues investigate the mechanisms operating downstream to TRIM32 and controlling myogenic progression from proliferation to differentiation. Overall, the bulk of the data presented is robust. Although further investigation of specific aspects would make the conclusions more definitive (see below), it is an interesting contribution to the field of scientists studying the molecular basis of muscle diseases.

      We thank the Reviewer for appreciating our work and for their valuable suggestions to improve our manuscript. We have carefully addressed some of the concerns raised, as detailed here, while others, which require more experimental efforts, will be addressed as detailed in the Revision Plan.

      In my opinion, a few aspects would improve the manuscript. Firstly, the conclusion that Trim32 regulates c-Myc mRNA stability could be expanded and corroborated by further mechanistic studies:

      1. Studies investigating whether Trim32 binds directly to c-Myc RNA. Moreover, although possibly beyond the scope of this study, an unbiased screening of RNA species binding to Trim32 would be informative.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      If possible, studies in which the overexpression of different mutants presenting specific altered functional domains (NHL domain known to bind RNAs and Ring domain reportedly involved in protein ubiquitination) would be used to test if they are capable or incapable of rescuing the reported alteration of Trim32 KO cell lines in c-Myc expression and muscle maturation.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      An optional aspect that might be interesting to explore is whether the alterations in c-Myc expression observed in C2C12 might be replicated with primary myoblasts or satellite cells devoid of Trim32.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      I also have a few minor points to highlight:

        • It is unclear if the differences highlighted in graphs 5G, EV5D, and EV5E are statistically significant.

      Authors’ response. We thank the Reviewer for raising this point. We have now indicated the statistical analyses performed on the data presented in the mentioned figures (in line with a point also raised by Reviewer #3). Consistent with the conclusion that Trim32 is necessary for proper regulation of c-Myc transcript stability, two-way ANOVA on the data now reported in Figure 5G shows a statistically significant effect of the genotype at 6h (right-hand graph) but not at D0 (left-hand graph). In the graphs of Fig. EV5 D and E, no significant changes are observed at D0, whereas at 6h the data show a significant difference at the 40 min time point. We included this information in the graphs and in the corresponding legends.

      - On page 10, it is stated that c-Myc down-regulation cannot rescue KO myotube morphology fully nor increase the differentiation index significantly, but the corresponding data is not shown. Could the authors include those quantifications in the manuscript?

      Authors’ response. As suggested, we included the graph showing the differentiation index upon c-Myc silencing in the Trim32 KO clones and in the WT clones, as a novel panel in Figure 6 (Fig. 6D). As already reported in the text, a partial recovery of differentiation index is observed but the increase is not statistically significant. In contrast, no changes are observed applying the same silencing in the WT cells. Legend and text were modified accordingly.

      Reviewer #1 (Significance (Required)):

      The manuscript offers several strengths. It provides novel mechanistic insight by identifying a previously unrecognized role for Trim32 in regulating c-Myc mRNA stability during the onset of myogenic differentiation. The study is supported by a robust methodology that integrates CRISPR/Cas9 gene editing, transcriptomic profiling, flow cytometry, biochemical assays, and rescue experiments using siRNA knockdown. Furthermore, the work has a disease relevance, as it uncovers a mechanistic link between Trim32 deficiency and impaired myogenesis, with implications for the pathogenesis of LGMDR8.

      At the same time, the study has some limitations. The findings rely exclusively on the C2C12 myoblast cell line, which may not fully represent primary satellite cell or in vivo biology. The functional rescue achieved through c-Myc knockdown is only partial, restoring Myogenin expression but not the full differentiation index or morphology, indicating that additional mechanisms are likely involved. Although evidence supports a role for Trim32 in mRNA destabilization, the precise molecular partners-such as RNA-binding activity, microRNA involvement, or ligase function-remain undefined. Some discrepancies with previous studies, including Trim32-mediated protein degradation of c-Myc, are acknowledged but not experimentally resolved. Moreover, functional validation in animal models or patient-derived cells is currently lacking. Despite these limitations, the study represents an advancement for the field. It shifts the conceptual framework from Trim32's canonical role in protein ubiquitination to a novel function in RNA regulation during myogenesis. It also raises potential clinical implications by suggesting that targeting the Trim32-c-Myc axis, or modulating c-Myc stability, may represent a therapeutic strategy for LGMDR8.

      This work will be of particular interest to muscle biology researchers studying myogenesis and the molecular basis of muscle disease, RNA biology specialists investigating post-transcriptional regulation and mRNA stability, and neuromuscular disease researchers and clinicians seeking to identify new molecular targets for therapeutic intervention in LGMDR8.

      The Reviewer expressing this opinion is an expert in muscle stem cells, muscle regeneration, and muscle development.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      In this study, the authors sought to investigate the molecular role of Trim32, a tripartite motif-containing E3 ubiquitin ligase often associated with its dysregulation in Limb-Girdle Muscular Dystrophy Recessive 8 (LGMDR8), and its role in the dynamics of skeletal muscle differentiation. Using a CRISPR-Cas9 model of Trim32 knockout in C2C12 murine myoblasts, the authors demonstrate that loss of Trim32 alters the myogenic process, particularly by impairing the transition from proliferation to differentiation. The authors provide evidence in the way of transcriptomic profiling that displays an alteration of myogenic signaling in the Trim32 KO cells, leading to a disruption of myotube formation in vitro. Interestingly, while previous studies have focused on Trim32's role in protein ubiquitination and degradation of c-Myc, the authors provide evidence that Trim32-regulation of c-Myc occurs at the level of mRNA stability. The authors show that the sustained c-Myc expression in Trim32 knockout cells disrupts the timely expression of key myogenic factors and interferes with critical withdrawal of myoblasts from the cell cycle required for myotube formation. Overall, the study offers a new insight into how Trim32 regulates early myogenic progression and highlights a potential therapeutic target for addressing the defects in muscular regeneration observed in LGMDR8.

      We thank the Reviewer for valuing our work and for their appreciated suggestions to improve our manuscript. We have carefully addressed some of the concerns raised as detailed here, while others, which require more laborious experimental efforts, will be addressed as reported in the Revision Plan.

      Major Comments:

      The work is a bit incremental based on this:

      https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030445

      And this:

      https://www.nature.com/articles/s41418-018-0129-0

      To their credit, the authors do cite the above papers.

      Authors’ response. We thank the Reviewer for this careful evaluation of our work against the current literature and for recognising the contribution of our findings to the complex picture of myogenesis, in which the involvement of Trim32 and c-Myc, and of the Trim32-c-Myc axis, can occur at several stages and likely in narrow time windows along the process, thus possibly explaining some inconsistencies among reports.

      The authors do provide compelling evidence that Trim32 deficiency disrupts C2C12 myogenic differentiation and sustained c-Myc expression contributes to this defective process. However, while knockdown of c-Myc does restore Myogenin levels, it was not sufficient to normalize myotube morphology or differentiation index, suggesting an incomplete picture of the Trim32-dependent pathways involved. The authors should qualify their claim by emphasizing that c-Myc regulation is a major, but not exclusive, mechanism underlying the observed defects. This will prevent an overgeneralization and better align the conclusions with the author's data.

      Authors’ response. We agree with the Reviewer and have modified the phrasing that implied the Trim32-c-Myc axis is the exclusive mechanism, explicitly indicating in the Abstract and in the Discussion that other pathways contribute to proper myogenesis.

      The Abstract now reads: “… suggesting that the Trim32–c-Myc axis may represent an essential hub, although likely not the exclusive molecular mechanism, in muscle regeneration within LGMDR8 pathogenesis.”

      The Discussion now reads: “Functionally, we demonstrated that c-Myc contributes to the impaired myogenesis observed in Trim32 KO clones, although this is clearly not the only factor involved in the Trim32-mediated myogenic network; realistically other molecular mechanisms can participate in this process as also suggested by our transcriptomic results.”

      The authors provide a thorough and well-executed interrogation of cell cycle dynamics in Trim32 KO clones, combining phospho-histone H3 flow cytometry of DNA content, and CFSE proliferation assays. These complementary approaches convincingly show that, while proliferation states remain similar in WT and KO cells, Trim32-deficient myoblasts fail in their normal withdrawal from the cell cycle during exposure to differentiation-inducing conditions. This work adds clarity to a previously inconsistent literature and greatly strengthens the study.

      Authors’ response. We thank the Reviewer for appreciating our thorough analyses on cell cycle dynamics in proliferation conditions and at the onset of the differentiation process.

      The transcriptomic analysis (detailed in the "Transcriptomic analysis of Trim32 WT and KO clones along early differentiation" section of Results) is central to the manuscript and provides strong evidence that Trim32 deficiency disrupts normal differentiation processes. However, the description of the pathway enrichment results is highly detailed and somewhat compressed, which may make it challenging for readers to follow the key biological 'take-homes'. The narrative quickly moves across their multiple analyses like MDS, clustering, heatmaps, and bubble plots without pausing to guide the reader through what each analysis contributes to the overall biological interpretation. As a result, the key findings (reduced muscle development pathways in KO cells and enrichment of cell cycle-related pathways) can feel somewhat muted. The authors may consider reorganizing this section, so the primary biological insights are highlighted and supported by each of their analyses. This would allow the biological implications to be more accessible to a broader readership.

      Authors’ response. We thank the Reviewer for raising this point and apologise for being too brief in describing the data, leaving indeed some points excessively implicit. As suggested, we have now reorganised this section and added the lists of enriched canonical pathways relative to WT vs KO comparisons at D0 and D3 (Fig. EV3B) as well as those relative to the comparison between D0 and D3 for both WT and Trim32 KO samples (Fig. EV3C), with their relative scores. We changed the Results section “Transcriptomic analysis of Trim32 WT and Trim32 KO clones along early differentiation” as reported here below and modified the legends accordingly.

      The paragraph now reads: Based on our initial observations, the absence of Trim32 already exerts a significant impact by day 3 (D3) of C2C12 myogenic differentiation. To investigate how Trim32 influences early global transcriptional changes during the proliferative phase (D0) and early differentiation (D3), we performed an unbiased transcriptomic profiling of WT and Trim32 KO clones (Fig. 2A). Multidimensional Scaling (MDS) analysis revealed clear segregation of gene expression profiles based on both time of differentiation (Dim1, 44% variance) and Trim32 genotype (Dim2, 16% variance) (Fig. 2A). Likewise, hierarchical clustering grouped WT and Trim32 KO clones into distinct clusters at both timepoints, indicating consistent genotype-specific transcriptional differences (Fig. EV3A). Differentially Expressed Genes (DEGs) were detected in the Trim32 KO transcriptome relative to WT, at both D0 and D3. In proliferating conditions, 72 genes were upregulated and 189 were downregulated whereas at D3 of differentiation, 72 genes were upregulated and 212 were downregulated. Ingenuity Pathway Analysis of the DEGs revealed the top 10 Canonical Pathways displayed in Fig. EV3B as enriched at either D0 or D3 (Fig. EV3B). Several of these pathways can underscore relevant Trim32-mediated functions though most of them represent generic functions not immediately attributable to the observed myogenesis defects.

      Notably, the transcriptional divergence between WT and Trim32 KO cells is more pronounced at D3, as evidenced by a greater separation along the MDS Dim2 axis, suggesting that Trim32-dependent transcriptional regulation intensifies during early differentiation (Fig. 2A). Given our interest in the differentiation process, we therefore focused our analyses on comparing the changes occurring from D0 to D3 in WT (WT D3 vs. D0) and in Trim32 KO (KO D3 vs. D0) RNAseq data.

      Pathway enrichment analysis of D3 vs. D0 DEGs allowed the selection of the top-scored pathways for both WT and Trim32 KO data. We obtained 18 top-scored pathways enriched in each genotype (-log(p-value) ≥ 9 cut-off): 14 are shared while 4 are top-ranked only in WT and 4 only in Trim32 KO (Fig. EV3C). For the following analyses, we thus employed a total of 22 distinct pathways. To better mine those that are relevant in the passage from the proliferation stage to early differentiation and that are affected by the lack of Trim32, we built a bubble plot comparing side-by-side the scores and enrichment of the 22 selected top-scored pathways in WT and Trim32 KO (Fig. 2B). A heatmap of DEGs included within these selected pathways confirms the clustering of the samples by both genotype and timepoint, highlighting gene expression differences (Fig. 2C). These pathways are mainly related to muscle development, cell cycle regulation, genome stability maintenance and a few other metabolic cascades.

      As expected given the results related to Figure 1, moving from D0 to D3 WT clones showed robust upregulation of key transcripts associated with the Inactive Sarcomere Protein Complex, a category encompassing most genes in the “Striated Muscle Contraction” pathway, while in Trim32 KO clones this pathway was not among those enriched in the transition from D0 to D3 (Fig. EV3C). Detailed analyses of transcripts enclosed within this pathway revealed that on the transition from proliferation to differentiation, WT clones show upregulation of several Myosin Heavy Chain isoforms (e.g., MYH3, MYH6, MYH8), α-Actin 1 (ACTA1), α-Actinin 2 (ACTN2), Desmin (DES), Tropomodulin 1 (TMOD1), and Titin (TTN), a pattern consistent with previous reports, while these same transcripts were either non-detected or only modestly upregulated in Trim32 KO clones at D3 (Fig. 2D). This genotype-specific disparity was further confirmed by gene set enrichment barcode plots, which demonstrated significant enrichment of these muscle-related transcripts in WT cells (FDR_UP = 0.0062), but not in Trim32 KO cells (FDR_UP = 0.24) (Fig. EV3D). These findings support an early transcriptional basis for the impaired myogenesis previously observed in Trim32 KO cells.

      In addition to differences in muscle-specific gene expression, we also observed that several pathways related to cell proliferation and cell cycle regulation were more enriched in Trim32 KO cells compared to WT. This suggests that altered cell proliferation may contribute to the distinct differentiation behavior observed in Trim32 KO versus WT (Fig. 2B). Given that cell cycle exit is a critical prerequisite for the onset of myogenic differentiation and considering that previous studies on the role of Trim32 in cell cycle regulation have reported inconsistent findings, we further examined cell cycle dynamics under our experimental conditions to clarify the contribution of Trim32 to this process.

      The work would be greatly strengthened by the inclusion of LGMDR8 primary cells, and rescue experiments of TRIM32 to explore myogenesis.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      Also, EU (5-ethynyl uridine) pulse-chase experiments to label nascent and stable RNA coupled with MYC pulldowns and qPCR (or RNA-sequencing of both pools) would further enhance the claim that MYC stability is being affected.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      "On one side, c-Myc may influence early stages of myogenesis, such as myoblast proliferation and initial myotube formation, but it may not contribute significantly to later events such as myotube hypertrophy or fusion between existing myotubes and myocytes. This hypothesis is supported by recent work showing that c-Myc is dispensable for muscle fiber hypertrophy but essential for normal MuSC function (Ham et al, 2025)." Also address and discuss the following, as what is currently written is not entirely accurate: https://www.embopress.org/doi/full/10.1038/s44319-024-00299-z and https://journals.physiology.org/doi/prev/20250724-aop/abs/10.1152/ajpcell.00528.2025

      Authors’ response. We thank the Reviewer for bringing these two publications to our attention; they indeed add important data that help recapitulate the in vivo complexity of the role of c-Myc in myogenesis. We included this point in our Discussion.

      The Discussion now reads: “On one side, c-Myc may influence early stages of myogenesis, such as myoblast proliferation and initial myotube formation, but it may not contribute significantly to later events such as myotube hypertrophy or fusion between existing myotubes and myocytes. This hypothesis is supported by recent work showing that c-Myc is dispensable for muscle fiber hypertrophy but essential for normal MuSC function (Ham et al, 2025). Other reports, instead, demonstrated the implication of c-Myc periodic pulses, mimicking resistance-exercise, in muscle growth, a role that cannot though be observed in our experimental model (Edman et al., 2024; Jones et al., 2025).”

      Minor Comments:

      Z-score scale used in the pathway bubble plot (Figure 2C) could benefit from alternative color choices. Current gradient is a bit muddy and clarity for the reader could be improved by more distinct color options, particularly in the transition from positive to negative Z-score.

      Authors’ response. As suggested, we modified the z-score-representing colors using a more distinct gradient especially in the positive to negative transition in Figure 2B.

      Clarification on the rationale for selecting the "top 18" pathways would be helpful, as it is not clear if this cutoff was chosen arbitrarily or reflects a specific statistical or biological threshold.

      Authors’ response. As now better explained (see comment regarding Major point: Transcriptomics), we used a cut-off of -log(p-value) above or equal to 9 for pathways enriched in DEGs of the D0 vs D3 comparison for both WT and Trim32 KO. The threshold is now included in the Results section and the pathways (shared between WT and Trim32 KO and unique) are listed as Fig. EV3C.
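
      The selection arithmetic described above (18 top-scored pathways per genotype, 14 shared, 22 distinct) can be illustrated with a minimal sketch. The pathway names and scores below are hypothetical placeholders, not the values listed in Fig. EV3C; only the cut-off and the counts are taken from the text.

```python
# Hypothetical -log(p-value) scores; names and values are placeholders
# chosen only to reproduce the counts stated in the text.
CUTOFF = 9.0

def top_pathways(scores, cutoff=CUTOFF):
    """Return the set of pathways whose -log(p-value) is >= cutoff."""
    return {name for name, score in scores.items() if score >= cutoff}

# 14 pathways pass the cut-off in both genotypes.
shared = {f"shared_{i}": 12.0 for i in range(14)}

# 4 pathways pass only in WT and 4 pass only in Trim32 KO.
wt_scores = {
    **shared,
    **{f"wt_only_{i}": 9.0 for i in range(4)},   # exactly at the cut-off: kept
    **{f"ko_only_{i}": 5.0 for i in range(4)},   # below the cut-off in WT
}
ko_scores = {
    **shared,
    **{f"ko_only_{i}": 10.5 for i in range(4)},
    **{f"wt_only_{i}": 3.0 for i in range(4)},   # below the cut-off in KO
}

wt_top = top_pathways(wt_scores)
ko_top = top_pathways(ko_scores)
distinct = wt_top | ko_top   # union used for the side-by-side comparison

print(len(wt_top), len(ko_top), len(wt_top & ko_top), len(distinct))
```

      The union is what makes the total 22 rather than 18: 14 shared pathways plus 4 unique to each genotype.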

      The authors alternate between using "Trim 32 KO clones" and "KO clones" throughout the manuscript. Consistent terminology across figures and text would improve readability.

      Authors’ response. We thank the Reviewer for this remark, and we apologise for having overlooked it. We amended this throughout the manuscript by always using for clarity “Trim32 KO clones/cells”.

      Cell culture methodology does not specify passage number or culture duration (only "At confluence") before differentiation. This is important, as C2C12 differentiation potential can drift with extended passaging.

      Authors’ response. We agree with the Reviewer that C2C12 passaging can reduce the differentiation potential of this myoblast cell line; this is indeed the main reason why we decided to employ WT clones, which underwent the same editing process as those that resulted mutated in the Trim32 gene, as reference controls throughout our study. We apologise for not indicating the passages in the first version of the manuscript, which is now amended in the Methods section as follows:

      The C2C12 parental cells used in this study were maintained within passages 3–8. All clonal cell lines (see below) were utilized within 10 passages following gene editing. In all experiments, WT and Trim32 KO clones of comparable passage numbers were used to ensure consistency and minimize passage-related variability.

      Reviewer #2 (Significance (Required)):

      General Assessment:

      This study provides a thorough investigation of Trim32's role in the processes related to skeletal muscle differentiation using a CRISPR-Cas9 knockout C2C12 model. The strengths of this study lie in the multi-layered experimental approach as the authors incorporated transcriptomics, cell cycle profiling, and stability assays which collectively build a strong case for their hypothesis that Trim32 is a key factor in the normal regulation of myogenesis. The work is also strengthened by the use of multiple biological and technical replicates, particularly the independent KO clones which helps address potential clonal variation issues that could occur. The largest limitation to this study is that, while the c-Myc mechanism is well explored, the other Trim32-dependent pathways associated with the disruption (implicated by the incomplete rescue by c-Myc knockdown) are not as well addressed. Overall, however, the study convincingly identifies a critical function for Trim32 during skeletal muscle differentiation.

      Advance:

      To my knowledge, this is the first study to demonstrate the mRNA stability level of c-Myc regulation by Trim32, rather than through the ubiquitin-mediated protein degradation. This work will advance the current understanding and provide a more complete understanding of Trim32's role in c-Myc regulation. Beyond c-Myc, this work highlights the idea that TRIM family proteins can influence RNA stability which could implicate a broader role in RNA biology and has potential for future therapeutic targeting.

      Audience:

      This research will be of interest to an audience that focuses on broad skeletal muscle biology but primarily to readers with more focused research such as myogenesis and neuromuscular disease (LGMDR8 in particular) where the defined Trim32 governance over early differentiation checkpoints will be of interest. It will also provide mechanistic insights to those outside of skeletal muscle that study TRIM family proteins, ubiquitin biology, and RNA regulation. For translational/clinical researchers, it identifies the Trim32/c-Myc axis as a potential therapeutic target for LGMDR8 and related muscular dystrophies.

      Expertise:

      My expertise lies in skeletal muscle biology, gene editing, transgenic mouse models, and bioinformatics. I feel confident evaluating the data and conclusions as presented.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      In this paper, the authors examine the role of TRIM32, implicated in limb girdle muscular dystrophy recessive 8 (LGMDR8), in the differentiation of C2C12 mouse myoblasts. Using CRISPR, they generate mutant and wild-type clones and compare their differentiation capacity in vitro. They report that Trim32-deficient clones exhibit delayed and defective myogenic differentiation. RNA-seq analysis reveals widespread changes in gene expression, although few are validated by independent methods. Notably, Trim32 mutant cells maintain residual proliferation under differentiation conditions, apparently due to a failure to downregulate c-Myc. Translation inhibition experiments suggest that TRIM32 promotes c-Myc mRNA destabilization, but this conclusion is insufficiently substantiated. The authors also perform rescue experiments, showing that c-Myc knockdown in Trim32-deficient cells alleviates some differentiation defects. However, this rescue is not quantified, was conducted in only two of the three knockout lines, and is supported by inappropriate statistical analysis of gene expression. Overall, the manuscript in its current form has substantial weaknesses that preclude publication. Beyond statistical issues, the major concerns are: (1) exclusive reliance on the immortalized C2C12 line, with no validation in primary/satellite cells or in vivo, (2) insufficient mechanistic evidence that TRIM32 acts directly on c-Myc mRNA, and (3) overinterpretation of disease relevance in the absence of supporting patient or in vivo data. Please find more details below:

      We thank the Reviewer for the in-depth assessment of our work and precious suggestions to improve the manuscript. We have carefully addressed some of the concerns raised, as detailed here, while others, which require more experimental efforts, will be addressed as detailed in the Revision Plan.

      - TRIM32 complementation / rescue experiments to exclude clonal or off-target CRISPR effects and show specificity are lacking.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      - The authors link their in vitro findings to LGMDR8 pathogenesis and propose that the Trim32-c-Myc axis may serve as a central regulator of muscle regeneration in the disease. However, LGMDR8 is a complex disorder, and connecting muscle wasting in patients to differentiation assays in C2C12 cells is difficult to justify. No direct evidence is provided that the proposed mRNA mechanism operates in patient-derived samples or in mouse satellite cells. Moreover, the partial rescue achieved by c-Myc knockdown (which does not fully restore myotube morphology or differentiation index) further suggests that the disease connection is not straightforward. Validation of the TRIM32-c-Myc axis in a physiologically relevant system, such as LGMD patient myoblasts or Trim32 mutant mouse cells, would greatly strengthen the claim.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      -Some gene expression changes from the RNA-seq study in Figure 2 should be validated by qPCR

      Authors’ response. We thank the reviewer for this suggestion. This point will be addressed as detailed in the Revision Plan. We have selected several transcripts that will be evaluated in independent samples in order to validate the RNAseq results.

      - The paper shows siRNA knockdown of c-Myc in KO restores Myogenin RNA/protein but does not fully rescue myotube morphology or differentiation index. This suggests that Trim32 controls additional effectors beyond c-Myc; yet the authors do not pursue other candidate mediators identified in the RNA-seq. The manuscript would be strengthened by systematically testing whether other deregulated transcripts contribute to the phenotype.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      - There are concerns with experimental/statistical issues and insufficient replicate reporting. The authors use unpaired two-tailed Student's t-test across many comparisons; multiple testing corrections or ANOVA where appropriate should be used. In Figure EV5B and Figure 6B, the authors perform statistical analyses with control values set to 1. This method masks the inherent variability between experiments and artificially augments p values. Control sample values need to be normalized to one another to have reliable statistical analysis. Myotube morphology and differentiation index quantifications need clear description of fields counted, blind analysis, and number of biological replicates.

      Authors’ response. We thank the Reviewer for raising this point.

      Regarding the replicates, we clarified in the Methods and Legends that the Trim32 KO experiments have been performed on 3 biological replicates (independent clones) and the same for the reference control (3 independent WT clones), except for the Fig. 6 experiments that were performed on 2 Trim32 KO and 2 WT clones. All the Western Blots, immunofluorescence, qPCR data are representative of the results of at least 3 independent experiments unless otherwise stated. We reported the number and type of replicates as well as the microscope fields analyzed.

      We repeated the statistical analyses of the data in Figure 5G, EV5D, and EV5E using the more appropriate two-way ANOVA test, as suggested, and we now report this information in the graphs and legends.

      We thank the Reviewer for raising this point. We agree and have replaced the graphs in Fig. EV5B and 6B with versions showing the control values normalised as suggested. The statistical analyses now reflect this change.
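
      The Reviewer's concern about setting control values to 1 can be seen in a small numerical sketch. The raw values below are invented for illustration and are not data from the study. Dividing each experiment by its own control leaves the control group with zero variance, whereas dividing by the mean of the controls retains the spread between experiments.

```python
from statistics import mean, stdev

# Hypothetical raw measurements (arbitrary units) from three experiments.
control = [80.0, 100.0, 120.0]
treated = [150.0, 210.0, 230.0]

# Per-experiment normalisation: every control value becomes exactly 1,
# so the variability between experiments disappears from the control group.
per_exp_control = [c / c for c in control]
per_exp_treated = [t / c for t, c in zip(treated, control)]

# Normalisation to the mean of the controls: the control group still
# averages 1, but its experiment-to-experiment spread is preserved.
ctrl_mean = mean(control)
norm_control = [c / ctrl_mean for c in control]
norm_treated = [t / ctrl_mean for t in treated]

print("SD of controls, per-experiment normalisation:", stdev(per_exp_control))
print("SD of controls, mean-of-controls normalisation:", stdev(norm_control))
```

      With the second scheme a test comparing the two groups sees the real variance of the controls, which is what the request to normalise control values to one another is meant to achieve.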

      -Some English mistakes require additional read-throughs. For example: "Indeed, Trim32 has no effect on the stability of c-Myc mRNA in proliferating conditions, but upon induction of differentiation the stability of c-Myc mRNA resulted enhanced in Trim32 KO clones (Fig. 5G, Fig. EV5D and 5E)."

      Authors’ response. We re-edited this revised version of the manuscript as suggested.

      -Results in Figure 5A should be quantified

      Authors’ response. We amended this point by quantifying the results shown in Fig. 5A and adding a graph of the quantification of 3 experimental replicates to the Figure. Quantification confirms that no statistically significant difference is observed. The Figure and the relative legend are modified accordingly.

      -Based on the nuclear marker p84, the separation of cytoplasmic and nuclear fractions is not ideal in Figure 5D

      Authors’ response. We agree with the Reviewer that the presence of p84 also in the cytoplasmic fraction is not ideal. Regrettably, we observed this faint p84 band in all the experiments performed. We think however, that this is not impacting on the result that clearly shows that c-Myc and Trim32 are never detected in the same compartment.

      -In Figure 6, it is not appropriate to perform statistical analyses on only two data points per condition.

      Authors’ response. We agree with the Reviewer and we now show the graph of the results of the 3 technical replicates for 2 biological replicates and do not indicate any statistics (Fig. 6B). The graph was also modified according to a previous point raised.

      -The nuclear MYOG phenotype is very interesting; could this be related to requirements of TRIM32 in fusion?

      Authors’ response. We agree with the Reviewer that Trim32 might also be necessary for myoblast fusion. This point is however beyond the scope of the present study and will be addressed in future work.

      - The hypothesis that TRIM32 destabilizes c-Myc mRNA is intriguing but requires stronger mechanistic support. This would be more convincing with RNA immunoprecipitation to test direct association with c-Myc mRNA, and/or co-immunoprecipitation to identify interactions between TRIM32 and proteins involved in mRNA stability. The study would also be strengthened by reporter assays, such as c-Myc 3′UTR luciferase constructs in WT and KO cells, to directly demonstrate 3′UTR-dependent regulation of mRNA stability.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      Reviewer #3 (Significance (Required)):

      The manuscript presents a minor conceptual advance in understanding TRIM32 function in myogenic differentiation. Its main limitation is that all experiments were performed in C2C12 cells. While C2C12 are a classical system to study muscle differentiation, they are an immortalized, long-cultured, and genetically unstable line that represents a committed myoblast stage rather than bona fide satellite cells. They therefore do not fully model the biology of early regenerative responses. Several TRIM32 phenotypes reported in the literature differ between primary satellite cells and cell lines, and the authors themselves note such discrepancies. Extrapolating these findings to LGMDR8 pathogenesis without validation in primary human myoblasts, satellite cell assays, or in vivo regeneration models is therefore not justified. Previous work has already established clear roles for TRIM32 in mouse satellite cells in vivo and in patient myoblasts in vitro, whereas this study introduces a novel link to c-Myc regulation during differentiation. In addition, without mechanistic evidence, the central claim that TRIM32 regulates c-Myc mRNA stability remains descriptive and incomplete. Nevertheless, the results will be of interest to researchers studying LGMD and to those exploring TRIM32 biology in broader contexts. I review this manuscript as a muscle biologist with expertise in satellite cell biology and transcriptional regulation.

      Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Reply to the Reviewers

      I thank the Referees for their...

      Referee #1

      1. The authors should provide more information when...

      Responses + The typical domed appearance of a hydrocephalus-harboring skull is apparent as early as P4, as shown in a new side-by-side comparison of pups at that age (Fig. 1A). + Though this is not stated in the MS 2. Figure 6: Why has only...

      Response: We expanded the comparison

      Minor comments:

      1. The text contains several...

      Response: We added...

      Referee #2


    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Conceptually, I feel that the authors addressed many concerns. However, I am still not convinced that their data support the strength of their claims. Additionally, I spent considerable time investigating the now freely available code and data and found several inconsistencies that would be critical to rectify. My comments are split into two parts, reflecting concerns related to the responses/methods and concerns resulting from investigation of the provided code/data. The former is described in the public review above. Because I show several figures to illustrate some key points for the latter part, an attached file will provide the second part: https://elife-rp.msubmit.net/elife-rp_files/2025/02/24/00136468/01/136468_1_attach_15_2451_convrt.pdf

      (1) This point is discussed in more detail in the attached file, but there are some important details regarding the identification of the learned trial that require more clarification. For instance, isn’t the original criterion by Gibbon et al. (1977) the first “sequence of three out of four trials in a row with at least one response”? The authors’ provided code for the Wilcoxon signed rank test and nDkl thresholds looks for a permanent exceeding of the threshold. So, I am not yet convinced that the approaches used here and in prior papers are directly comparable.

      We agree that there remain unresolved issues with our two attempts to create criteria that match that used by Gibbon and Balsam for trials to criterion. Therefore, we have decided to remove those analyses and return to our original approach showing trials to acquisition using several different criteria so as to demonstrate that the essential feature of the results—the scaling between learning rate and information—is robust. Figure 2A shows the results for a criterion that identifies the trial after which the cumulative response rate during the CS (=cumulative CS response count from Trial 1 divided by cumulative CS time from Trial 1) is consistently above the cumulative overall response rate across the trial (i.e., including both the CS and ITI). These data compare the CS response rate with the overall response rate, rather than with ITI rate as done in the previous version (in Figure 3A of that submission), to be consistent with the subsequent comparisons that are made using the nDkl. (The nDkl relies on the comparison between the CS rate and the overall rate, rather than between the CS and ITI rates.) Figures 2B and 2C show trials to acquisition when two statistical criteria, based on the nDkl, are applied to the difference between CS and overall response rates (the criteria are for odds >= 4:1 and p<.05). As we now explain in the text, a statistical threshold is useful inasmuch as it provides some confidence to the claim that the animals had learned by a given trial. However, this trial is very likely to be after the point when they had learned because accumulating statistical evidence of a difference necessarily adds trials.
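
      The first criterion described above can be sketched as follows, assuming per-trial response counts and durations for the CS and the ITI are available as plain lists. The function name, argument names, and toy data are hypothetical, and this is not the authors' code. "Consistently above" is interpreted here as holding on every trial from the returned trial through the last one.

```python
from itertools import accumulate

def trials_to_acquisition(cs_counts, cs_times, iti_counts, iti_times):
    """Return the first trial (1-indexed) from which the cumulative CS
    response rate stays above the cumulative overall rate until the end
    of training, or None if the final trial does not meet the condition."""
    cum_cs_n = list(accumulate(cs_counts))
    cum_cs_t = list(accumulate(cs_times))
    # Overall rate pools CS and ITI responses and time.
    cum_all_n = list(accumulate(c + i for c, i in zip(cs_counts, iti_counts)))
    cum_all_t = list(accumulate(c + i for c, i in zip(cs_times, iti_times)))

    above = [
        (n_cs / t_cs) > (n_all / t_all)
        for n_cs, t_cs, n_all, t_all in zip(cum_cs_n, cum_cs_t, cum_all_n, cum_all_t)
    ]

    # Walk backwards to find the start of the final unbroken run of True.
    first = None
    for trial in range(len(above), 0, -1):
        if not above[trial - 1]:
            break
        first = trial
    return first

# Toy data: 10 s CS and 40 s ITI on every trial.
cs_counts = [0, 1, 0, 2, 3, 4]
iti_counts = [2, 1, 3, 1, 1, 0]
cs_times = [10.0] * 6
iti_times = [40.0] * 6

print(trials_to_acquisition(cs_counts, cs_times, iti_counts, iti_times))
```

      In the toy data the CS rate briefly exceeds the overall rate on trial 2 and falls back on trial 3, so a first-crossing rule and a permanent-exceedance rule would give different answers, which is the comparability issue raised by the Reviewer.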

      Also, there’s still no regression line fitted to their data (Fig 3’s black line is from Fig 1, according to the legends). Accordingly, I think the claim in the second paragraph of the Discussion that the old data and their data are explained by a model with “essentially the same parameter value” is not yet convincing without actually reporting the parameters of the regression. Related to this, the regression for their data based on my analysis appears to have a slope closer to -0.6, which does not support strict timescale invariance. I think that this point should be discussed as a caveat in the manuscript.

      We now include regression lines fitted to our data in Figures 2A-C, and their slopes are reported in the figure note. We also note on page 14 of the revision that these regressions fitted to our data diverge from the black regression line (slope -1) as the informativeness increases. On pages 14-15, we offer an explanation for this divergence; that, in groups with high informativeness, the effective informativeness is likely to be lower than the assigned value because the rats had not been magazine trained which means they would not have discovered the food pellet as soon as it was released on the first few trials. On pages 15-16, we go on to note that evidence for a change in response rate during the CS in those very first few trials may have been missed because the initial response rates were very low in rats trained with very long inter-reinforcement intervals (and thus high informativeness). We also propose a solution to this problem of comparing between very low response rates, one that uses the nDkl to parse response rates into segments (clusters of trials with equivalent response rates). This analysis with parsed response rates provides evidence that differential responding to the CS may have been acquired earlier than is revealed using trial-by-trial comparisons.
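
      The slope comparison discussed above can be made concrete with an ordinary least-squares fit on log-transformed axes. This is a sketch under stated assumptions: the data points are synthetic, the constant 300 is arbitrary, and the variable names are hypothetical. A relation of the form trials = k / informativeness gives a slope of -1 on log-log axes, matching the black reference line; a shallower relation gives a slope closer to the -0.6 reported by the Reviewer.

```python
import math

def loglog_slope(x, y):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Synthetic informativeness values (not the experimental groups).
informativeness = [2.0, 5.0, 10.0, 20.0, 50.0]

# Inverse proportionality: slope of -1 on log-log axes.
trials_invariant = [300.0 / i for i in informativeness]
slope_inv, intercept_inv = loglog_slope(informativeness, trials_invariant)

# A shallower power law, for comparison.
trials_shallow = [300.0 * i ** -0.6 for i in informativeness]
slope_shallow, _ = loglog_slope(informativeness, trials_shallow)

print(round(slope_inv, 3), round(slope_shallow, 3))
```

      Reporting the fitted slope with its confidence interval, rather than overlaying a reference line alone, is what allows a reader to judge how far the data depart from -1.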

      (2) The authors report in the response that the basis for the apparent gradual/multiple step-like increases after initial learning remains unclear within their framework. This would be important to point out in the actual manuscript. Further, the responses indicating the fact that there are some phenomena that are not captured by the current model would be important to state in the manuscript itself.

      We have included a paragraph (on page 26) that discusses the interpretation of the steady/multi-step increase in responding across continued training.

      (3) There are several mismatches between results shown in figures and those produced by the authors’ code, or other supplementary files. As one example, rat 3 results in Fig 11 and Supplementary Materials don’t match and neither version is reproduced by the authors’ code. There are more concerns like this, which are detailed in the attached review file.

      Addressed next….

      The following is the response to the points raised in Part 2 of Reviewer 1’s pdf.

      (1a) I plotted the calculated nDkl with the provided code for rat 3 (Fig 11), but it looks different, and the trials to acquisition also didn’t match with the table provided (average of ~20 trial difference). The authors should revise the provided code and plots. Further, even in their provided figures, if one compares rat 3 in Supplementary Materials to data from the same rat in Fig 11, the curves are different. It is critical to have reproducible results in the manuscript, including the ability to reproduce with the provided code.

      We apologise for those inconsistencies. We have checked the code and the data in the figures to ensure they are all now consistent and match the full data in the nHT.mat file in OSF. Figures 11 and 12 from the previous version are now replaced with Figure 6 in the revised manuscript (still showing data from Rats 3 and 176). The data plotted in Fig 6 match what is plotted in the supplementary figures for those 2 rats (but with slightly different cropping of the x-axes) and all plots draw directly from nHT.mat.

      (1b) I tried to replicate also Fig 3C with the results from the provided code, but I failed especially for nDkl > 2.2. Fig 3A and B look to be OK.

      There was an error in the previous Fig 3C, which was plotting the data from the wrong column of the Trials2Acquisition Table. We suspect this arose because some changes to the file were not updated in Dropbox. However, that figure has changed (now Figure 2) as already mentioned, and no longer plots data obtained with that specific nDkl criterion. The figure now shows criteria that do not attempt to match the Gibbon and Balsam criterion.

      (1c) The trials to learn from the code do match with those in the Trials2Acquisition Table, but the authors’ code doesn’t reproduce the reported trials to learn values in the nDkl Acquisition Table. The trials to learn from the code are ~20 trials different on average from the table’s ones, for 1:20, 1:100, and 1:1000 nDkl.

      We agree that discrepancies between those different files were a source of potential confusion because they were using different criteria or different ways of measuring response rate (i.e., the “conventional” calculation of rate as number of responses/time, vs our adjusted calculation in which the 1<sup>st</sup> response in the CS was excluded as well as the time spent in the magazine, vs parsed response rates based on inter-response intervals). To avoid this, there is now a single table called Acquisition_Table.xlsx in OSF that includes Trials to acquisition for each rat based on a range of criteria or estimates of response rate in labelled columns. The data shown in Figure 2 are all based on the conventional calculation of response rate (provided in Columns E to H of Acquisition_Table.xlsx). To make the source of these data explicit, we have provided in OSF the matlab code that draws the data from the nHT.mat file to obtain these values for trials-to-acquisition.

      (1d) The nDkl Acquisition Table has columns with the value of the nDkl statistics at various acquisition landmarks, but the value does not look to be true, especially for rat 19. The nDkl curve provided by the authors (Supplementary Materials) doesn’t match the values in the table. The curve is below 10 until at least 300 trials, while the table reports a value higher than 20 (24.86) at the earliest evidence of learning (~120 trials?).

      We are very grateful to the reviewer for finding this discrepancy in our previous files. The individual plots in the Supplementary Materials now contain a plot of the nDkl computed using the conventional calculation of response rate (plot 3 in each 6-panel figure) and a plot of the nDkl computed using the new adjusted calculation of response rate (plot 4). These correspond to the signed nDkl columns for each rat in the full data file nHT.mat. The nDkl values at different acquisition landmarks included in Acquisition_Table.xlsx (Cols AB to AF) correspond to the second of these nDkl formulations. We point out that, of the acquisition landmarks based on the conventional calculation of response rate (Cols E to J of Acquisition_Table.xlsx), only the first two landmarks (CSrate>Contextrate and min_nDkl) match the permanently positive and minimum values of the plotted nDkl values. This is because the subsequent acquisition landmarks are based on a recalculation of the nDkl starting from the trial when CSrate>ContextRate, whereas the plotted nDkl starts from Trial 1.

      (2) The cumulative number of responses during the trial (Total) in the raw data table is not measured directly, but indirectly estimated from the pre-CS period, as (cumNR_Pre*[cumITI/cumT_Pre]) + cumNR_CS (cumNR_Pre: cumulative nose-poke response number during pre-CS period; cumITI: cumulative sum of ITI duration; cumT_Pre: cumulative pre-CS duration; cumNR_CS: cumulative response number during CS), according to ‘Explanation of TbyTdataTable (MATLAB).docx’. Why not use the actual cumulative responses during the whole trial instead of using a noisier measure during a smaller time window and then scaling it for the total period?

      Unfortunately, the bespoke software used to control the experimental events and record the magazine activity did not record data continuously throughout the experiment. The ITI responses were only sampled during a specified time-window (the “pre-CS” period) immediately before each CS onset. Therefore, response counts across the whole ITI had to be extrapolated.
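The extrapolation the reviewer quotes, (cumNR_Pre*[cumITI/cumT_Pre]) + cumNR_CS, can be written out directly. The sketch below is only a restatement of that formula in Python; the function name and the example numbers are hypothetical, and the variable names mirror those in the reviewer's comment.

```python
def estimated_total_responses(cum_nr_pre, cum_iti, cum_t_pre, cum_nr_cs):
    """Estimate cumulative responses over the whole trial (ITI + CS).

    Pre-CS responses are scaled up by the ratio of total ITI time to
    sampled pre-CS time, then CS responses are added."""
    return cum_nr_pre * (cum_iti / cum_t_pre) + cum_nr_cs


# Made-up example: 12 pokes in 100 s of sampled pre-CS time, 600 s of
# ITI in total, and 30 pokes during the CS.
total = estimated_total_responses(
    cum_nr_pre=12, cum_iti=600, cum_t_pre=100, cum_nr_cs=30
)
print(total)  # 12 * 6 + 30 = 102.0
```

The scaling assumes that the response rate in the sampled pre-CS window is representative of the rate across the rest of the ITI, which is the source of the extra noise the reviewer mentions.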

      (3) Regarding the “Matlab code for Find Trials to Criterion.docx”:

      (a) What’s the rationale for not using all the trials to calculate nDkl but starting the cumulative summation from the earliest evidence trial (truncated)? Also, this procedure is not described in the manuscript, and this should be mentioned.

      The procedure was perhaps not described clearly enough in the previous manuscript. We have expanded that text to make it clearer (page 12) which includes the text…

      “We started from this trial, rather than from Trial 1, because response rate data from trials prior to the point of acquisition would dilute the evidence for a statistically significant difference in responding once it had emerged, and thereby increase the number of trials required to observe significant responding to the CS. The data from Rat 1 illustrates this point. The CS response rate of Rat 1 permanently exceeded its overall response rate on Trial 52 (when the nD<sub>KL</sub> also became permanently positive). The nD<sub>KL</sub>, calculated from that trial onwards, surpassed 0.82 (odds 4:1) after a further 11 trials (on Trial 63) and reached 1.92 (p < .05) on Trial 81. By contrast, the nD<sub>KL</sub> for this rat, calculated from Trial 1, did not permanently exceed 0.82 until Trial 83 and did not exceed 1.92 until Trial 93, adding 10 or 20 trials to the point of acquisition.”

      (3b) The authors' threshold is the trial when the nDkl value exceeds the threshold permanently.  What about using just the first pass after the minimum?

      Rat 19 provides one example where the nDkl was initially positive, and even exceeded the thresholds for odds 4:1 and p<.05, but was followed by an extended period when the nDkl was negative because the CS response rate was less than the overall response rate. It illustrates why the first trial on which the nDkl passes a threshold cannot be used as a reliable index of acquisition.
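The difference between a "first pass" criterion and a "permanently exceeds" criterion can be illustrated with a toy series. In this sketch the signed values are invented to mimic the pattern described for Rat 19 (an early excursion above threshold followed by a negative stretch); they are not that rat's data, and both function names are hypothetical. Only the 0.82 threshold (odds 4:1) is taken from the text.

```python
def first_crossing(values, threshold):
    """1-based index of the first value above threshold, or None."""
    for trial, value in enumerate(values, start=1):
        if value > threshold:
            return trial
    return None


def permanent_crossing(values, threshold):
    """1-based index from which every later value stays above threshold,
    or None if the final value is not above it."""
    found = None
    for trial in range(len(values), 0, -1):
        if values[trial - 1] > threshold:
            found = trial
        else:
            break
    return found


# Invented signed statistic: early positive blip, then negative, then a
# sustained rise.
signed_stat = [0.5, 1.0, 2.1, 0.3, -0.8, -1.5, -0.2, 0.9, 1.4, 2.0, 2.6]
early = first_crossing(signed_stat, 0.82)
stable = permanent_crossing(signed_stat, 0.82)
print(early, stable)
```

Here the first-pass rule would declare acquisition on Trial 2, even though the statistic is negative for several later trials; the permanent rule waits until Trial 8.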

      (3c) Can the authors explain why a value of 0.5 is added to the cumulative response number before dividing it by the cumulative time?

      This was done to provide an “unbiased” estimate of the response count because responses are integers. For example, if a rat has made 10 responses over 100 s of cumulative CS time, the estimated rate should be at least 10/100 but could be anything up to, but not including, 11/100. A rate of 10.5/100 is the unbiased estimate. However, we have now removed this step when calculating the nDkl to identify trials to acquisition because we recognise that it would represent a larger correction to the rate calculated across short intervals than across long intervals and therefore bias comparison between CS and overall response rates that involve very different time durations. As such, the correction would artefactually inflate evidence that the CS response rate was higher than the contextual response rate. However, as noted earlier in this reply, we have now instituted a similar correction when calculating the pre-CS response rate over the final 5 sessions for rats that did not register a single response (hence we set their response count to 0.5).
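The bias described here is simple arithmetic: adding 0.5 to a count changes the rate by 0.5 divided by the duration, so the shift is larger over a short cumulative CS time than over a long cumulative overall time. The sketch below reuses the 10 responses in 100 s from the example above; the overall figures (100 responses in 1000 s) and the function name are assumptions added for illustration.

```python
def rate(count, duration, correction=0.0):
    """Responses per second, with an optional additive count correction."""
    return (count + correction) / duration


# Same underlying rate (0.1 per second) over a short and a long interval.
cs_raw = rate(10, 100)
overall_raw = rate(100, 1000)

# With the +0.5 correction the short interval is inflated ten times more.
cs_adj = rate(10, 100, correction=0.5)         # 10.5 / 100
overall_adj = rate(100, 1000, correction=0.5)  # 100.5 / 1000

print(cs_raw == overall_raw)   # equal before correction
print(cs_adj > overall_adj)    # CS rate now looks higher
```

Two identical rates thus become unequal after correction, in the direction that favours the CS, which is the artefact the authors removed.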

      (3d) Although the authors explain that nDkl was set to negative if pre-CS rate is higher than CS rate, this is not included in the code because the code calculates the nDkl using the truncated version, starting to accumulate the poke numbers and time from the earliest evidence, thus cumulative CS rate is always higher than cumulative contextual rate. I expect then that the cumulative CS rate will be always higher than the cumulative pre-CS rate.

      Yes, that is correct. The negative sign is added to the nDkl when it is computed starting from Trial 1. But when it is computed starting from the trial when the CS rate is permanently > the overall rate, there is no need to add a sign because the divergence is always in the positive direction.

      (3e) Regarding the Wilcoxon signed rank test, please clarify in the manuscript that the input ‘rate’ is not the cumulative rate as used for the earliest evidence. Please also clarify if the rates being compared for the signed nDkl are just the instantaneous rates or the cumulative ones. I believe that these are the ‘cumulative’ ones (not as for Wilcoxon signed rank test), because if not, the signed nDkl curve of rat 3 would fluctuate a lot across the x-axis.

      The reviewer is correct in both cases. However, as already mentioned, we have removed the analysis involving the Wilcoxon test. The description of the nDkl already specifies that this was done using the cumulative rates.

      (4) Supplemental table ‘nDkl Acquisition Table.xlsx’ 3rd column (“Earliest”) descriptions are unclear.

      (a) It is described in the supplemental ‘Explanation of Excel Tables.docx’ as the ‘earliest estimate of the onset of a poke rate during the CSs higher than the contextual poke rate’, while the last paragraph of the manuscript’s method section says ‘Columns 4, 5 and 6 of the table give the trial after which conditioned responding appeared as estimated in the above described three different ways— by the location of the minimum in the nDkl, the last upward 0 crossings, and the CS parse consistently greater than the ITI parse, respectively. Column 3 in that table gives the minimum of the three estimates.’ I plotted the data from column 3 (right) and comparing them with Fig 3A (left) makes it clear that there’s an issue in this column. If the description in the ‘Explanation of Excel Tables.docx’ is incorrect, please update it.

      We agree that the naming of these criteria can cause confusion, hence we have changed them. On page 9 we have replaced “earliest” with “first” in describing the criterion plotted in Figure 2A showing the trial starting from which the cumulative CS response rate permanently exceeded the cumulative overall rate. What is labelled as “Earliest” in “Acquisition_Table.xlsx” is, as the explanation says, the minimum value across the 3 estimates in that table.

      (b) Also, the term ‘contextual poke rate’ in the 3rd column’s description is confusing as in the nDkl calculation it represents the poke rate during all the training time, while in the first paragraph of the ‘Data analysis’ part, the earliest evidence is calculated by comparing the ITI (pre-CS baseline) poke rate.

      Yes, we have kept the term “contextual” response rate to refer to responding across the whole training interval (the ITI and the CS duration). This is used in calculation of the nDkl. For consistency with this comparison, we now take the first estimate of acquisition (in Fig 2A) based on a comparison between the CS rate and the overall (context) rate (not the pre-CS rate).

      Reviewer #2 (Recommendations for the authors):

      In response to the Rebuttal comments:

      Analytical (1) relating to Figure 3C/D

      This is a reasonable set of alternative analyses, but it is not clear that it answers the original comment regarding why the fit was worse when using a theoretically derived measure. Indeed, Figure 3C now looks distinctly different to the original Gibbon and Balsam data in terms of the shape of the relationship (specifically, the Group Median - filled orange circles) diverge from the black regression line.

      As mentioned in response to Reviewer 1, there was a mistake in Figure 3C of the revised manuscript. The figure was actually plotting data using a more stringent criterion of nDkl > 5.4, corresponding to p<0.001. The figure was referencing the data in column J of the public Trials2Acquisition Table. The data previously plotted in Figure 3C are no longer plotted because we no longer attempt to identify a criterion exactly matching that used by Gibbon and Balsam.

      We agree that the data shown in the first 3 panels of Figure 2 do diverge somewhat from the black regression line at the highest levels of informativeness (C/T ratios > 70), and the regression lines fitted to the data have slopes greater than -1. We acknowledge this on page 14 of the revised manuscript. Since Gibbon and Balsam did not report data from groups with such high ratios, we can’t know whether their data too would have diverged from the regression line at this point. We now report in the text a regression fitted to the first 10 groups in our experiment, which have C/T ratios that coincide with those of Gibbon and Balsam, and those regression lines do have slopes much closer to -1 (and include -1 in the 95% confidence intervals). We believe the divergence in our data at the high C/T ratios may be due to the fact that our rats were not given magazine training before commencing training with the CS and food. Because of this, it is quite likely that many rats did not find the food immediately after delivery on the first few trials. Indeed, in subsequent experiments, when we have continued to record magazine entries after CS-offset, we have found that rats can take 90 s or more to enter the magazine after the first pellet delivery. This delay would substantially increase the effective CS-US interval, measured from CS onset to discovery of the food pellet by the rat, making the CS much less informative over those trials. We now make this point on pages 14-15 of the revised manuscript.

      Analytical (2)

      We may have very different views on the statistical and scientific approaches here.

      This scalar relationship may only be uniquely applicable to the specific parameters of an experiment where CS and US responding are measured with the same behavioral response (magazine entry). As such, statements regarding the simplicity of the number of parameters in the model may simply reflect the niche experimental conditions required to generate data to fit the original hypotheses.

      That our data are consistent with the data reported decades ago by Gibbon and Balsam indicates that the scalar relationship they identified is not unique to certain niche conditions, since any such special conditions would have to hold for both the acquisition of sign-tracking responses in pigeons and magazine entry responses in rats. Determining how broadly it applies will require further experimental work using different paradigms and different species to assess how the rate of acquisition is affected across a wide range of informativeness, just as we have done here.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):           

      Summary:

      The authors have created a new model of KCNC1-related DEE in which a pathogenic patient variant (A421V) is knocked into a mouse in order to better understand the mechanisms through which KCNC1 variants lead to DEE.  

      Strengths:

      (1)  The creation of a new DEE model of KCNC1 dysfunction. 

      (2)  In Vivo phenotyping demonstrates key features of the model such as early lethality and several types of electrographic seizures. 

      (3)  The ex vivo cellular electrophysiology is very strong and comprehensive including isolated patches to accurately measure K+ currents, paired recording to measure evoked synaptic transmission, and the measurement of membrane excitability at different time points and in two cell types.

      We thank Reviewer 1 for these positive comments related to strengths of the study.   

      Weaknesses:

      (1) The assertion that membrane trafficking is impaired by this variant could be bolstered by additional data.

      We agree with this comment. However, given the technical challenges of standard biochemical experiments for investigating voltage-gated potassium channels (e.g., antibody quality), the lack of a Kv3.1-A421V specific antibody, and the fact that Kv3.1 is expressed in only a small subset of cells, we did not undertake this approach. However, we did perform additional experiments and analysis to improve the rigor of the experiments supporting our conclusion that membrane trafficking is impaired in the Kcnc1-A421V/+ mouse. 

      Such experiments support a highly significant and robust difference in our (albeit imperfect) measurement of the membrane:cytosol ratio of Kv3.1 immunofluorescence between WT and Kcnc1-A421V/+ mice, which is consistent with impaired membrane trafficking (Figure 3). In the revised manuscript, we have added additional data points to this plot and updated the representative example images using improved imaging techniques to better showcase how Kcnc1-A421V/+ PV-INs differ from age-matched WT littermate controls. We think the result is quite clear. Future biochemical experiments, perhaps best performed in a culture system in vitro, could provide additional support for this conclusion.

      (2) In some experiments details such as the age of the mice or cortical layer are emphasized, but in others, these details are omitted.

      We apologize for this omission. We have now clarified the age of the mice and cortical layer for each experiment in the Methods and Results sections as well as figure legends.   

      (3) The impairments in PV neuron AP firing are quite large. This could be expected to lead to changes in PV neuron activity outside of the hypersynchronous discharges that could be detected in the 2-photon imaging experiments, however, a lack of an effect on PV neuron activity is only loosely alluded to in the text. A more formal analysis is lacking. An important question in trying to understand mechanisms underlying channelopathies like KCNC1 is how changes in membrane excitability recorded at the whole cell level manifest during ongoing activity in vivo. Thus, the significance of this work would be greatly improved if it could address this question.

      Yes, the impairments in neocortical PV-IN excitability are notably severe relative to other PV interneuronopathies that we and others have directly investigated (e.g., Kv3.1 or Kv3.2-/- knockout mice; Scn1a+/- mice). In the revised version of the manuscript, we have now added a more thorough in vivo 2P calcium imaging investigation and analysis of PV-IN (and presumptive excitatory cell) neural activity (Figure 8 and Supplementary Figure 9; Methods, lines 230-271; Results, lines 630-657; Discussion, lines 795-814). 

      Because of the prominent recruitment of neuropil during presumptive myoclonic seizures, further investigation of individual neuronal excitability in vivo required a slightly different labeling strategy now using a soma-tagged GCaMP8m as well as a separate AAV containing tdTomato driven by the PV-IN-specific S5E2 enhancer. Our new results reveal an increase in the baseline calcium transient frequency in non-PV-INs, and reduced mean transient amplitudes in both non-PV cells and PV-INs. These interesting findings, which are consistent with attenuated PV-IN-mediated perisomatic inhibition leading to disinhibited excitatory cells in the Kcnc1-A421V/+ mice, link our in vivo results to the slice electrophysiology experiments. Of course, there are residual issues with the application of this technique to interneurons and the ability to resolve individual or small numbers of spikes, which likely explains the lack of genotype difference in calcium transient frequency in PV-INs.

      (4) Myoclonic jerks and other types of more subtle epileptiform activity have been observed in control mice, but there is no mention of littermate control analyzed by EEG. 

      We performed additional experiments as requested and did not observe myoclonic jerks or any other epileptic activity in WT control mice. We have included this data in the revised manuscript (Figure 9C).   

      Reviewer #2 (Public review):           

      Summary:

      Wengert et al. generated and thoroughly characterized the developmental epileptic encephalopathy phenotype of Kcnc1A421V/+ knock-in mice. The Kcnc1 gene encodes the Kv3.1 channel subunit. Analogous to the role of BK channels in excitatory neurons, Kv3 channels are important for the recurrent high-frequency discharge in interneurons by accelerating the downward hyperpolarization of the individual action potential. Various Kcnc1 mutations are associated with developmental epileptic encephalopathy, but the effect of a recurrent A421V mutation was somewhat controversial and its influence on neuronal excitability has not been fully established. In order to determine the neurological deficits and underlying disease mechanisms, the authors generated cre-dependent KI mice and characterized them using neonatal neurological examination, high-quality in vitro electrophysiology, and in vivo imaging/electrophysiology analyses. These analyses revealed excitability defects in the PV+ inhibitory neurons associated with the emergence of epilepsy and premature death. Overall, the experimental data convincingly support the conclusion.

      Strengths:

      The study is well-designed and conducted at high quality. The use of the Cre-dependent KI mouse is effective for maintaining the mutant mouse line with premature death phenotype, and may also minimize the drift of phenotypes which can occur due to the use of mutant mice with minor phenotype for breeding. The neonatal behavior analysis is thoroughly conducted, and the in vitro electrophysiology studies are of high quality.

      We appreciate these positive comments from Reviewer 2. 

      Weaknesses:

      While not critically influencing the conclusion of the study, there are several concerns.

      In some experiments, the age of the animal in each experiment is not clearly stated. For example, the experiments in Figure 2 demonstrate impaired K+ conductance and membrane localization, but it is not clear whether they correlated with the excitability and synaptic defects shown in subsequent figures. Similarly, it is unclear at what age the mice underwent EEG recordings, and whether non-epileptic mice are younger than those with seizures. 

      We have now updated the manuscript to include a clear report of age for all experiments, including the impaired K<sup>+</sup> conductance (now Figure 3) and EEG (now Figure 9). There was no intention to omit this information. The recordings of K<sup>+</sup> conductance impairments in PV-INs from Kcnc1-A421V/+ mice were completed at P16-21. Thus, we interpret the loss of potassium current density to be causally linked with the impairments in intrinsic physiological function during that same time period in neocortical layer II-IV PV-INs and, more subtly, in PV-positive cells in the RTN and neocortical layer V PV-INs.

      Mice used in the EEG experiments were P24-48, an age range which roughly corresponded with the midpoint on the survival curve for Kcnc1-A421V/+ mice. Although we saw significant mouse-to-mouse variability in seizure phenotype, no Kcnc1-A421V/+ mice completely lacked epilepsy or marked epileptiform abnormalities, neither of which were seen in WT mice. We did not detect a clear relationship between seizure frequency/type and mouse age. 

      The trafficking defect of mutant Kv3.1 proposed in this study is based only on the fluorescence density analysis which showed a minor change in membrane/cytosol ratio. It is not very clear how the membrane component was determined (any control staining?). In addition to fluorescence imaging, an addition of biochemical analysis will make the conclusion more convincing (while it might be challenging if the Kv3.1 is expressed only in PV+ cells).

      This relates to comment 3 of Reviewer 1. We agree that, in the initial submission of the manuscript, the evidence from IHC for Kv3.1 trafficking deficits was somewhat subtle. In the revised version of the paper, we have gathered additional replicates of this original experiment with improved imaging quality and clarify how the membrane component was specified, to now show a robust and highly significant (***P<0.001) decrease in membrane:cytosol Kv3.1 ratio. We have also now provided new example images better showcasing the deficits observed in the Kcnc1-A421V/+ mice (Figure 3). The membrane compartment was defined as the outermost 1 micron of the parvalbumin-defined cell soma (drawn blind to the Kv3.1b signal), and, importantly, all analysis was conducted blinded to mouse genotype. These measures help to ensure that the result is robust and unbiased. Nonetheless, we have added a paragraph in the Discussion section highlighting the limitations of our IHC evidence for trafficking impairment (Lines 868-883). 

      While the study focused on the superficial layer because Kv3.1 is the major channel subunit, the PV+ cells in the deeper cortical layer also express Kv3.1 (Chow et al., 1999) and they may also contribute to the hyperexcitable phenotype via negative effect on Kv3.2; the mutant Kv3.1 may also block membrane trafficking of Kv3.1/Kv3.2 heteromers in the deeper layer PV cells and reduce their excitability. Such an additional effect on Kv3.2, if present, may explain why the heterozygous A421V KI mouse shows a more severe phenotype than the Kv3.1 KO mouse (and why they are more similar to Kv3.2 KO). Analyzing the membrane excitability differences in the deep-layer PV cells may address this possibility.

      We appreciate this thoughtful suggestion. We have now provided data from neocortical layer V PV interneurons in the revised manuscript (Supplementary Figure 5). Abnormalities in intrinsic excitability from neocortical layer V PV-INs in Kcnc1A421V/+ mice were present, but less pronounced than in PV-INs from more superficial cortical layers. These results are consistent with the view that greater relative expression of Kv3.2 “dilutes” the impact of the Kv3.1 A421V/+ variant. Whether the A421V/+ variant impairs membrane trafficking and/or gating of Kv3.2 remains unclear. 

      We attempted to assess how the mutant Kv3.1 affects Kv3.2 localization, but were unsuccessful due to the lack of reliable antibodies. After immunostaining mouse brain sections with two different anti-Kv3.2 antibodies, only one produced a somewhat promising signal (see below). However, even in this case, Kv3.2 staining was successful only once (out of five independent staining experiments) and the signal varied across cortical regions, showing widespread cellular Kv3.2 signal in some areas (b, top panel), and barely detectable signal in others, regardless of Kv3.1 expression. In the remaining four attempts, we detected only a ‘fiber-like’ immunostaining signal, further diminishing our confidence in the anti-Kv3.2 antibody, although results could be improved with further testing and refinement, which we will attempt. Consequently, this important question remains unresolved in this study. 

      Author response image 1.

      Immunostaining of Kv3.1 and Kv3.2 in sagittal mouse brain sections. a) An example of intracellular Kv3.2 immunostaining signal, variable across the cortex of a WT mouse independent of Kv3.1 expression. b) Kv3.2 is detectable intracellularly in most of the cells in the top panel but barely detectable in the lowest panel. c) Representative image of Kv3.2 immunostaining signal in other sagittal mouse brain sections.

      We have discussed these important implications and limitations of our results in the Discussion (Lines 868-883). We agree with the Reviewer’s interpretation that an impact on Kv3.1/Kv3.2 heteromultimers across the neocortex may explain why the Kcnc1A421V/+ mouse exhibits a more severe phenotype than Kv3.1-/- or Kv3.2-/- mice (see below), a view which we have attempted to further clarify in the Conclusion.    

      In Table 1, the A421V PV+ cells show a more depolarized resting membrane potential than WT by ~5 mV, which seems a robust change and would influence the circuit excitability. The authors measured firing frequency after adjusting the membrane voltage to -65mV, but are the excitability differences less significant if the resting potential is not adjusted? It is also interesting that such a membrane potential difference is not detected in young adult mice (Table 2). This loss of potential compensation may be important for developmental changes in the circuit excitability. These issues can be more explicitly discussed.

      We do not entirely understand this finding and its apparent developmental component. It could be compensatory, as suggested by the Reviewer; however, it is transient and seems to be an isolated finding (i.e., it is not accompanied by compensation in other properties). It is also possible that this change in Kcnc1-A421V/+ PV-INs may reflect impaired/delayed development. We cannot test excitability at a meaningfully later time point as the mice are deceased.

      The revised version of the manuscript contains additional data (Supplementary Figure 4) showing that major deficits in intrinsic excitability are still observed even when the resting membrane potential is left unadjusted. These results are further discussed in the Results section (lines 522-523) and the Discussion section (lines 727-731).   

      Reviewer #3 (Public review):           

      Summary:

      Here Wengert et al., establish a rodent model of KCNC1 (Kv3.1) epilepsy by introducing the A421V mutation. The authors perform video-EEG, slice electrophysiology, and in vivo 2P imaging of calcium activity to establish disease mechanisms involving impairment in the excitability of fast-spiking parvalbumin (PV) interneurons in the cortex and thalamic PV cells.

      Outside-out nucleated patch recordings were used to evaluate the biophysical consequence of the A421V mutation on potassium currents and showed a clear reduction in potassium currents. Similarly, action potential generation in cortical PV interneurons was severely reduced. Given that both potassium currents and action potential generation were found to be unaffected in excitatory pyramidal cells in the cortex the authors propose that loss of inhibition leads to hyperexcitability and seizure susceptibility in a mechanism similar to that of Dravet Syndrome.  

      Strengths: 

      This manuscript establishes a new rodent model of KCNC1-developmental and epileptic encephalopathy. The manuscript provides strong evidence that parvalbumin-type interneurons are impaired by the A421V Kv3.1 mutation and that cortical excitatory neurons are not impaired. Together these findings support the conclusion that seizure phenotypes are caused by reduced cortical inhibition.

      We thank Reviewer 3 for their view of the strengths of the study.

      Weaknesses:

      The manuscript identifies a partial mechanism of disease that leaves several aspects unresolved including the possible role of the observed impairments in thalamic neurons in the seizure mechanism. Similarly, while the authors identify a reduction in potassium currents and a reduction in PV cell surface expression of Kv3.1 it is not clear why these impairments would lead to a more severe disease phenotype than other loss-of-function mutations which have been characterized previously. Lastly, additional analysis of videoEEG data would be helpful for interpreting the extent of the seizure burden and the nature of the seizure types caused by the mutation.

      We agree with these comments from Reviewer 3. We studied neurons in the reticular thalamus and layer V neocortical PV-INs since they are also linked to epilepsy pathogenesis and are known to express Kv3.1. However, for most of the study, we focused on neocortical layer II-IV PV-INs, because these cells exhibited the most robust impairments in intrinsic excitability. Crossing our novel Kcnc1-Flox(A421V)/+ mice to a cerebral cortex interneuron-specific driver that would avoid recombination in the thalamus, such as Ppp1r2-Cre (RRID:IMSR_JAX:012686), could assist in determining the relative contribution of thalamic reticular nucleus dysfunction to the overall phenotype, as was done by Makinson et al. (2017) to address a similar question; however, we have been unable to obtain this mouse despite extensive effort. There are of course other Kv3.1-expressing neurons in the brain, including in the hippocampus, amygdala, and cerebellum, and we have provided additional discussion (Lines 731-736) of this issue.

      We further agree with the Reviewer that a major question in the field of KCNC1-related neurological disorders is the mechanistic underpinning of why the KCNC1-A421V variant leads to a more severe disease phenotype than other loss of function KCNC1 variants, and, further, why the mouse phenotype is more severe than the Kcnc1 knockout. Previous results and our own recordings in heterologous systems suggest that the A421V variant is more profoundly loss of function than the R320H variant (Oliver et al., 2017; Cameron et al., 2019; Park et al., 2019), which is consistent with A421V having a more severe disease phenotype. Relative to knockout of Kv3.1, our results are consistent with the view that the A421V exhibits dominant negative activity by reducing surface expression of Kv3.1 and/or Kv3.2 (an effect that would not occur in knockout mice), with a possible additional contribution of impairing gating of those Kv3.1-A421V variant containing Kv3.1/Kv3.2 heteromultimers by inclusion of A421V subunits into the heterotetramer. Our finding that the magnitude of total potassium current was reduced in PV-INs by ~50% is consistent with a combination of these various mechanisms but does not distinguish between them.

      In the revised version of the manuscript, we have provided a more complete discussion of these important remaining questions regarding our interpretation of how the severity of KCNC1 disorders relates to the biophysical features of the ion channel variant (lines 868-883).

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):          

      Major

      (1) The authors suggest that the reduced K+ current density in Kcnc1-A421V/+ neurons is due in part to impaired trafficking and cell surface expression of Kv3.1 in these neurons. The data supporting this claim aren't completely convincing. First, it's difficult to visualize a difference in Kv3.1 localization in the images shown in panel H, and importantly, it seems problematic that the method to assess Kv3.1 levels in membrane vs. cytosol relied on using PV co-staining to define the membrane compartment as the outermost 1 um of the PV-defined cell soma. This doesn't seem to be the best method to define the membrane compartment, as the PV signal should be largely cytosolic.

      As noted above, we have completed additional data collection to confirm our results, and have performed additional imaging and updated our example images to be more representative of the observed deficits in membrane Kv3.1 expression in the Kcnc1-A421V/+ mice. We attempted to identify a marker to more clearly label the membrane to combine with PV immunocytochemistry but were unable to do so despite some effort. 

      Is it possible that in control neurons, the cytosolic PV signal localizes within the membrane-bound Kv3.1 signal, with less colocalization, whereas in Kcnc1-A421V/+ neurons, there would be more colocalization of the cytosolic PV and improperly trafficked Kv3.1? Could the data be presented in this way showing altered colocalization of Kv3.1 with PV?

      We do not entirely understand the nature of this concern. In our experiments, we utilized the PV signal to determine the cell membrane and cytosolic compartments in an unbiased manner using a 1-micron shell traced around/outside the edge of the PV signal to define the membrane compartment, with the remainder of the area (minus the nuclear signal defined by DAPI) defined as the cytosol (see Methods 176-186). Because we did not identify any alterations in PV signal or correlation between PV immunohistochemistry and tdTomato expression in Cre reporter strains between WT and Kcnc1-A421V/+ mice, we believe that our strategy for determining membrane:cytosol ratio of Kv3.1 in an unbiased manner is acceptable (albeit of course imperfect). 
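
      As an illustration only, the shell-based membrane:cytosol measurement described above might be sketched as follows. This is a minimal pure-Python sketch, not the analysis code used in the study: the function names, the representation of masks as sets of (row, col) pixels, and the conversion of the 1-micron shell into a pixel radius (`shell_px`) are all assumptions made for the example.

```python
from statistics import mean


def dilate(mask, radius_px):
    """Return every pixel within radius_px (Euclidean) of a pixel in mask."""
    offsets = [
        (dr, dc)
        for dr in range(-radius_px, radius_px + 1)
        for dc in range(-radius_px, radius_px + 1)
        if dr * dr + dc * dc <= radius_px * radius_px
    ]
    return {(r + dr, c + dc) for (r, c) in mask for (dr, dc) in offsets}


def membrane_cytosol_ratio(image, pv_mask, nucleus_mask, shell_px):
    """Ratio of mean Kv3.1 intensity in the shell to that in the cytosol.

    image        -- 2D list of Kv3.1 intensities (hypothetical input format)
    pv_mask      -- set of (row, col) pixels covered by the PV signal
    nucleus_mask -- set of (row, col) pixels covered by the DAPI signal
    shell_px     -- shell thickness in pixels (assumed to correspond to ~1 micron)
    """
    n_rows, n_cols = len(image), len(image[0])
    # Membrane compartment: shell just outside the PV edge, clipped to the image.
    shell = {
        (r, c)
        for (r, c) in dilate(pv_mask, shell_px) - pv_mask
        if 0 <= r < n_rows and 0 <= c < n_cols
    }
    # Cytosol compartment: PV-defined soma minus the DAPI-defined nucleus.
    cytosol = pv_mask - nucleus_mask
    membrane_mean = mean(image[r][c] for (r, c) in shell)
    cytosol_mean = mean(image[r][c] for (r, c) in cytosol)
    return membrane_mean / cytosol_mean
```

      Because both compartments are derived from the PV and DAPI masks alone, the Kv3.1 channel does not influence where the boundaries are drawn, which is the sense in which such a measurement is unbiased.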

      Alternatively, membrane fractionation could be performed on WT vs Kcnc1-A421V/+ neurons, followed by Western blotting with a Kv3.1 antibody to show altered proportions in the cytosolic vs. membrane protein fractions. It's important that these results are convincing, as the findings are mentioned in the Abstract, the Results section, and multiple times in the Discussion, although it is still unclear how much the potential altered trafficking contributes to the decrease in K+ currents versus changes in channel gating.

      Multiple technical barriers made it difficult for us to gain direct biochemical evidence for altered trafficking of the A421V/+ Kv3.1 variant (see above). It is not clear how membrane fractionation techniques could be easily applied in this case (at least by us) when PV-INs constitute 3-5% of all neocortical neurons. We further agree (as noted above) that it is difficult to properly disentangle the relative roles of impaired membrane trafficking vs. gating deficits to the observed effect; however, we think that both phenomena are likely occurring. In the revised version of the manuscript, we have more explicitly discussed these limitations in the Discussion section (Lines 868-883).   

      (2) More information is needed regarding the age of mice used for experiments for the following results (added to the Results section as well as figure legends):

      PV density (Supplementary Figure 1) 

      K+ current data (Figure 2A-G)       

      Kv3.1 localization (Figure 2H and I)        

      RTN electrophysiology (Supplementary Figure 3)

      Excitatory neuron electrophysiology (Figure 4)             

      In vivo 2P calcium imaging (Figure 7) 

      Video-EEG (Figure 8)

      We apologize for omitting this critical information. In the revised manuscript, we have provided the age of mice for each of our experiments in the results section, in the figure legend, and in the methods section.   

      (3) It's unclear why developmental milestones/behavioral assessments were only done at P5-P10. In the previous publication of another Kcnc1 LOF variant (Feng et al. 2024), no differences were found at P5-P10, and it was suggested in the discussion that this finding was "consistent with the known developmental expression pattern of Kv3.1 in mouse, where Kv3.1 protein does not appear until P10 or later". In that paper, they did find behavioral deficits at 2-4 months. Even though this model is more severe than the previous model, it would be interesting to determine if there are any behavioral deficits at a later time point (especially as they find more neurophysiological impairments at P32-P42).

      As in our previous study, the lack of clear behavioral deficits in developmental milestones from P5-15 is potentially expected considering the developmental expression of Kv3.1, and we performed these experiments primarily to showcase that the Kcnc1-A421V/+ mice exhibit otherwise normal overall early development (although this could be an artifact of the sensitivity of our testing methods).

      For the revised manuscript, we have conducted additional experiments to investigate behavioral deficits in adult Kcnc1-A421V/+ mice. We found cognitive/learning deficits in Kcnc1-A421V/+ mice relative to WT in both the Barnes maze (Figure 2A-C) and Y-maze (Figure 2D-F). Other aspects of animal behavior, including cerebellar-related motor function, are likely also impaired at post-weaning timepoints and will be included in a forthcoming research study focusing on motor function in these mice.

      (4) In the Results section, it should be more clearly stated which cortical layer/layers are being studied. In some cases, it mentions layers 2-4, and in some, only layer 4, and in others, it doesn't mention layers at all. Toward the beginning of the Results section, the rationale for focusing on layers 2-4 to assess the effects of this variant should be well described and then, for each experiment, it should be stated which cortical layers were assessed. Related to this point, it seems electrophysiology was only done in layer 4; the rationale for this should also be included.

      We have now clarified which neocortical layers were under investigation in the study. All PV-INs were targeted in somatosensory layers II-IV, while excitatory neurons were either cortical layer IV spiny stellate cells or pyramidal cells. Paired recordings were also completed in layer IV. We have also more explicitly articulated our rationale for looking at PV-INs in layers II-IV to examine the cellular/circuit-level impact of Kv3.1 in a model of developmental and epileptic encephalopathy (Lines 487-491).

      (5) Kcnc1-A421V/+ PV neurons showed more robust impairments in AP shape and firing at P32-42 than at P16-21 (Figure 3), and only showed synaptic neurotransmission alterations at P32-42 (Figure 6). Thus, it's unclear why Kcnc1-A421V/+ excitatory neurons were only assessed at P16-21 (Figure 4 and Supplementary Figure 4 related to Figure 5), particularly if only secondary or indirect effects on this population would be expected.

      We appreciate this excellent point raised by the Reviewer and we have taken the suggestion to examine excitatory neurons at P32-42 in addition to the earlier juvenile timepoint. Our new results from the later timepoint are similar to our results at P16-21: Excitatory neurons show no statistically significant impairments in intrinsic excitability at either of the two timepoints examined (Supplementary Figure 7). This adds support to our original conclusion that PV-INs represent the major driver of disease pathology across development.   

      (6) The 2P calcium imaging experiments are potentially interesting, however, a relationship between these results and the electrophysiology results for PV neurons is lacking. Was there an attempt to assess the frequency and/or amplitude of calcium events specifically in PV neurons, outside of the hypersynchronous discharges, to determine whether there are differences between WT and Kcnc1-A421V/+, as was seen in the electrophysiological analyses? It does seem there are some key differences between the two experiments (age: later timepoint for 2P vs. P16-21 and P32-42, layer: 2/3 vs. 4, and PV marking method: virus vs. mouse line), but the electrophysiological differences reported were quite strong. Thus, it would be surprising if there were no alterations in calcium activity among the Kcnc1-A421V/+ PV neurons.

      In our initial experiments, the prominent neuropil GCaMP signal in Kcnc1-A421V/+ mice rendered it difficult to distinguish and accurately describe baseline neuronal excitability in PV-INs and non-PV cells. In our revised manuscript, we utilized a soma-tagged GCaMP8m and separately labeled PV-INs through S5E2-tdTomato. This strategy made it possible to assess the amplitude and frequency of calcium transients in both PV-positive and PV-negative cells in vivo. We have updated the description of our methods (lines 230-271) and our results (lines 630-657) in the revised manuscript.

      As noted above, our more detailed analysis of somatic calcium transients in PV-IN and non-PV cells during quiet rest (Figure 8 and Supplementary Figure 9) shows that PV-INs from Kcnc1-A421V/+ mice are abnormally excitable, with reduced transient amplitude relative to WT controls. Interestingly, non-PV cells also exhibited an increased calcium transient frequency and reduced amplitude, which is potentially consistent with reduced perisomatic inhibition causing disinhibition in cortical microcircuits. We again highlight that the slow kinetics of GCaMP, combined with the calcium buffering and brief spikes of PV-INs, render quantification of action potential frequency and comparisons between groups difficult.

      (7) As mentioned above, it would be helpful to state the time points or age ranges of these experiments to better understand the results and relate them to each other. For example, the 2P imaging showed apparent myoclonic seizures in 7/7 Kcnc1-A421V/+ mice (recorded for a total of 30-50 minutes/mouse), but the video-EEG showed myoclonic seizures in only 3/11 Kcnc1-A421V/+ mice (recorded for 48-72 hours/mouse). Were these experiments done at very different age ranges, so this difference could be due to some sort of progression of seizure types and events as the mice age? Is it possible these are not the same seizure types (even though they are similarly described)? This discrepancy should be discussed.

      Mice in the EEG experiments were between the ages of P24 and 48, slightly younger than the age in which we carried out the in vivo calcium imaging experiments (>P50). Therefore, an age-related exacerbation in myoclonic jerks is possible. 

      As is highlighted by the Reviewer, it is interesting that the myoclonic seizures were only detected in a portion of the Kcnc1-A421V/+ mice during EEG monitoring (4/12). We believe that the difference is most likely driven by more sensitive detection of the myoclonic jerk activity and behavior in the 2P imaging of neuropil cellular activity compared to our video-EEG monitoring and 2P imaging of soma-tagged GCaMP. We have occasionally observed repetitive myoclonic jerking in mice that appears highly localized (i.e. one forepaw only), suggesting that the myoclonic seizures exist on a spectrum of severity from focal to diffuse. It is therefore possible that myoclonic events and electrographic activity may be slightly underestimated in our video-EEG experiments.

      We have now added a few lines discussing this discrepancy in the Discussion (lines 809-814).

      (8) Myoclonic jerks and other types of more subtle epileptiform activity have been observed in control mice. Was video-EEG performed on control mice? These data should be added to Figure 8.

      We have added recordings in control WT mice (N=4). We did not detect myoclonic jerks or other epileptiform activity in the control mice (Figure 9).  

      Minor

      (1) In the first Results section, Line 365, the P value (P<0.001) is different from that in the legend for Figure 1, line 743 (P<0.0001).

      We have fixed this discrepancy. 

      (2) For Supplementary Figure 1, it would be helpful to show images that span the cortical layers (1-6), as PV and Kv3.1 are both expressed across the cortical layers.

      We have updated Supplementary Figure 1 with better example images that span the cortical layers.    

      (3) Error bars should be added to the line graphs in Supplementary Figure 2, particularly panels B and C. Some of the differences appear small considering the highly significant p-values (i.e. body weight at P7 and brain weight at P21).

      The values shown in Supplementary Figure 2D-E are percentages of mice displaying a particular characteristic, so there is no variance for the data.

      Supplementary Figure 2B-C actually do contain error bars plotted as SEM, however, because of the large number of N and small degree of variance in the measurements, the error bars are not apparent in the graphs. This has been noted in the Supplementary Figure 2 legend for clarity. 
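
      To make the point about barely visible error bars concrete, the standard error of the mean (SEM) is the sample standard deviation divided by the square root of N, so it shrinks as the number of animals grows. The helper below is a minimal sketch; the function name and any values passed to it are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(N)."""
    return stdev(values) / sqrt(len(values))
```

      For a fixed spread of measurements, multiplying N by 25 shrinks the SEM by roughly a factor of five, which is why error bars on large, low-variance groups can be smaller than the plotted symbols.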

      (4) In Figure 3, although the Kcnc1-A421V/+ neurons have elevated AP amplitudes relative to WT, the representative traces for P16-21 and P32-42 groups appear strikingly opposite (traces in B and G appear to have much higher amplitudes than those in C and H). As this is one of the three AP phenotypes described, it would be nice to have it reflected in the traces.

      We have updated our example traces to better represent our main findings including AP amplitude for both P16-21 and P32-42 timepoints.  

      (5) Were any effects on the AHP assessed in the electrophysiology experiments? As other studies have reported the effects of altered Kv3 channel activity on AHP, this parameter could be interesting to report as well.

      We have now provided data on the afterhyperpolarization for each condition displayed in the Supplementary data tables. Interestingly, we failed to detect significant differences in AHP between WT and Kcnc1-A421V/+ PV-INs, RTN neurons, or pyramidal cells, although we did identify differences in the dV/dt of the repolarization phase of the AP.   

      (6) The figure legend for Figure 7 has errors in the panel labeling (D instead of C, and two Fs).

      This error has been corrected in the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      Specific comments and questions for the authors:         

      (1) Do the authors provide a reason for why the juvenile animals are unaffected by the A421V mutation? Is it that PV cells have not fully integrated at this early time point or that Kv3.1 expression is low? Is the developmental expression profile of Kv3.1 in PV cells known and if so could the authors update the discussion with this information?

      We interpret the normal early developmental milestones (P5-P15) to reflect that Kcnc1-A421V/+ mice exhibit the onset of their neurological impairment at the same time that PV-INs upregulate Kv3.1, develop a fast-spiking physiological phenotype, and integrate into functional circuits in the third and fourth postnatal weeks. We have updated the discussion (Line 780-782) with this information and more clearly describe our interpretation of these early-life behavioral experiments.   

      (2) I would like to see a more complete analysis of the Video-EEG data that is included in Figure 8. What was the seizure duration and frequency? Were there spike-wave seizure types observed? Were EEG events that involve thalamocortical circuitry affected such as spindles? Was sleep architecture impaired in the model? Were littermate control animals recorded?

      Although classical convulsive seizures represent only part of the overall epilepsy phenotype that this mouse exhibits, we agree that reporting seizure duration and frequency is important. We have now included this in our revised manuscript (line 624-626). We have also now added WT control mice to our dataset, and, as expected, we failed to observe any epileptic features in our WT recordings.

      In our EEG experiments, we did not record EMG activity in the mouse to allow for unambiguous determination of sleep vs. quiet wakefulness. For that reason, and because we believe it beyond the scope of this particular study, we did not examine sleep-related EEG phenomena such as spindles or sleep architecture. We have, however, added a line in the discussion (line 771-774) suggesting that future studies focus on a more thorough investigation of the EEG activity in these animals. 

      (3) The in vivo calcium imaging data shows synchronous bursts in A421V animals which is in agreement with the synchronous bursts observed in the EEG. Overall the analysis of the in vivo calcium imaging data appears to be rudimentary and perhaps this is a missed opportunity. What additional insights were gained from this technically demanding experiment that were not obtained from the EEG recordings?

      As noted above, in the revised version of the manuscript, we have conducted additional experiments which allowed us to separately examine PV-IN and non-PV neuron excitability via 2P in vivo calcium imaging. This required an alternative strategy to label individual neuronal somata without contamination by the robust neuropil signal that we observed in the approach undertaken in the original submission. We’ve described the details of this new approach in methods (Lines 230-271) and results section (lines 630-657).

      Our new results (Figure 8 and Supplementary Figure 9) reveal that, during quiet wakefulness, neocortical PV-INs from Kcnc1-A421V/+ mice exhibit a reduction in calcium transient amplitude and that non-PV cells exhibit altered transient frequency and amplitude. Overall, we believe that these results are consistent with the view that PV-IN-mediated perisomatic inhibition is compromised in Kcnc1-A421V/+ mice, which leads to a downstream hyperexcitability in excitatory neurons within cortical microcircuits.

      (4) The increased severity of seizure phenotypes observed in the A421V model relative to knockout mice is interesting but also confusing given what is known about this mutation. As the authors point out, a possible explanation is that the mutation is acting in a dominant negative manner, where mutant Kv3.1 channels compete with other Kvs that would otherwise be able to partially compensate for the loss of Kv function. Alternatively, the A421V mutation might act by affecting the trafficking of heterotetrameric Kv3 channels to the membrane. Can the authors clarify why a trafficking deficit would produce a different effect than a loss of function mutation? Are the authors proposing that a hypomorphic mutation involving both a partial trafficking deficit and a dominant negative effect of those channels that are properly localized is more severe than a "clean" loss of function? The roughly 50% loss of potassium current absent a change in gating would be expected to behave like a loss-of-function mutation. This might be addressed by comparing the surface expression of the other Kv channels and/or through the use of Kv3.1-selective pharmacology.

      These are excellent points raised by the Reviewer. As noted above, we have endeavored to clarify our hypothesis as to the basis of this phenomenon, although the mechanistic basis for the more severe phenotype in the Kcnc1-A421V/+ mouse relative to the Kv3.1 knockout is not entirely clear. Our physiology results and the evidence presented supporting a trafficking impairment are consistent with dominant negative action of the Kv3.1 A421V variant at the level of channel gating and/or trafficking. To restate, we think the Kcnc1-A421V/+ heterozygous variant is more severe than a Kv3.1 knockout for (at least) three reasons: variant Kv3.1 is incorporated into Kv3.1/Kv3.2 heterotetramers to (1) impair trafficking to the membrane as well as (2) alter the electrophysiological function of those channels that do successfully traffic to the membrane (while Kv3.1 knockout affects Kv3.1 only), and (3) the heterozygous variant may escape the compensatory upregulation of Kv3.2 that is known to occur in Kv3.1 knockout mice.

      For example, our data are consistent with the view that heterotetramers of WT Kv3.1 and Kv3.2 potentially come together with the A421V Kv3.1 subunit in the endoplasmic reticulum and then fail to traffic to the membrane due to the presence of one or more A421V subunit(s), as evidenced by increased Kv3.1 staining in the cytosol in the Kcnc1-A421V/+ mouse relative to WT. This is in contrast to what would occur in the Kv3.1 knockout mice, as there is no subunit produced from the null allele to prevent WT Kv3.2 subunits from forming fully functional Kv3.2 homotetramers that then reach the cell surface and function properly. This is one specific possible mechanism for dominant negative activity.

      A non-mutually-exclusive mechanism is that inclusion of one or more Kv3.1 A421V subunits into Kv3 heterotetramers impairs gating and prevents potassium flux such that, even if the tetramer does reach the membrane, that entire tetramer fails to contribute to the total potassium current. This is another possible mechanism for dominant negative function of the A421V subunit.

      Experimental elucidation of the precise mechanism of the dominant negative activity of the A421V Kcnc1 variant is beyond the scope of this study; yet, our lab is continuing to work on this. It will likely require dose-response experiments in which various ratios of WT and Kv3.1 A421V subunits are co-expressed in heterologous cells and then recorded for an overall effect on potassium current similar to (Clatot et al., 2017).

      In the revised manuscript, we have updated our discussion of these mechanistic considerations for KCNC1-related epilepsy syndromes in lines 868-883 in the Discussion. 

      References

      Cameron JM et al. (2019) Encephalopathies with KCNC1 variants: genotype-phenotype-functional correlations. Annals of Clinical and Translational Neurology 6:1263–1272.

      Clatot J, Hoshi M, Wan X, Liu H, Jain A, Shinlapawittayatorn K, Marionneau C, Ficker E, Ha T, Deschênes I (2017) Voltage-gated sodium channels assemble and gate as dimers. Nature Communications 8.

      Makinson CD, Tanaka BS, Sorokin JM, Wong JC, Christian CA, Goldin AL, Escayg A, Huguenard JR (2017) Regulation of Thalamic and Cortical Network Synchrony by Scn8a. Neuron 93:1165-1179.e6.

      Oliver KL et al. (2017) Myoclonus epilepsy and ataxia due to KCNC1 mutation: Analysis of 20 cases and K+ channel properties. Annals of Neurology 81.

      Park J et al. (2019) KCNC1-related disorders: new de novo variants expand the phenotypic spectrum. Annals of Clinical and Translational Neurology 6:1319–1326.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) A detailed comparison between this work and the work of Sun et al. on experimental protocols and reagents in the main text will be beneficial for readers to assess critically.

      We have added a Key Reagents Table outlining the key reagents used in our study. In terms of experimental protocols, we replicated those described by Sun et al. in most instances and described any differences when present. With this resubmission, we included additional ZnMP accumulation experiments in liquid media (see point 3 below).

      (2) The GaPP used by Sun et al. (purchased from Frontier Scientific) is more effective in killing the worm than the one used in this study (purchased from Santa Cruz). Is the different outcome due to the differences in reagents? Moreover, Sun et al. examined the lethality after 3-4 days, while this work examined the lethality after 72 hours. Would the extra 24 hours make any difference in the result?

      We now cite product vendor differences as a possible reason for the observed difference in worm death, as the reviewer suggests, on page 8 (see text below) and include these differences in the Key Reagents Table. We also now stress the fact that our experiments included different doses of GaPP and the use of eat-2 mutants as an additional control, which we believe adds rigor and demonstrates the potency of GaPP in our experiments. We decided on assessment at 72 hours, as we deemed it a less nebulous time point as compared to 3-4 days. Most of the observed worm death occurred earlier in this interval, so we believe it is unlikely that large group differences would emerge after an additional 24 hours.

      “Exposing worms to GaPP, a toxic heme analog, we observed that nematodes deficient in HRG-9 and HRG-10 displayed increased survival compared to WT worms, consistent with prior work,[13] though the between-group difference was markedly smaller in our study. We required higher GaPP concentrations to induce lethality, potentially due to product vendor differences, but did observe a clear dose-dependent effect across strains. Although it was previously proposed that the survival benefit seen in worms lacking HRG-9 and HRG-10 resulted from reduced transfer from intestinal cells after GaPP ingestion, our data suggest the reduced lethality is more likely due to decreased environmental GaPP uptake. Supporting this notion, DKO worms exhibited lawn avoidance, reduced pharyngeal pumping, and modestly lower intestinal ZnMP accumulation when exposed to this fluorescent heme analog on agar plates. In liquid media, DKO worms demonstrated higher fluorescence, but only in ZnMP-free conditions, suggesting the presence of gut granule autofluorescence. Furthermore, survival following exposure to GaPP was highest in eat-2 mutants, despite heme trafficking being unaffected in this strain.”

      (3) This work reported the opposite result of Sun et al. for the fluorescent ZnMP accumulation assay. However, the experimental protocols used by the two studies are massively different. Sun et al. did the ZnMP staining by incubating the L4-stage worms in an axenic mCeHR2 medium containing 40 μM ZnMP (purchased from Frontier Scientific) and 4 μM heme at 20 ℃ for 16 h, while this work placed the L4-stage worms on the OP50 E. coli seeded NGM plates treated with 40 μM ZnMP (purchased from Santa Cruz) for 16 h. The liquid axenic mCeHR2 medium is bacteria-free, heme-free, and consistent for ZnMP uptake by worms. This work has mentioned that the hrg-9 hrg-10 double null mutant has bacterial lawn avoidance and reduced pharyngeal pumping phenotypes. Therefore, the ZnMP staining protocol used in this work faces challenges in the environmental control for the wild type vs. the mutant. The authors should adopt the ZnMP staining protocol used by Sun et al. for a proper evaluation of fluorescent ZnMP accumulation.

      We agree with this comment. As such, we performed the ZnMP assay in liquid media conditions, as now described on page 13:

      “For liquid media experiments, three generations of worms were cultured in regular heme (20 µM) axenic media, with the first two generations receiving antibiotic-supplemented media (10 mg/ml tetracycline) and the 3rd generation cultivated without antibiotic. L4 worms from the 3rd generation were placed in media containing 40 µM ZnMP for 16 hours before being prepared and mounted for imaging as above. Worms were imaged on Zeiss Axio Imager 2 at 40x magnification, with image settings kept uniform across all images. Fluorescent intensity was measured within the proximal region of the intestine using ImageJ.”

      In heme-free media, both WT and DKO worms invariably entered L1 arrest; thus, we were not able to replicate the results reported by Sun et al. Using media containing heme, we did see an increase in fluorescence, but this was only in the ZnMP-free condition, indicating that the increased signal was attributable to autofluorescence. This is a known phenomenon associated with gut granules in C. elegans in the setting of oxidative stress. The results of these experiments are now summarized on page 6:

      “DKO nematodes at the L4 larval stage were previously shown to accumulate the fluorescent heme analog zinc mesoporphyrin IX (ZnMP) in intestinal cells in low-heme (4 µM) liquid media. While attempting to replicate this experiment, we observed that both wildtype and DKO nematodes entered L1 arrest under these conditions. Therefore, to allow for developmental progression, we grew worms on standard OP50 E. coli plates and in media containing physiological levels of heme (20 µM). We then examined whether differences in ZnMP uptake persisted under these basal conditions. DKO worms grown on ZnMP-treated E. coli plates displayed significantly reduced intestinal ZnMP fluorescence compared to N2 (Figure 1B and C). Using basal heme media with ZnMP, there was no significant difference in ZnMP fluorescence between DKO and wildtype nematodes, although DKO worms grown in media without ZnMP exhibited significantly higher autofluorescence (Figure 1D and E). To test whether autofluorescence may have contributed to the higher fluorescent intensities previously reported in heme-deficient DKO worms, we repeated this experiment on agar plates under starved conditions but did not observe a difference between groups (Figure 1B).”

      (4) A striking difference between the two studies is that Sun et al. emphasize the biochemical function of TANGO2 homologs in heme transporting with evidence from some biochemical tests. In contrast, this work emphasizes the physiological function of TANGO2 homologs with evidence from multiple phenotypical observations. In the discussion part, the authors should address whether these observed phenotypes in this study can be due to the loss of heme transporting activities upon eliminating TANGO2 homologs. This action can improve the merit of academic debate and collaboration.

      Thank you for this suggestion. The following text has been added to the Discussion section (page 9):

      “In addition to altered pharyngeal pumping, DKO worms displayed multiple previously unreported phenotypic features, suggesting a broader metabolic impairment and reminiscent of some clinical manifestations observed in patients with TDD. Elucidating the mechanisms underlying this phenotype, and whether they reflect a core bioenergetic defect, is an active area of investigation in our lab. Several C. elegans heme-responsive genes have been characterized, revealing relatively specific defects in heme uptake or utilization rather than broad organismal dysfunction. For example, hrg-1 and hrg-4 mutants exhibit impaired growth only under heme-limited conditions,[23] and hrg-3 loss affects brood size and embryonic viability specifically when maternal heme is scarce.[24] By contrast, hrg-9 and hrg-10 mutants exhibit the most severe organismal phenotypes of the hrg family to date, including reduced pharyngeal pumping, decreased motility, shortened lifespan, and smaller broods, even when fed a heme-replete diet.”

      Reviewer #2 (Public review):

      (1) The manuscript is written mainly as a criticism of a previously published paper. Although reproducibility in science is an issue that needs to be acknowledged, a manuscript should focus on the new data and the experiments that can better prove and strengthen the new claims.

      Thank you for this suggestion. While the primary intent of this study was to replicate key findings from the 2022 publication by Sun et al., the revised manuscript now emphasizes underlying mechanisms more broadly rather than focusing narrowly on that prior publication.

      (2) The current presentation of the logic of the study and its results does not help the authors deliver their message, although they possess great potential.

      We have attempted to rectify this through substantial revision of the Discussion section and other places throughout the manuscript.

      (3) The study is missing experiments to link hrg-9 and hrg-10 more directly to bioenergetic and oxidative stress pathways.

      The reviewer is correct in this assertion, but it was not our intent to definitively prove this link or, indeed, the primary mechanism of TANGO2 in the present manuscript. This said, we are actively engaged in this endeavor in our lab and anticipate these data will be published in a separate, forthcoming publication.

      We have added additional references pertaining to hrg-9 enrichment as part of the mitochondrial unfolded protein response (page 10) and a comparison of the phenotype observed in hrg-9 and hrg-10 deficient worms versus those lacking other proteins in the hrg family (page 9).

      Reviewer #3 (Public review):

      (1) The authors stress - with evidence provided in this paper or indicated in the literature - that the primary role of TANGO2 and its homologues is unlikely to be related to heme trafficking, arguing that observed effects on heme transport are instead downstream consequences of aberrant cellular metabolism. But in light of a mounting body of evidence (referenced by the authors) connecting more or less directly TANGO2 to heme trafficking and mobilization, it is recommended that the authors comment on how they think TANGO2 could relate to and be essential for heme trafficking, albeit in a secondary, moonlighting capacity. This would highlight a seemingly common theme in emerging key players in intracellular heme trafficking, as it appears to be the case for GAPDH - with accumulating evidence of this glycolytic enzyme being critical for heme delivery to several downstream proteins.

      TANGO2 is essential for mitochondrial health, albeit in a yet unknown capacity. In the absence of TANGO2, defects in heme trafficking may be secondary sequelae of mitochondrial dysfunction. We would point out that prior studies that attempted to show that TANGO2 and its homologs are involved in heme trafficking proposed very different mechanisms (direct binding vs. membrane protein interaction) and relied on artificially low or high heme conditions to produce these effects. We have attempted to address these more clearly in the Discussion section and have added a fifth figure to summarize our current unifying theory for how heme levels and mitochondrial stress may be linked.

      (2) The observation - using eat-2 mutants and lawn avoidance behaviour - that survival patterns can be partially explained by reduced consumption, is fascinating. It would be interesting to quantify the two relative contributions.

      We have completed additional ZnMP experiments in liquid media at the reviewers’ request. This experimental condition eliminates lawn avoidance as a factor in consumption. Fluorescent intensity was significantly higher in the DKO worms in media lacking ZnMP, indicating increased autofluorescence in DKO worms, while signal was not significantly different in media with ZnMP.

      (3) In the legend to Figure 1A it's a bit unclear what the differently coloured dots represent for each condition. Repeated measurements, worms, independent experiments? The authors should clarify this.

      The following sentence has been added to the legend for Figure 1:

      “Each dot represents the number of offspring laid by one adult worm on one GaPP-treated plate after 24 hours.”

      (4) It would help if the entire fluorescence images (raw and processed) for the ZnMP treatments were provided. Fluorescence images would also benefit Figure 1B.

      Fluorescent intensity values pertaining to the ZnMP experiments are included in our Extended Data supplement, and we have added representative images to Figure 1, per the reviewer’s request. We thank the reviewer for this helpful suggestion. We would be happy to upload raw images to an open-access repository if deemed necessary by the editorial team.

      (5) Increasingly, the understanding of heme-dependent roles relies on transient or indirect binding to unsuspected partners, not necessarily relying on a tight affinity and outdating the notion of heme as a static cofactor. Despite impressive recent advancements in the detection of these interactions (for example https://doi.org/10.1021/jacs.2c06104; cited by the authors), a full characterisation of the hemome is still elusive. Sandkuhler et al. deemed it possible but seem to question that heme binding to TANGO2 occurs. However, Sun et al. convincingly showed and characterised TANGO2 binding to heme. It is recommended that the authors comment on this.

      We believe it is plausible that TANGO2 binds heme (as do hundreds of other proteins), especially as it has been shown to bind other hydrophobic molecules. However, we also note that a separate paper examining the role of TANGO2 in heme transport posited that GAPDH is the sole heme binding partner for cytoplasmic transport (https://doi.org/10.1038/s41467-025-62819-2), contradicting the originally posited theory of how TANGO2 functions. This is described in the Discussion section and, as noted above, we have added an additional figure to demonstrate our unifying hypothesis for why TANGO2 may be important in the low-heme state, irrespective of any direct effect on heme trafficking.

      Additional comments and revisions:

      (1) It was suggested that a triple mutant (eat-2; hrg-9; hrg-10) be tested to determine the primary driver of GaPP toxicity. We appreciate this suggestion, but we offer the following rationale for why these experiments were not pursued. The eat-2 mutant, which lacks a nicotinic acetylcholine receptor subunit in pharyngeal muscles, was included solely as a dietary restriction control to illustrate that reduced GaPP toxicity in the hrg-9/10 double mutant could arise from poor feeding rather than defective heme transport. Both eat-2 and hrg-9/10 mutants exhibit markedly reduced feeding but via different mechanisms. In our assays, GaPP survival was inversely correlated with ingestion rate: eat-2 animals, which feed the least, showed the highest survival, while hrg-9/10 mutants showed intermediate feeding and intermediate survival. Consistent with this, eat-2 worms also displayed the lowest ZnMP accumulation.

      (2) GaPP solution was added to NGM plates after seeding with OP50. This is now expressly stated in the Methods section (page 15). We would note that Sun et al. mixed GaPP in with NGM in the liquid phase. We would expect that if there were a difference in GaPP exposure due to these different protocols, worms in our experiment would have received higher GaPP concentrations.

      “Standard NGM plates were treated with 1, 2, 5, or 10 µM gallium protoporphyrin IX (GaPP; Santa Cruz) after seeding with OP50. Plates were swirled to ensure an even distribution of GaPP and allowed to dry completely.”

      (3) The manuscript has been reworked to read as more of an independent study rather than a rebuttal of prior work, though the primary objective of validating prior work remains unchanged.

      (4) Several technical details of experiments have been moved from the main text to the materials and methods section.

      (5) One reviewer noted that the figure numbering should be adjusted. Numbering does not progress sequentially (i.e., 1A…1B…2A…2B) early in the text, because we have opted to consolidate data pertaining to heme analog experiments in Figure 1 and behavioral data in Figure 2.

      (6) “Kingdoms” has been changed to “domains” (page 4).

      (7) Example images are now included for Figure 1B, as noted above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This study introduces an important approach using selection linked integration (SLI) to generate Plasmodium falciparum lines expressing single, specific surface adhesins PfEMP1 variants, enabling precise study of PfEMP1 trafficking, receptor binding, and cytoadhesion. By moving the system to different parasite strains and introducing an advanced SLI2 system for additional genomic edits, this work provides compelling evidence for an innovative and rigorous platform to explore PfEMP1 biology and identify novel proteins essential for malaria pathogenesis including immune evasion.

      Reviewer #1 (Public review):

      One of the roadblocks in PfEMP1 research has been the challenges in manipulating var genes to incorporate markers to allow the transport of this protein to be tracked and to investigate the interactions taking place within the infected erythrocyte. In addition, the ability of Plasmodium falciparum to switch to different PfEMP1 variants during in vitro culture has complicated studies due to parasite populations drifting from the original (manipulated) var gene expression. Cronshagen et al have provided a useful system with which they demonstrate the ability to integrate a selectable drug marker into several different var genes that allows the PfEMP1 variant expression to be 'fixed'. This on its own represents a useful addition to the molecular toolbox and the range of var genes that have been modified suggests that the system will have broad application. As well as incorporating a selectable marker, the authors have also used selective linked integration (SLI) to introduce markers to track the transport of PfEMP1, investigate the route of transport, and probe interactions with PfEMP1 proteins in the infected host cell.

      What I particularly like about this paper is that the authors have not only put together what appears to be a largely robust system for further functional studies, but they have used it to produce a range of interesting findings including:

      Co-activation of rif and var genes when in a head-to-head orientation.

      The reduced control of expression of var genes in the 3D7-MEED parasite line.

      More support for the PTEX transport route for PfEMP1.

      Identification of new proteins involved in PfEMP1 interactions in the infected erythrocyte, including some required for cytoadherence.

      In most cases the experimental evidence is straightforward, and the data support the conclusions strongly. The authors have been very careful in the depth of their investigation, and where unexpected results have been obtained, they have looked carefully at why these have occurred.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      (1) In terms of incorporating a drug marker to drive mono-variant expression, the authors show that they can manipulate a range of var genes in two parasite lines (3D7 and IT4), producing around 90% expression of the targeted PfEMP1. Removal of drug selection produces the expected 'drift' in variant types being expressed. The exceptions to this are the 3D7-MEED line, which looks to be an interesting starting point to understand why this variant appears to have impaired mutually exclusive var gene expression and the EPCR-binding IT4var19 line. This latter finding was unexpected and the modified construct required several rounds of panning to produce parasites expressing the targeted PfEMP1 and bind to EPCR. The authors identified a PTP3 deficiency as the cause of the lack of PfEMP1 expression, which is an interesting finding in itself but potentially worrying for future studies. What was not clear was whether the selected IT4var19 line retained specific PfEMP1 expression once receptor panning was removed.

      We do not have systematic long-term data for the Var19 line but do have medium-term data. After panning the Var19 line, the binding assays were done within 3 months without additional panning. The first binding assay was 2 months after the panning and the last binding assays three weeks later, totaling about 3 months without panning. While there is inherent variation in these assays that precludes detection of smaller changes, the last assay showed the highest level of binding, giving no indication for rapid loss of the binding phenotype. Hence, we can say that the binding phenotype appears to be stable for many weeks without panning the cells again and there was no indication for a rapid loss of binding in these parasites.

      Systematic long-term experiments to assess how long the Var19 parasites retain binding would be interesting, but given that the binding-phenotype appears to remain stable over many weeks or even months, this would only make sense if done over a much longer time frame. Such data might arise if the line is used over extended times for a specific project in which case it might be advisable to monitor continued binding. We included a statement in the discussion that the binding phenotype was stable over many weeks but that if long-term work with this line is planned, monitoring the binding phenotype might be advisable: “In the course of this work the binding phenotype of the IT4var19 expressor line remained stable over many weeks without further panning. However, given that initial panning had been needed for this particular line, it might be advisable for future studies to monitor the binding phenotype if the line is used for experiments requiring extended periods of cultivation.”

      (2) The transport studies using the mDHFR constructs were quite complicated to understand but were explained very clearly in the text with good logical reasoning.

      We are aware of this being a complex issue and are glad this was nevertheless understandable.

      (3) By introducing a second SLI system, the authors have been able to alter other genes thought to be involved in PfEMP1 biology, particularly transport. An example of this is the inactivation of PTP1, which causes a loss of binding to CD36 and ICAM-1. It would have been helpful to have more insight into the interpretation of the IFAs as the anti-SBP1 staining in Figure 5D (PTP-TGD) looks similar to that shown in Figure 1C, which has PTP intact. The anti-EXP2 results are clearly different.

      We realize the description of the PTP1-TGD IFA data and that of the other TGDs (see also response to Recommendation to authors point 4 and reviewer 2, major points 6 and 7) was rather cursory. The previously reported PTP1 phenotype is a fragmentation of the Maurer’s clefts into what in IFA appear to be many smaller pieces (Rug et al 2014, referenced in the manuscript). The control in Fig. 5D has 13 Maurer’s cleft spots (previous work indicates an average of ~15 MC per parasite, see e.g. the originally co-submitted eLife preprint doi.org/10.7554/eLife.103633.1 and references therein). The control mentioned by the reviewer in Fig. 1C has about 22 Maurer’s clefts foci, at the upper end of the typical range, but not unusual. In contrast, the PTP1-TGD in Fig. 5D, has more than 30 foci with an additional cytoplasmic pool and additional smaller, difficult to count foci. This is consistent with the published phenotype in Rug et al 2014. The EXP1 stained cell has more than 40 Maurer’s cleft foci, again beyond what typically is observed in controls. Therefore, these cells show a difference to the control in Fig. 5 but also to Fig. 1C. Please note that we are looking at two different strains, in Fig. 1 it is 3D7 and in Fig. 5 IT4. While we did not systematically assess this, the Maurer’s clefts number per cell seemed to be largely comparable between these strains (Fig. 10C and D in the other eLife preprint doi.org/10.7554/eLife.103633.1). 

      Overall, as the PTP1 loss phenotype has already been reported, we did not go into more experimental detail. However, we now modified the text to more clearly describe how the phenotype in the PTP1-TGD parasites was different to control: “IFAs showed that in the PTP1-TGD parasites, SBP1 and PfEMP1 were found in many small foci in the host cell that exceeded the average number of ~ 15 Maurer’s clefts typically found per infected RBC [66] (Fig. 5D). This phenotype resembled the previously reported Maurer’s clefts phenotype of the PTP1 knock out in CS2 parasites [39].”

      (4) It is good to see the validation of PfEMP1 expression includes binding to several relevant receptors. The data presented use CHO-GFP as a negative control, which is relevant, but it would have been good to also see the use of receptor mAbs to indicate specific adhesion patterns. The CHO system is fine for expression validation studies, but due to the high levels of receptor expression on these cells, moving to the use of microvascular endothelial cells would be advisable. This may explain the unexpected ICAM-1 binding seen with the panned IT4var19 line.

      We agree with the reviewer that it is desirable to have better binding systems for studying individual binding interactions. As the main purpose of this paper was to introduce the system and provide proof of principle that the cells show binding, we did not move to more complicated binding systems. However, we would like to point out that the CSA binding was done on receptor alone in addition to the CSA-expressing HBEC-5i cells and was competed successfully with soluble CSA. In addition, apart from the additional ICAM-1 binding of the Var19 line, all binding phenotypes conformed to expectations. We therefore hope the tools used for binding studies are acceptable at this stage of introducing the system, while future work interested in specific PfEMP1 receptor interactions may use better systems tailored to the specific question (e.g. endothelial organoid models and engineered human capillaries and inhibitory antibodies or relevant recombinant domains for competition).

      (5) The proxiome work is very interesting and has identified new leads for proteins interacting with PfEMP1, as well as suggesting that KAHRP is not one of these. The reduced expression seen with BirA* in position 3 is a little concerning but there appears to be sufficient expression to allow interactions to be identified with this construct. The quantitative impact of reduced expression for proxiome experiments will clearly require further work to define it.

      This is a valid point. Clearly there seems to be some impact on binding when BirA* is placed in the extracellular domain (either through reduced presentation or direct reduction of binding efficiency of the modified PfEMP1; please see also minor comment 10 reviewer 2). The exact quantitative impact on the proxiome is difficult to assess but we note that the relative enrichment of hits to each other is rather similar to the other two positions (Fig. 6H-J). We therefore believe the BioIDs with the 3 PfEMP1-BirA* constructs are sufficient to provide a general coverage of proteins proximal to PfEMP1 and hope this will aid in the identification of further proteins involved in PfEMP1 transport and surface display as illustrated with two of the hits targeted here.

      The impact of placing a domain on the extracellular region of PfEMP1 will have to be further evaluated if needed in other studies. But the finding that a large folded domain can be placed into this part at all, even if binding was reduced, in our opinion is a success (it was not foreseeable whether any such change would be tolerated at all).

      (6) The reduced receptor binding results from the TryThrA and EMPIC3 knockouts were very interesting, particularly as both still display PfEMP1 on the surface of the infected erythrocyte. While care needs to be taken in cross-referencing adhesion work in P. berghei and whether the machinery truly is functionally orthologous, it is a fair point to make in the discussion. The suggestion that interacting proteins may influence the "correct presentation of PfEMP1" is intriguing and I look forward to further work on this.

      We hope future work will be able to shed light on this.

      Overall, the authors have produced a useful and reasonably robust system to support functional studies on PfEMP1, which may provide a platform for future studies manipulating the domain content in the exon 1 portion of var genes. They have used this system to produce a range of interesting findings and to support its use by the research community. Finally, a small concern. Being able to select specific var gene switches using drug markers could provide some useful starting points to understand how switching happens in P. falciparum. However, our trypanosome colleagues might remind us that forcing switches may show us some mechanisms but perhaps not all.

      Point noted! From non-systematic data with the Var01 line that has been cultured for extended periods of time (several years), it seems other non-targeted vars remain silent in our SLI “activation” lines, but how much SLI-based var-expression “fixing” tampers with the integrity of natural switching mechanisms is indeed very difficult to gauge at this stage. We now added a statement to the discussion that even if mutually exclusive expression is maintained, it is not certain the mechanisms controlling var expression all remain intact: “However, it should be noted that it is not known whether all mechanisms controlling mutually exclusive expression and switching remain intact in parasites with SLI-activated var genes.”

      Reviewer #2 (Public review):

      Summary

      Cronshagen et al develop a range of tools based on selection-linked integration (SLI) to study PfEMP1 function in P. falciparum. PfEMP1 is encoded by a family of ~60 var genes subject to mutually exclusive expression. Switching expression between different family members can modify the binding properties of the infected erythrocyte while avoiding the adaptive immune response. Although critical to parasite survival and malaria disease pathology, PfEMP1 proteins are difficult to study owing to their large size and variable expression between parasites within the same population. The SLI approach previously developed by this group for genetic modification of P. falciparum is employed here to selectively and stably activate the expression of target var genes at the population level. Using this strategy, the binding properties of specific PfEMP1 variants were measured for several distinct var genes with a novel semi-automated pipeline to increase throughput and reduce bias. Activation of similar var genes in both the common lab strain 3D7 and the cytoadhesion competent FCR3/IT4 strain revealed higher binding for several PfEMP1 IT4 variants with distinct receptors, indicating this strain provides a superior background for studying PfEMP1 binding. SLI also enables modifications to target var gene products to study PfEMP1 trafficking and identify interacting partners by proximity-labeling proteomics, revealing two novel exported proteins required for cytoadherence. Overall, the data demonstrate a range of SLI-based approaches for studying PfEMP1 that will be broadly useful for understanding the basis for cytoadhesion and parasite virulence.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      Comments

      (1) While the capability of SLI to actively select var gene expression was initially reported by Omelianczyk et al., the present study greatly expands the utility of this approach. Several distinct var genes are activated in two different P. falciparum strains and shown to modify the binding properties of infected RBCs to distinct endothelial receptors; development of SLI2 enables multiple SLI modifications in the same parasite line; SLI is used to modify target var genes to study PfEMP1 trafficking and determine PfEMP1 interactomes with BioID. Curiously, Omelianczyk et al activated a single var (Pf3D7_0421300) and observed elevated expression of an adjacent var arranged in a head-to-tail manner, possibly resulting from local chromatin modifications enabling expression of the neighboring gene. In contrast, the present study observed activation of neighboring genes with head-to-head but not head-to-tail arrangement, which may be the result of shared promoter regions. The reason for these differing results is unclear although it should be noted that the two studies examined different var loci.

      The point that we are looking at different loci is very valid and we realize this is not mentioned in the discussion. We now added to the discussion that it is unclear if our results and those cited may be generalized and that different var gene loci may respond differently:

      “However, it is unclear if this can be generalized and it is possible that different var loci respond differently.”

      (2) The IT4var19 panned line that became binding-competent showed increased expression of both paralogs of ptp3 (as well as a phista and gbp), suggesting that overexpression of PTP3 may improve PfEMP1 display and binding. Interestingly, IT4 appears to be the only known P. falciparum strain (only available in PlasmoDB) that encodes more than one ptp3 gene (PfIT_140083100 and PfIT_140084700). PfIT_140084700 is almost identical to the 3D7 PTP3 (except for a ~120 residue insertion in 3D7 beginning at residue 400). In contrast, while the C-terminal region of PfIT_140083100 shows near-perfect conservation with 3D7 PTP3 beginning at residue 450, the N-terminal regions between the PEXEL and residue 450 are quite different. This may indicate the generally stronger receptor binding observed in IT4 relative to 3D7 results from increased PTP3 activity due to multiple isoforms or that specialized trafficking machinery exists for some PfEMP1 proteins.

      We thank the reviewer for pointing this out. The exact differences between the two PTP3s of IT4 and that of other strains definitely should be closely examined if the function of these proteins in PfEMP1 binding is analysed in more detail.

      It is an interesting idea that the PTP3 duplication could be a reason for the superior binding of IT4. We always assumed that IT4 had better binding because it was less culture adapted, but this does not preclude that PTP3(s) is(are) a reason for this. However, at least in our 3D7, PTP3 can’t be the reason for the poor binding, as our 3D7 still has PfEMP1 on the surface, while in the unpanned IT4-Var19 line and in the ptp3 KO of Maier et al., Cell 2008 (PMID: 18614010), PfEMP1 is not on the surface anymore.

      Testing the impact of having two PTP3s would be interesting, but given the “mosaic” similarity of the two PTP3 isoforms, a simple add-on experiment might not be informative. Nevertheless, it will be interesting in future work to explore this in more detail.

      Reviewer #3 (Public review):

      Summary:

      The submission from Cronshagen and colleagues describes the application of a previously described method (selection linked integration) to the systematic study of PfEMP1 trafficking in the human malaria parasite Plasmodium falciparum. PfEMP1 is the primary virulence factor and surface antigen of infected red blood cells and is therefore a major focus of research into malaria pathogenesis. Since the discovery of the var gene family that encodes PfEMP1 in the late 1990s, there have been multiple hypotheses for how the protein is trafficked to the infected cell surface, crossing multiple membranes along the way. One difficulty in studying this process is the large size of the var gene family and the propensity of the parasites to switch which var gene is expressed, thus preventing straightforward gene modification-based strategies for tagging the expressed PfEMP1. Here the authors solve this problem by forcing the expression of a targeted var gene by fusing the PfEMP1 coding region with a drug-selectable marker separated by a skip peptide. This enabled them to generate relatively homogenous populations of parasites all expressing tagged (or otherwise modified) forms of PfEMP1 suitable for study. They then applied this method to study various aspects of PfEMP1 trafficking.

      Strengths:

      The study is very thorough, and the data are well presented. The authors used SLI to target multiple var genes, thus demonstrating the robustness of their strategy. They then perform experiments to investigate possible trafficking through PTEX, they knock out proteins thought to be involved in PfEMP1 trafficking and observe defects in cytoadherence, and they perform proximity labeling to further identify proteins potentially involved in PfEMP1 export. These are independent and complementary approaches that together tell a very compelling story.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      Weaknesses:

      (1) When the authors targeted IT4var19, they were successful in transcriptionally activating the gene; however, they did not initially obtain cytoadherent parasites. To observe binding to ICAM-1 and EPCR, they had to perform selection using panning. This is an interesting observation and potentially provides insights into PfEMP1 surface display, folding, etc. However, it also raises questions about other instances in which cytoadherence was not observed. Would panning of these other lines have successfully selected for cytoadherent infected cells? Did the authors attempt panning of their 3D7 lines? Given that these parasites do export PfEMP1 to the infected cell surface (Figure 1D), it is possible that panning would similarly rescue binding. Likewise, the authors knocked out PTP1, TryThrA, and EMPIC3 and detected a loss of cytoadhesion, but they did not attempt panning to see if this could rescue binding. To ensure that the lack of cytoadhesion in these cases is not serendipitous (as it was when they activated IT4var19), they should demonstrate that panning cannot rescue binding.

      These are very important considerations. Indeed, we had repeatedly attempted to pan 3D7 when we failed to get the SLI-generated 3D7 PfEMP1 expressor lines to bind, but this had not been successful. The lack of binding had been a major obstacle that had held up the project and was only solved when we moved to IT4 which readily bound (apart from Var19 which was created later in the project). After that we made no further efforts to understand why 3D7 does not bind but the fact that PfEMP1 is on the surface indicates this is not a PTP3 issue because loss of PTP3 also leads to loss of PfEMP1 surface display. Also, as the parent 3D7 could not be panned, we assumed this issue is not easily fixed in the SLI var lines we made in 3D7.

      Panning the TGD lines: we see the reasoning for conducting panning experiments with the TGD lines. However, on second thought, we are unsure this should be attempted. The outcome might not be easily interpretable as at least two forces will contribute to the selection in panning experiments with TGD lines that do not bind anymore:

      Firstly, panning would work against the SLI of the TGD, resulting in a tug of war between the TGD-SLI and binding. This is because a small number of parasites will loop out the TGD plasmid (revert) and would normally be eliminated during standard culturing due to the SLI drug used for the TGD. These revertant cells would bind and the panning would enrich them. Hence, panning and SLI are opposed forces in the case of a TGD abolishing binding. It is unclear how strong this effect would be, but this would for sure lead to mixed populations that complicate interpretations. 

      The second selecting force is possible compensatory changes that restore binding. These can have different causes: (i) reversal of potential independent changes that may have occurred in the TGD parasites and that are in reality causing the binding loss (e.g. ptp3 loss or similar, the concern of the reviewer) or (ii) new changes that compensate for the loss of the TGD target (in this case the TGD is the cause of the binding loss, but a different change ameliorates it, for instance by increasing PfEMP1 expression or surface display). As both TGDs show some residual binding and have VAR01 on the surface to at least some extent, it is possible that new compensatory changes might indeed occur that indirectly increase binding again.

      In summary, even if more binding occurs after panning of the lines, it is not clear whether this is due to a compensatory change ameliorating the TGD, reversal of an unrelated change, or counter-selection against the SLI. To determine the cause, the panned TGD lines would need to be subjected to a complex and time-consuming analysis (WGS, RNASeq, possibly Maurer’s clefts phenotype) to find out whether they were SLI-revertants, had an unrelated change that was reverted, or had a new compensatory change that helps binding. This might be further muddled if the selection yields a mix of cells carrying different ones of the changes listed above. In that case, it might even require scRNASeq to make sense of the panning experiment. Due to the envisaged difficulty in interpreting the outcome, we did not attempt this panning.

      To exclude loss of ptp3 expression as the reason for binding loss (something we would not have seen in the WGS if it is only due to a transcriptional change), we now carried out RNASeq with the TGD lines that have a binding phenotype. While we did not generate replicates to obtain quantitative data, the results show that both ptp3 copies were expressed in these TGDs at levels comparable to those in other parasite lines that do bind with the same SLI-activated var gene, indicating that the effect is not due to ptp3 (see response to point 4 on PTP3 expression in the Recommendations for the authors). While we can’t fully exclude other changes in the TGDs that might affect binding, the WGS did not show any obvious alterations that could be responsible for this.

      (2) The authors perform a series of trafficking experiments to help discern whether PfEMP1 is trafficked through PTEX. While the results were not entirely definitive, they make a strong case for PTEX in PfEMP1 export. The authors then used BioID to obtain a proxiome for PfEMP1 and identified proteins they suggest are involved in PfEMP1 trafficking. However, it seemed that components of PTEX were missing from the list of interacting proteins. Is this surprising and does this observation shed any additional light on the possibility of PfEMP1 trafficking through PTEX? This warrants a comment or discussion.

      This is an interesting point and we agree that this warrants discussion. A likely reason why PTEX components are not picked up as interactors is that BirA* is expected to be unfolded when it passes through the channel and in that state can’t biotinylate. Labelling would likely only be possible if PfEMP1 lingered at the PTEX translocation step before BirA* became unfolded to go through the channel, which we would not expect under physiological conditions. We added the following sentences to the discussion: “While our data indicates PfEMP1 uses PTEX to reach the host cell, this could be expected to have resulted in the identification of PTEX components in the PfEMP1 proxiomes, which was not the case. However, as BirA* must be unfolded to pass through PTEX, it likely is unable to biotinylate translocon components unless PfEMP1 is stalled during translocation. For this reason, a lack of PTEX components in the PfEMP1 proxiomes does not necessarily exclude passage through PTEX.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Most of my comments are in the public section. I would just highlight a few things:

      (1) In the binding studies section you talk about "human brain endothelial cells (HBEC-5i)". These cells do indeed express CSA but this is a property of their immortalisation rather than being brain endothelium, which does not express CSA. I think this could be confusing to readers so I think you might want to reword this sentence to focus on the CSA-expressing cell line rather than other features.

      We thank the reviewer for pointing this out. We have now modified the sentence to focus on the fact that these are CSA-expressing cells and provided a reference for it.

      (2) As I said in the public section, CHO cells are great for proof of concept studies, but they are not endothelium. Not a problem for this paper.

      Noted! Please also see our response to the public review.

      (3) I wonder whether your comment about how well tolerated the BirA* insertion is may be a bit too strong. I might say "Nonetheless, overall the BirA* modified PfEMP1 were functional."

      Changed as requested.

      (4) I'm not sure how you explain the IFA staining patterns to the uninitiated, but perhaps you could explain some of the key features you are looking for.

      We apologise for not giving an explanation of the IFA staining patterns in the first place. Please see detailed response to public review of this reviewer (point 3 on PTP1-TGD phenotype) and to reviewer 2 (Recommendations to the authors, points 6 and 7 on better explaining and quantifying the Maurer’s clefts phenotypes). For this we now also generated parasites that episomally express mCherry tagged SBP1 in the TGD parasites with the reduced binding phenotype. This resulted in amendments to Fig. S7, addition of a Fig. S8 and updated results to better explain the phenotypes. 

      This is a great paper - I just wish I'd had this system before.

      Thank you!

      Reviewer #2 (Recommendations for the authors):

      Major Comments

      (1) Does the RNAseq analysis of 3D7var0425800 and 3D7MEEDvar0425800 (Figure 1G, H) reveal any differential gene expression that might suggest a basis for loss of mutually exclusive var expression in the MEED line?

      We now carried out a thorough analysis of these RNASeq experiments to look for an underlying cause for the phenotype. This was added as new Figure 1J and new Table S3. This analysis again illustrated the increased transcript levels of var genes. In addition, it showed that transcripts of a number of other exported proteins, including members of other gene families, were up in the MEED line. 

      One hit that might be causal of the phenotype was sip2, which was down by close to 8-fold (pAdj 0.025). While recent work in P. berghei found this ApiAP2 to be involved in the expression of merozoite genes (Nishi et al., Sci Advances 2025 (PMID: 40117352)), previous work in P. falciparum showed that it binds heterochromatic telomere regions and certain var upstream regions (Flück et al., PlosPath 2010 (PMID: 20195509), now cited in the manuscript). The other notable change was an upregulation of the non-coding RNA ruf6, which had been linked with impaired mono-allelic var expression (Guizetti et al., NAR 2016 (PMID: 27466391), now also cited in the manuscript). While it would go beyond this manuscript to follow this up, it is conceivable that alterations in chromosome end biology due to sip2 downregulation, or the upregulation of ruf6, are causes of the observed phenotype.

      We now added a paragraph on the more comprehensive analysis of the RNA Seq data of the MEED vs non-MEED lines at the end of the second results section.

      (2) Could the inability of the PfEMP1-mDHFR fusion to block translocation (Fig 2A) reflect unique features of PfEMP1 trafficking, such as the existence of a soluble, chaperoned trafficking state that is not fully folded? Was a PfEMP1-BPTI fusion ever tested as an alternative to mDHFR?

      This is an interesting suggestion. The PfEMP1-BPTI was never tested. However, a chaperoned trafficking state would likely also affect BPTI. Given that both domains (mDHFR and BPTI) in principle do the same when folded and would block when the construct is in the PV, it is not so likely that using a different blocking domain would make a difference. Therefore, a scenario in which BPTI blocks when mDHFR does not is not that probable. The opposite would be possible (mDHFR blocking while BPTI does not, because only the latter depends on the redox state). However, this would only happen if the block occurred before the construct reaches the PV.

      At present, we believe the lack of a block to be due to the organization of the domains in the construct. In the PfEMP1-mDHFR construct in this manuscript, the blocking domain is positioned further away from the TMD than in all other previously tested mDHFR fusions. Increased distance to the TMD has previously been found to be a factor impairing the blocking function of mDHFR (Mesen-Ramirez et al., PlosPath 2016 (PMID: 27168322)). Hence, we suspect that this, rather than the type of blocking domain, is the reason for the lack of a block with PfEMP1-mDHFR. However, the latter option can’t be fully excluded and we might test BPTI in future work.

      (3) The late promoter SBP1-mDHFR is 2A fused with the KAHRP reporter. Since 2A skipping efficiency varies between fusion contexts and significant amounts of unskipped protein can be present, it would be helpful to include a WB to determine the efficiency of skipping and provide confidence that the co-blocked KAHRP in the +WR condition (Fig 2D) is not actually fused to the C-terminus of SBP1-mDHFR-GFP.

      Fortunately, this T2A fusion (crt_SBP1-mDHFR-GFP-2A-KAHRP-mScarlet<sup>epi</sup>) was used before in work that included a Western blot showing its efficient skipping (S3 A Fig in Mesen-Ramirez et al., PlosPath 2016). In agreement with this Western blot result, fluorescence microscopy showed very limited overlap of SBP1-mDHFR-GFP and KAHRP-mCherry in absence of WR (Fig. 3B in Mesen-Ramirez et al., PlosPath 2016 and Fig. 2 in this manuscript), which would not be the case if these two constructs were fused together. Please note that KAHRP is known to transiently localize to the Maurer’s clefts before reaching the knobs (Wickham et al., EMBOJ 2001, PMID: 11598007), and therefore occasional overlap with SBP1 at the Maurer’s clefts is expected. However, we would expect much more overlap if a substantial proportion of the construct population were not skipped, and therefore the co-blocked KAHRP-mCherry in the +WR sample is unlikely to be due to inefficient skipping and attachment to SBP1-mDHFR-GFP.

      (4) Does comparison of RNAseq from the various 3D7 and IT4 lines in the study provide any insight into PTP3 expression levels between strains with different binding capacities? Was the expression level of ptp3a/b in the IT4var19 panned line similar to the expression in the parent or other activated IT4 lines? Could the expanded ptp3 gene number in IT4 indicate that specialized trafficking machinery exists for some PfEMP1 proteins (ie, IT4var19 requires the divergent PTP3 paralog for efficient trafficking)?

      PTP3 in the different IT4 lines that bind:

      In those parasite lines that did bind, the intrinsic variation in the binding assays, the different binding properties of different PfEMP1 variants and the variation in RNA Seq experiments to compare different parasite lines preclude a correlation of binding level vs ptp3 expression. For instance, if a PfEMP1 variant has lower binding capacity, ptp3 may still be higher but binding would be lower than if comparing to a parasite line with a better binding PfEMP1 variant. Studying the effect of PTP3 levels on binding could probably be done by overexpressing PTP3 in the same PfEMP1 SLI expressor line and assessing how this affects binding, but this would go beyond this manuscript.

      PTP3 in panned vs unpanned Var19:

      We did some comparisons between the IT4 parent and the panned and unpanned IT4-Var19 lines (see Author response table 1). This did not reveal any clear associations. While the parent had somewhat lower ptp3 transcript levels, they were still clearly higher than in the unpanned Var19 line, and other lines also had ptp3 levels comparable to the panned IT4-Var19 (see Author response table 2).

      PTP3 in the TGDs and possible reason for binding phenotype:

      A key point is whether PTP3 could have influenced the lack of binding in the TGD lines (see also weakness section and point 1 of public review of reviewer 3: ptp3 may be an indirect cause of the lack of binding in TGD parasites). We now did RNA Seq to check for ptp3 expression in the relevant TGD lines. We did not do a systematic quantitative comparison (which would require 3 replicates of RNASeq), as we reasoned that loss of expression would also be evident in one replicate. There was no indication that the TGD lines had lost PTP3 expression (see Author response table 2), so loss of PTP3, as seen in the Var19 parasites, is unlikely to explain the binding loss in these lines. Generally, the IT4 lines showed expression of both ptp3 genes and only in the Var19 parasites before panning were the transcript levels considerably lower:

      Author response table 1.

      Parent vs IT4-Var19 panned and unpanned

      Author response table 2.

      TGD lines with binding phenotype vs parent

      The absence of an influence of PTP3 on the binding phenotype in the cell lines in this manuscript (besides Var19) is further supported by its role in PfEMP1 surface display. Previous work has shown that KO of ptp3 leads to a loss of VAR2CSA surface display (Maier et al., Cell 2008). The unpanned Var19 parasite also lacked PfEMP1 surface display, and panning and the resulting appearance of the binding phenotype were accompanied by surface display of PfEMP1. As both the EMPIC3-TGD and TryThrA-TGD lines still had at least some PfEMP1 on the surface, this also (in addition to the RNA Seq above) speaks against PTP3 being the cause of the binding phenotype. The same applies to 3D7, which despite the poor binding displays PfEMP1 on the host cell surface (Figure 1D). This indicates that the binding phenotype in 3D7 is also not due to PTP3 expression loss, as this would have abolished PfEMP1 surface display.

      The idea about PTP3 paralogs for specific PfEMP1s is intriguing. In the future it might be interesting to test the frequency of parasites with two PTP3 paralogs in endemic settings and correlate it with the PfEMP1 repertoire, variant expression and potentially disease severity. 

      (5) The IT4var01 line shows substantially lower binding in Figure 5F compared with the data shown in Figure 4E and 6F. Does this reflect changes in the binding capacity of the line over time or is this variability inherent to the assay?

      There is some inherent variability in these assays. While we did not systematically assess this, we had no indication that this was due to the parasite line changing. The Var01 line was cultured for months and was frozen down and thawed more than once without a clear gradual trend for more or less binding. While we can’t exclude some variation from the parasite side, we suspect it is more a factor of the expression of the receptor on the CHO cells the iRBCs bind to. 

      Specifically, the assays in Fig. 6F and 4E mentioned by the reviewer both had an average binding to CD36 of around 1000 iE/mm². Only the experiments in Fig. 5F are different (~500 iE/mm²), but these were done with a different batch of CHO cells at a different time to the experiments in Fig. 6F and 4E.

      (6) In Figure S7A, TryThrA and EMPIC3 show distinct localization as circles around the PfEMP1 signal while PeMP2 appears to co-localize with PfEMP1 or as immediately adjacent spots (strong colocalization is less apparent than SBP1, and the various PfEMP1 IFAs throughout the study). Does this indicate that TryThrA and EMPIC3 are peripheral MC proteins? Does this have any implications for their function in PfEMP1 binding? Some discussion would help as these differences are not mentioned in the text. For the EMPIC3 TGD IFAs, localization of SBP1 and PfEMP1 is noted to be normal but REX1 is not mentioned (although this also appears normal).

      We apologise for the lacking description of the candidate localisations and cursory description of the Maurer’s clefts phenotypes (next point). Our original intent was to not distract too much from the main flow of the manuscript as almost every part of the manuscript could be followed up with more details. However, we fully agree that this is unsatisfactory and now provided more description (this point) and more data (next point).

      Localisation of TryThrA and EMPIC3 compared to PfEMP1 at the Maurer’s clefts: the circular pattern is reminiscent of the results with Maurer’s clefts proteins reported by McMillan et al using 3D-SIM in 3D7 parasites (McMillan et al., Cell Microbiology 2014 (PMID: 23421990)). In that work SBP1 and MAHRP1 (both integral TMD proteins) were found in foci but REX1 (no TMD) in circular structures around these foci similar to what we observed here for TryThrA and EMPIC3 which both also lack a TMD. The SIM data in McMillan et al indicated that also PfEMP1 is “more peripheral”, although it did only partially overlap with REX1. The conclusion from that work was that there are sub-compartments at the Maurer’s clefts. In our IFAs (Fig. S7A) PfEMP1 is also only partially overlapping with the TryThrA and EMPIC3 circles, potentially indicating similar subcompartments to those observed by 3D-SIM. We agree with the reviewer that this might be indicative of peripheral MC proteins, fitting with a lack of TMD in these candidates, but we did not further speculate on this in the manuscript.

      We now added enlargements of the ring-like structures to better illustrate this observation in Fig. S7A. In addition, we now specifically mention the localization data and the ring-like signal with TryThrA and EMPIC3 in the results and state that this may be similar to the observations by McMillan et al., Cell Microbiology 2014.

      We also thank the reviewer for pointing out that we had forgotten to mention REX1 in the EMPIC3-TGD; this was amended.

      (7) The atypical localization in TryThrA TGD line claimed for PfEMP1 and SBP1 in Fig S7B is not obvious. While most REX1 is clustered into a few spots in the IFA staining for SBP1 and REX1, SBP1 is only partially located in these spots and appears normal in the above IFA staining for SBP1 and HA. The atypical localization of PfEMP1-HA is also not obvious to me. The authors should clarify what is meant by "atypical" localization and provide support with quantification given the difference between the two SBP1 images shown.

      We apologise for the inadequate description of these IFA phenotypes. The abnormal signal for SBP1, REX1 and PfEMP1 in the TryThrA-TGD included two phenotypes found with all 3 proteins: 

      (1) a dispersed signal for these proteins in the host cell in addition to foci (the control and the other TGD parasites have only dots in the host cell with no or very little detectable dispersed signal). 

      (2) foci of disproportionally high intensity and size, that we assumed might be aggregation or enlargement of the Maurer’s clefts or of the detected proteins.

      The reason for the difference between the REX1 (aggregation) phenotype and the PfEMP1 and SBP1 (dispersed signal, more smaller foci) phenotypes in the images in Fig. S7B is that both phenotypes were seen with all 3 proteins but we chose a REX1 stained cell to illustrate the aggregation phenotype (the SBP1 signal in the same cell is similar to the REX1 signal, illustrating that this phenotype is not REX1 specific; please note that this cell also has a dispersed pool of REX1 and SBP1). 

      Based on the IFAs, 66% (n = 106 cells) of the cells in the TryThrA-TGD parasites had one or both of the observed phenotypes. We did not include this in the previous version of the manuscript because a description would have required detouring from the main focus of this results section. In addition, IFAs have some limitations for accurate quantifications, particularly for soluble pools (depending on fixing efficiency and agent, more or less of a soluble pool in the host cell can leak out).

      To answer the request to better explain and quantify the phenotype and given the limitations of IFA, we now transfected the TryThrA-TGD parasites with a plasmid mediating episomal expression of SBP1-mCherry, permitting live cell imaging and a better classification of the Maurer’s clefts phenotype. Due to the two SLI modifications in these parasites (using up 4 resistance markers) we had to use a new selection marker (mutated lactate transporter PfFNT, providing resistance to BH267.meta (Walloch et al., J. Med. Chem. 2020 (PMID: 32816478))) to transfect these parasites with an additional plasmid. 

      These results are now provided as Fig. S8 and detailed in the last results section. The new data show that the majority of the TryThrA-TGD parasites contain a dispersed pool of SBP1 in the host cell. About a third of the parasites also showed disproportionally strong SBP1 foci that may be aggregates of the Maurer’s clefts. We also transfected the EMPIC3-TGD parasites with the FNT plasmid mediating episomal SBP1-mCherry expression and observed only a few cells with a cytoplasmic pool or aggregates (Fig. S8). Overall these findings agree with the previous IFA results. As the IFA suggests similar results also for REX1 and PfEMP1, this defect is likely not SBP1 specific but more general (Maurer’s clefts morphology; association or transport of multiple proteins to the Maurer’s clefts). This gives a likely explanation for the cytoadherence phenotype in the TryThrA-TGD parasites. The reason for the EMPIC3-TGD phenotype remains to be determined, as we did not detect obvious changes in the Maurer’s clefts morphology or in the transport of proteins to these structures in these experiments.

      Minor comments

      (1) Italicized numbers in parentheses are present in several places in the manuscript but it is not clear what these refer to (perhaps differently formatted citations from a previous version of the manuscript). Figure 1 legend: (121); Figure S3 legend: (110), (111); Figure S6 legend: (66); etc.

      We thank the reviewer for pointing out this issue with the references; this was amended.

      (2) Figure 5A and legend: "BSD-R: BSD-resistance gene". Blasticidin-S (BS) is the drug while Blasticidin-S deaminase (BSD) is the resistance gene.

      We thank the reviewer for pointing this out; the legend and figure were changed.

      (3) Figure 5E legend: µ-SBP1-N should be α-SBP1-N.

      This was amended.

      (4) Figure S5 legend: "(Full data in Table S1)" should be Table S3.

      This was amended.

      (5) Figure S1G: The pie chart shows PF3D7_0425700 accounts for 43% of rif expression in 3D7var0425800 but the text indicates 62%.

      We apologize for this mistake; the text was corrected. We also improved the citations to Fig. S1G and H in this section.

      (6) "most PfEMP1-trafficking proteins show a similar early expression..." The authors might consider including a table of proteins known to be required for EMP1 trafficking and a graph showing their expression timing. Are any with later expressions known?

      Most exported proteins are expressed early, which is nicely shown in Marti et al 2004 (cited for the statement) in a graph of the expression timing of all PEXEL proteins (Fig. 4B in that paper). PNEPs also have a similar profile (Grüring et al 2011, also cited for that statement), further illustrated by using early expression as a criterion to find more PNEPs (Heiber et al., 2013 (PMID: 23950716)). Together this includes most if not all of the known PfEMP1 trafficking proteins. The originally co-submitted paper (Blancke-Soares & Stäcker et al., eLife preprint doi.org/10.7554/eLife.103633.1) analysed several later expressed exported proteins (Pf332, MSRP6) but their disruption, while influencing Maurer’s clefts morphology and anchoring, did not influence PfEMP1 transport. However, there are some conflicting results for Pf332 (referenced in Blancke-Soares & Stäcker et al). This illustrates that it may not be so easy to decide which proteins are bona fide PfEMP1 trafficking proteins. We therefore did not add a table and hope it is acceptable for the reader to rely on the provided 3 references to back this statement.

      (7) Figure S1J: The predominant var in the IT4 WT parent is var66 (which appears to be syntenic with Pf3D7_0809100, the predominant var in the 3D7 WT parent). Is there something about this locus or parasite culture conditions that selects for these vars in culture? Is this observed in other labs as well?

      This is a very interesting point (although we are not certain these vars are indeed syntenic, they are on different chromosomes). As far as we know, at least Pf3D7_0809100 is commonly a dominant var transcribed in other labs and was also found expressed in sporozoites (Zanghì et al. Cell Rep. 2018). However, it is unclear how uniform this really is. For IT4 we do not know in full, but here too we have commonly observed centromeric var genes to be the dominant transcripts in unselected parasite cultures. It is possible that transcription drifts to centromeric var genes in cultured parasites. However, given the anecdotal evidence, it is unknown to what extent this is related to an inherent switching and regulation regimen or is a consequence of faulty regulation following prolonged culturing.

      (8) Figure 4B, C: Presumably the asterisks on the DNA gels indicate non-specific bands but this is not described in the legend. Why are non-specific bands not consistent between parent and integrated lanes?

      We apologize for not mentioning this in the legend; this was amended.

      It is not clear why the non-specific bands differ between the lines but in part this might be due to different concentrations and quality of DNA preps. A PCR can also behave differently depending on whether the correct primer target is present or not. If present, the PCR will run efficiently and other spurious products will be outcompeted, but in absence of the correct target, they might become detectable.  

      Overall, we do not think the non-specific bands indicate anything untoward with the lines. For instance, in Fig. 4B the high band in the 5’ integration PCR in the IT4 line (which does not occur anywhere else) can’t be due to a genomic change, as this is the parental line and does not contain the plasmid for integration. In the same gel, the ori locus band of incorrect size (likely due to cross-reaction of the primers with another var gene, which is not always fully avoidable due to the high similarity of the ATS region) is present in both the parent IT4 and the integrant line and is therefore also not of concern. In C there are a couple of bands of incorrect size in the integration line. One of these is very faint and both are too large, so these are again likely other vars that are inefficiently picked up by these primers. The reason they are not seen in the parent line is that the correct primer binding site is present there and efficiently yields a product that outcompetes the products from non-optimally matching primer sites; these products therefore only appear in the integration line, where the correct match is no longer present. For these reasons we believe these bands are not of any concern.

      (9) Figure 4C: Is there a reason KAHRP was used as a co-marker for the IFA detecting IT4var19 expression instead of SBP1 which was used throughout the rest of the study?

      This is a coincidence, as this line was tested when other lines were tested for KAHRP. As there were foci in the host cell, we were satisfied that the HA-tagged PfEMP1 is produced and deemed the localization plausible.

      (10) Figure 6: Streptavidin labeling for the IT4var01-BirA position 3 line is substantially less than the other two lines in both IFA and WB. Does the position 3 fusion reduce PfEMP1 protein levels or is this a result of the context or surface display of the fusion? Interestingly, the position 3 trypsin cleavage product appears consistently more robust compared with the other two configurations. Does this indicate that positioning BirA upstream of the TM increases RBC membrane insertion and/or makes the surface localized protein more accessible to trypsin?

      It is possible that RBC membrane insertion or trypsin accessibility is increased for the position 3 construct. But there could also be other explanations:

      The reason for the more robustly detected protected fragment for the position 3 construct in the WB might also be its smaller size (in contrast to the other two versions, it does not contain BirA*) which might permit more efficient transfer to the WB membrane. In that case the more robust band might not (only) be due to better membrane insertion or better trypsin accessibility.

      The lower biotinylation signal with the position 3 construct might also be explained by the farther distance of BirA* to the ATS (compared to position 1 and 2), the region where interactors are expected to bind. The position 1 and 2 constructs may therefore generally be more efficient (as closer) to biotinylate ATS proximal proteins. Further, in the final destination (PfEMP1 inserted into the RBC membrane) BirA* would be on the other side of the membrane in the position 3 construct while in the position 1 and 2 constructs BirA* would be on the side of the membrane where the ATS anchors PfEMP1 in the knob structure. In that case, labelling with position 3 would come from interactions/proximities during transport or at the Maurer’s clefts (if there indeed PfEMP1 is not membrane embedded) and might therefore be less.

      Hence, while alterations in trypsin accessibility and RBC membrane insertion are possible explanations, other explanations exist. At present, we do not know which of these explanations apply and therefore did not mention any of them in the manuscript. 

      Reviewer #3 (Recommendations for the authors):

      (1) In the abstract and on page 8, the authors mention that they generate cell lines binding to "all major endothelial receptors" and "all known major receptors". This is a pretty all-encompassing statement that might not be fully accepted by others who have reported binding to other receptors not considered in this paper (e.g. VCAM, TSP, hyaluronic acid, etc). It would be better to change this statement to something like "the most common endothelial receptors" or "the dominant endothelial receptors", or something similar.

      We agree with the reviewer that these statements are too all-encompassing and changed them to “the most common endothelial receptors” (introduction) and “the most common receptors” (results).

      (2) The authors targeted two rif genes for activation and in each case the gene became the most highly expressed member of the family. However, unlike var genes, there were other rif genes also expressed in these lines and the activated copy did not always make up the majority of rif mRNAs. The authors might wish to highlight that this is inconsistent with mutually exclusive expression of this gene family, something that has been discussed in the past but not definitively shown.

      We thank the reviewer for highlighting this. We have now added the following statement to this section: “While SLI-activation of rif genes also led to the dominant expression of the targeted rif gene, other rif genes still took up a substantial proportion of all detected rif transcripts, speaking against a mutually exclusive expression in the manner seen with var genes.”

      (3) In Figure 6, H-J, the authors display volcano plots showing proteins that are thought to interact with PfEMP1. These are labeled with names from the literature, however, several are named simply "1, 2, 3, 4, 5, or 6". What do these numbers stand for?

      We apologize for not clarifying this and thank the reviewer for pointing this out. There is a legend for the numbered proteins in what is now Table S4 (previously Table S3). We have now amended the legend of Figure 6 to explain the numbers and point the reader to Table S4 for the accessions.

    1. A reader may not have experienced similar life circumstances as yours, but that doesn’t mean the reader won’t be able to identify emotionally with what you and your characters go through. Human strife is human strife.

      very important to keep in mind because sometimes we think that our personal experiences aren't relatable.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Colorectal cancer (CRC) is the third most common cancer globally and the second leading cause of cancer-related deaths. Colonoscopy and fecal immunochemical testing are among the early diagnostic tools that have significantly enhanced patient survival rates in CRC. Methylation dysregulation has been identified in the earliest stages of CRC, offering a promising avenue for screening, prediction, and diagnosis. The manuscript entitled "Early Diagnosis and Prognostic Prediction of Colorectal Cancer through Plasma Methylation Regions" by Zhu et al. presents a panel of genes with methylation patterns derived from cfDNA (27 DMRs) that serves as a noninvasive detection method for CRC early diagnosis and prognosis.

      Strengths:

      The authors provided evidence that the 27 DMRs pattern worked well in predicting CRC distant metastasis, and the methylation score remarkably increased in stage III-IV.

      Weaknesses:

      The major concerns are the design of DMR screening, the relatively low sensitivity of this DMR pattern in detecting early-stage CRC, the limited size of the cohorts, and the lack of comparison with the traditional diagnosis test.

      We sincerely thank the reviewer for their thorough evaluation and constructive feedback on our manuscript. We are encouraged that the reviewer found our 27-DMR panel promising for predicting distant metastasis and for its performance in late-stage CRC. We have carefully considered the weaknesses pointed out and have made revisions to address these concerns, which we believe have significantly strengthened our paper.

      We agree with the reviewer that achieving high sensitivity for early-stage disease is the ultimate goal for any noninvasive screening test. Detecting the minute quantities of cfDNA shed from early-stage tumors is a well-recognized challenge in the field. Although the sensitivity of our current panel for early-stage CRC is modest, its core strengths lie in its capability to also detect advanced adenomas and its excellent performance in assessing CRC metastasis and prognosis. Furthermore, we have now added a direct comparative analysis of our 27-DMR panel against the most widely used clinical serum biomarker for CRC, carcinoembryonic antigen (CEA), using samples from the same patient cohorts. Our results demonstrate that the 27-DMR methylation score significantly outperforms CEA in diagnostic accuracy for early-stage CRC (64% vs. 18%) (Table s7). In the Discussion section, we have also acknowledged our limitations and suggested that future studies are warranted to combine the cfDNA methylation model with commonly used clinical markers, such as CEA and CA19-9, with the aim of improving the sensitivity for early diagnosis.

      We acknowledge the reviewer's concern regarding the cohort size and agree that validation in larger, prospective, multi-center cohorts is essential before this panel can be considered for clinical application. We have explicitly stated this as a limitation of our study in the Discussion section and have highlighted the need for future large-scale validation studies (Page 18, Lines 367-373). We once again thank the reviewer for their insightful comments, which have allowed us to substantially improve our manuscript. We hope that the revised version is now suitable for publication.

      Reviewer #2 (Public review):

      This work presents a 27-region DMR model for early diagnosis and prognostic prediction of colorectal cancer using plasma methylation markers. While this non-invasive diagnostic and prognostic tool could interest a broad readership, several critical issues require attention.

      Major Concerns:

      (1) Inconsistencies and clarity issues in data presentation

      (a) Sample size discrepancies

      The abstract mentions screening 119 CRC tissue samples, while Figure 1 shows 136 tissues. Please clarify if this represents 119 CRC and 17 normal samples.

      We sincerely thank the reviewer for this careful observation and for pointing out the inconsistency. We apologize for the error and the confusion it caused. Regarding Figure 1: The reviewer is correct. The number 136 in the original Figure 1 was an error. This was due to an inadvertent double-counting of the tumor samples that were used in the differential analysis against adjacent normal tissues. The actual number of tissue samples used in this analysis is 89. We have now corrected this value in the revised Figure 1.

      Regarding the Abstract: The 119 CRC tissue samples mentioned in the abstract represent the total number of tissue samples analyzed across all stages of our study. This number is composed of two cohorts: the initial 15 pairs of tissues (30 samples) used for preliminary screening, and the subsequent 89 tissue samples used for validation, totaling 119 samples. We have ensured all sample numbers are now consistent throughout the revised manuscript.

      The plasma sample numbers vary across sections: the abstract cites 161 samples, Figure 1 shows 116 samples, and the Supplementary Methods mentions 77 samples (13 Normal, 15 NAA, 12 AA, 37 CRC).

      We sincerely thank the reviewer for their meticulous review and for identifying these inconsistencies in the plasma sample numbers. We apologize for this oversight and the lack of clarity.

      Figure 1 & Supplementary Methods (77 samples): The number 116 in the original Figure 1 was a clerical error. The correct number is 77, which is the cohort used for our differential methylation analysis. This number is now consistent with the Supplementary Methods. This cohort is composed of 13 Normal, 15 NAA, 12 AA, and 37 CRC samples. The figure has been revised accordingly.

      Abstract (161 samples): The total of 161 plasma samples mentioned in the abstract is the sum of two distinct sample sets used for different stages of our analysis: the 77 samples (13 Normal, 15 NAA, 12 AA, 37 CRC) used for the differential analysis, and an additional 84 samples (33 Normal, 51 CRC) that served as the training set for the LASSO regression model. We have now clarified these distinctions in the text and ensured consistency across the abstract, figures, and methods sections.
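      The reconciliation above is simple arithmetic, and a short check makes it easy to verify. The sketch below only re-adds the group sizes quoted in this response; the variable names are illustrative and are not taken from the authors' materials.

```python
# Group sizes as quoted in the response; dictionary names are illustrative.
differential_set = {"Normal": 13, "NAA": 15, "AA": 12, "CRC": 37}
training_set = {"Normal": 33, "CRC": 51}

# Sum each cohort, then combine them to get the abstract's plasma total.
differential_total = sum(differential_set.values())
training_total = sum(training_set.values())
plasma_total = differential_total + training_total

print(differential_total, training_total, plasma_total)  # prints "77 84 161"
```

      The two subtotals match the 77-sample and 84-sample sets described above, and their sum matches the 161 plasma samples cited in the abstract.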

      (b) Methodological inconsistencies

      The Supplementary Material reports 477 hypermethylated sites from TCGA data analysis (Δβ>0.20, FDR<0.05), but Figure 1 indicates 499 sites.

      The manuscript states that analyzing TCGA data across six cancer types identified 499 CRC-specific methylation sites, yet Figure 1 shows 477. Please also explain the rationale for selecting these specific cancer types from TCGA.

      We sincerely thank the reviewer for their sharp observation and for highlighting these inconsistencies. We apologize for this clerical error, which occurred when labeling the figure. The numbers 477 and 499 in Figure 1 were inadvertently swapped and the text in Supplementary Material is correct. We have now corrected this error throughout the manuscript to ensure clarity and consistency. We deeply regret the confusion this has caused.

      Regarding the rationale for selecting the cancer types:

      The selection of colorectal, esophageal, gastric, lung, liver, and breast cancers was based on the following strategic criteria to ensure the stringent identification of CRC-specific markers. Firstly, esophageal, gastric, liver, and colorectal cancers all originate from the gastrointestinal tract and share developmental and functional similarities. Comparing CRC against these closely related cancers allowed us to filter out general GI-tract-related methylation patterns and isolate those that are truly unique to colorectal tissue. Secondly, we included lung and breast cancer as they are two of the most common non-GI malignancies worldwide with distinct tissue origins. This helps ensure our identified markers are not just pan-cancer methylation events but are specific to CRC, even when compared against highly prevalent cancers from different lineages. Finally, these six cancer types have some of the largest and most complete datasets available in the TCGA database, including high-quality methylation data. This provided a robust statistical foundation for a reliable cross-cancer comparison. We hope this explanation clarifies our methodology. Thank you again for your valuable feedback.

      "404 CRC-specific DMRs" are mentioned in the main text, while Figure 1 shows "404 MCBs". The authors need to clarify whether these terms are interchangeable or how MCBs are defined.

      We sincerely thank the reviewer for pointing out this important inconsistency in terminology. We apologize for the confusion this has caused and for the error in Figure 1. The two terms are closely related in our study. The final 404 markers are technically DMRs that were identified through an analysis of MCBs. To avoid confusion, we have decided to unify the terminology. The manuscript has now been revised to consistently use "DMRs", which is the most accurate final descriptor. The label in Figure 1 has been corrected accordingly.

      (2) Methodological documentation

      The Results section requires a more detailed description of marker identification procedures and justification of methodological choices.

      Figure 3 panels need reordering for sequential citation.

      We thank the reviewer for this valuable suggestion. We agree that the original Results section lacked sufficient detail regarding the marker identification procedures and the justification for our methodological choices. To address this, we have substantially rewritten the "Methylation markers selection" subsection. This revised section provides a clear, step-by-step narrative of our marker discovery. The revised text now integrates the specific methodological details and statistical criteria. For instance, we now explicitly describe the three-pronged approach for the initial TCGA data mining and the specific criteria (Δβ, FDR, log2FC) for each, as well as the analysis methods used, such as the Wilcoxon test and LASSO regression. We believe this detailed narrative now provides the necessary description and justification for our methodological choices directly within the results, significantly improving the clarity and logical flow of our manuscript. This revision can be found on Pages 9-11 (Lines 180-195, 202-213). We hope these changes fully address the reviewer's concerns.
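      As an illustration of the kind of threshold filter referred to above, the sketch below applies the cutoffs quoted earlier in this exchange (Δβ > 0.20, FDR < 0.05) to a toy list of sites. It is not the authors' pipeline: the response does not state how the FDR was computed, so the Benjamini-Hochberg adjustment, the function names, and the toy values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_hypermethylated(sites, delta_beta_cutoff=0.20, fdr_cutoff=0.05):
    """Keep site IDs whose tumor-minus-normal beta difference and FDR pass.

    sites: list of (site_id, mean_beta_tumor, mean_beta_normal, p_value).
    """
    fdr = benjamini_hochberg([s[3] for s in sites])
    return [
        s[0]
        for s, q in zip(sites, fdr)
        if (s[1] - s[2]) > delta_beta_cutoff and q < fdr_cutoff
    ]


# Toy values, invented for illustration only.
toy_sites = [
    ("site_A", 0.70, 0.20, 0.001),
    ("site_B", 0.40, 0.30, 0.002),
    ("site_C", 0.65, 0.25, 0.200),
    ("site_D", 0.55, 0.30, 0.010),
]
selected = select_hypermethylated(toy_sites)
print(selected)  # prints "['site_A', 'site_D']"
```

      In this toy input, site_B fails the Δβ cutoff and site_C fails the FDR cutoff, so only site_A and site_D are retained.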

      We thank the reviewer for pointing out the citation order of the panels in Figure 3. This was a helpful suggestion for improving the clarity of our manuscript. We have now reordered the panels in Figure 3 to ensure they are cited sequentially within the text. These adjustments have been made in the "Development and validation of the CRC diagnosis model" subsection of the Results (Page 11, lines 224-230). We appreciate the reviewer's attention to detail.

      (3) Quality control and data transparency

      No quality control metrics are presented for the in-house sequencing data (e.g., sequencing quality, alignment rate, BS conversion rate, coverage, PCA plots for each cohort).

      The analysis code should be publicly available through GitHub or Zenodo.

      At a minimum, processed data should be made publicly accessible to ensure reproducibility.

      We sincerely thank the reviewer for their valuable and constructive feedback regarding quality control and data transparency. We fully agree that these elements are crucial for ensuring the robustness and reproducibility of our research. As the reviewer suggested, we have made all processed data and the key quality control metrics for each sample including sequencing quality scores, bisulfite (BS) conversion rates, and sequencing coverage publicly available to ensure the reproducibility of our findings. The analysis was performed using standard algorithms as detailed in the Methods section. While we are unable to host the code in a public repository at this time, all analysis scripts are available from the corresponding author upon reasonable request. The data has been deposited in the National Genomics Data Center (NGDC) and is accessible under the accession number OMIX009128. This information is now clearly stated in the "Data and Code Availability" section of the manuscript. We thank the reviewer again for pushing us to improve our manuscript in this critical aspect.

      Reviewer #3 (Public review):

      Summary:

      This article provides a model for early diagnosis and prognostic prediction of Colorectal Cancer and demonstrates its accuracy and usability. However, there are still some minor issues that need to be revised and paid attention to.

      Strengths:

      A large number of external datasets were used for verification, thus demonstrating robustness and accuracy. Meanwhile, various influencing factors of multiple samples were taken into account, providing usability.

      Weaknesses:

      There are notable language issues that hinder readability, and some key conclusions are missing.

      We are very grateful to the reviewer for their positive assessment of our study and for the constructive feedback provided. We are particularly encouraged that the reviewer recognized the strengths of our work, especially the robustness demonstrated through extensive external validation and the practical usability of our model. Regarding the weaknesses, we have taken the comments very seriously and have thoroughly revised the manuscript. We sincerely apologize for the language issues that hindered readability in our initial submission. To address this, the entire manuscript has undergone a comprehensive round of professional language polishing and editing. We have carefully reviewed and revised the text to improve clarity, flow, and grammatical accuracy. In addition, we agree that the conclusions could be stated more explicitly. To rectify this, we have substantially revised the final paragraph of the Discussion and the Conclusion section (Pages 14-18, lines 279-305, 319-334, 346-348, 358-360, 367-379). We now more clearly summarize the main findings of our study, emphasize the clinical significance and potential applications of our model, and provide clear take-home messages. We thank you again for your time and insightful comments, which have been invaluable in improving the quality of our paper. We hope the revised manuscript now meets the standards for publication.

      Reviewer #1 (Recommendations for the authors):

      Detailed comments are outlined below:

      (1) In this study, the authors have highlighted methylated cfDNA as a noninvasive approach for CRC early diagnosis. However, the small size of cohorts for plasma screening, particularly the number of NAA and AA samples, may cause bias in the selection of DMRs. This bias may lead to inappropriate DMRs for early diagnosis. Furthermore, a similar issue applies to the training set, which had a high percentage of late-stage CRC and included no AA or NAA samples. This absence may be the key factor in screening changed methylated cfDNA that can predict the early stages of CRC.

      We are very grateful to the reviewer for this insightful methodological critique. We agree that cohort composition and sample size are critical factors in the development of robust biomarkers, and we appreciate the opportunity to clarify our study design and the interpretation of our results.

      We agree with the reviewer that the number of precancerous lesion samples (NAA and AA) in our initial plasma screening cohort was limited. This is a valid point. However, it is important to contextualize the role of this step within our overall multi-stage marker selection funnel. The markers evaluated in this plasma cohort were not discovered from this small sample set alone. They were the result of a rigorous pre-selection process based on large-scale public TCGA data and our own tissue-level sequencing. This robust, tissue-based validation ensured that only the most promising CRC-specific markers were advanced for plasma testing. Therefore, while the plasma cohort was modest in size, its purpose was to confirm the circulatory detectability of markers already known to have a strong tissue-of-origin signal, thereby mitigating the potential bias from a smaller discovery set.

      Our primary aim was to first build a model that could robustly and accurately identify a definitive cancer-specific methylation signal. By training the model on clear-cut invasive cancer cases versus healthy controls, we could isolate the most powerful and specific markers for established malignancy. Our working hypothesis was that these strong cancer-specific methylation patterns are initiated during the precursor stages and would therefore be detectable, albeit at lower levels, in precancerous lesions. Unfortunately, the panel could only identify a limited proportion of precancerous lesions (48.4% in the NAA group and 52.2% in the AA group). We fully agree with the reviewer's sentiment that including a larger and more balanced set of precancerous lesions in future training cohorts could potentially optimize a model specifically for adenoma detection. We have now explicitly added this point to our Discussion section, highlighting it as an important direction for future research (Page 18, lines 367-373).

      (2) The sensitivities of the 27 DMRs in the external validation set (48.4%, 52.2% and 66.7% for NAA, AA and CRC 0-II, respectively) were much lower compared with previously published studies, like ColonES assay (DOI: 10.1016/j.eclinm.2022.101717) and ColonSecure test (DOI: 10.1186/s12943-023-01866-z). The 27 DMRs from the layered screening process did not show superior performance in a small population of an external validation cohort. Therefore, it is unlikely that this DMR pattern will be applicable to the general population in the future.

      We sincerely thank the reviewer for their insightful comments and for providing a thorough comparison with the highly relevant ColonES and ColonSecure assays. This has given us an important opportunity to clarify the unique contributions and specific clinical applications of our 27-DMR panel.

      We acknowledge the reviewer's point that the sensitivities of our panel for precancerous lesions (NAA: 48.4%, AA: 52.2%), while substantial, are numerically lower than those reported by the excellent ColonES assay (AA: 79.0%). However, it is important to clarify that while the ColonES and ColonSecure tests are outstanding benchmarks designed primarily for early detection and screening, the primary objective and contribution of our study were slightly different. Our model demonstrated an exceptional ability to predict distant metastasis with an AUC of 0.955 and a strong capacity for predicting overall prognosis with an AUC of 0.867. Our goal was to develop a multi-functional, biologically-rooted biomarker panel that not only contributes to early detection but, more importantly, provides crucial information for post-diagnosis patient management, including staging, risk stratification, and prognostication, from a single preoperative sample. We believe this ability to preoperatively identify high-risk patients who may require more aggressive treatment or intensive surveillance is the key contribution of our work. It provides a distinct clinical utility that complements, rather than directly competes with, pure screening assays.

      We agree with the reviewer that our external validation was performed on a limited cohort, and we have acknowledged this as a limitation in our Discussion section. However, the purpose of this validation was to provide a proof-of-concept for the panel's performance across its multiple functions. The promising and exceptionally high-performing results in the prognostic domain strongly warrant further validation in larger, prospective, multi-center cohorts.

      (3) The 27 DMRs pattern worked well in predicting CRC distant metastasis, and the methylation score remarkably increased in stage III-IV. In contrast, the increase of AA and 0-II groups was very mild in the validation cohort. This observation raises concerns regarding the study design, particularly in the context of the layered screening process and sample assigning.

      We sincerely thank the reviewer for this insightful and critical comment. We agree with the reviewer's observation that the methylation score increased more remarkably in late-stage (III-IV) CRC compared to the milder increase in adenoma (AA) and early-stage (0-II) CRC in the validation cohort. However, the observed pattern is biologically plausible and consistent with the nature of colorectal cancer progression. Carcinogenesis is a multi-step process involving the gradual accumulation of genetic and epigenetic alterations. The methylation changes we identified are likely associated with tumor progression and metastasis. Therefore, it is expected that advanced, metastatic cancers (Stage III-IV), which have undergone significant biological changes, would exhibit a much stronger and more robust methylation signal compared to pre-cancerous lesions (adenomas) or early-stage, non-metastatic cancers (Stage 0-II). The "mild" increase in early stages reflects the initial, more subtle epigenetic alterations, while the "remarkable" increase in late stages reflects the extensive changes required for invasion and metastasis. We believe this graduated increase actually strengthens the validity of our methylation signature, as it mirrors the underlying biological progression of the disease. We hope this response and the corresponding revisions address the reviewer's comments.

      (4) The authors did not compare the prediction efficacy of the 27 DMRs with other noninvasive CRC assays, such as CEA and FIT.

      Thank you for this valuable suggestion. We agree that comparing our model with established non-invasive assays is crucial for demonstrating its clinical potential. Following your advice, we have now included a direct comparison of the diagnostic performance between our model and the traditional tumor marker, carcinoembryonic antigen (CEA), using the external validation cohort. The results show that our model has a significantly higher sensitivity for detecting early-stage colorectal cancer and adenomas compared to CEA. This detailed comparison has been added as Table s7 in the supplementary materials, and the corresponding description has been incorporated into the Results section of our manuscript (Page 12, lines 234-236). Regarding the Fecal Immunochemical Test (FIT), we unfortunately could not perform a direct statistical comparison because very few individuals in our cohort had undergone FIT. A comparison based on such a small sample size would lack statistical power and might not yield meaningful conclusions. We have acknowledged this as a limitation of our study in the Discussion section. We believe these additions and clarifications have substantially strengthened our manuscript. Thank you again for your constructive feedback.

      (5) The authors did not explicitly describe how they assigned the plasma samples to the distinct sets, nor did they specify the criteria for the plasma screen set, training set, and validation set. The detailed information for the patient grouping should be listed.

      Response: Thank you for this essential feedback. We agree that a transparent and detailed description of the sample allocation process is crucial for the manuscript. We apologize for the previous lack of clarity and have now revised the Methods section to address this. Our patient cohorts were assigned to the screening, training, and validation sets based on a chronological splitting strategy. Specifically, samples were allocated consecutively based on the date of collection. This approach was chosen to minimize selection bias and to provide a more realistic, forward-looking assessment of the model's performance, simulating a prospective validation scenario. The screening set comprised 89 tissue samples and 77 plasma samples collected between June and December 2020. The primary purpose of this set was the initial discovery and screening of potential methylation markers. The training set and validation set included 165 plasma samples collected from December 2020 to July 2022. The external validation cohort comprised 166 plasma samples collected from July 2022 to December 2022. The subsection titled "Study design and samples" within the Methods section of the revised manuscript now contains all of this detailed information (Page 6, lines 116-133). We believe this detailed explanation now makes our study design clear and transparent. Thank you again for helping us improve our manuscript.
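      A chronological split of this kind can be sketched in a few lines. The example below is a minimal illustration, not the authors' procedure: the response gives only months, so the exact cut-off days, the sample IDs, and the function name are hypothetical.

```python
from datetime import date

# Hypothetical cut-offs: the response names only months, so the days are assumed.
SCREENING_END = date(2020, 12, 1)
TRAINING_END = date(2022, 7, 1)


def assign_cohort(collection_date):
    """Assign a sample to a cohort purely by its collection date."""
    if collection_date < SCREENING_END:
        return "screening"
    if collection_date < TRAINING_END:
        return "training/validation"
    return "external validation"


# Invented sample IDs and dates, for illustration only.
samples = [
    ("S001", date(2020, 6, 15)),
    ("S002", date(2021, 3, 2)),
    ("S003", date(2022, 9, 30)),
]
assignments = {sample_id: assign_cohort(d) for sample_id, d in samples}
print(assignments)
```

      Because cohort membership depends only on the collection date, later samples never influence how earlier cohorts are built, which is what lets the final cohort stand in for a prospective validation.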

      Reviewer #2 (Recommendations for the authors):

      The manuscript requires significant language editing to improve clarity and readability. We recommend that the authors seek professional editing services for revision.

      Thank you for your constructive comments on the language of our manuscript. We apologize for any lack of clarity in the previous version. To address this, we have performed a thorough revision of the manuscript. The text has been carefully reviewed and edited by a native English-speaking colleague who is an expert in our research field. We have focused on correcting all grammatical errors, improving sentence structure, and refining the phrasing throughout the document to enhance readability. We are confident that these extensive revisions have significantly improved the clarity of the manuscript. We hope you will find the current version much easier to read and understand.

      Reviewer #3 (Recommendations for the authors):

      (1) However, I think the abstract part of the article is too detailed and should be more concise and shortened. It is not necessary to show detailed values but to summarize the results.

      Thank you for this valuable suggestion. We agree that the previous version of the abstract was overly detailed and that a more concise summary would be more effective for the reader. Following your advice, we have substantially revised the abstract. We have removed the specific numerical values (such as detailed statistics) and have instead focused on summarizing the key findings and their broader implications (Page 3, lines 54-60, 64-66, 70-72). The revised abstract is now shorter and provides a clearer, high-level overview of our study's background, methods, main results, and conclusions. We believe these changes have significantly improved its readability and impact. We hope you will find the current version more appropriate.

      (2) Figure 4, the color in the legend and plot are not the same, and should be revised.

      Thank you for your careful attention to detail and for pointing out the color inconsistency in Figure 4. We apologize for this oversight. We have now corrected the figure as you suggested, ensuring that the colors in the legend perfectly match those in the plot. The revised Figure 4 has been updated in the manuscript. We appreciate your help in improving the quality of our figures.

      (3) Please pay attention to the article format, such as the consistency of fonts and punctuation marks (for example, Line 75 and Line 230).

      Thank you for your meticulous review and for pointing out the inconsistencies in our manuscript's formatting. We sincerely apologize for these oversights and any inconvenience they may have caused. Following your feedback, we have carefully corrected the specific issues you highlighted. Furthermore, we have conducted a thorough proofread of the entire manuscript to ensure consistency in all fonts, punctuation marks, and overall adherence to the journal's formatting guidelines. We appreciate your help in improving the presentation and professionalism of our paper.

    1. Author response:

      (1) General Statements

      We thank the Reviewers for a fair review of our work and helpful suggestions. We have significantly revised the manuscript in response to these suggestions. We provide a point-by-point response to the Reviewers below but wanted to highlight in our response a recurring concern related to the strong cell cycle arrest observed upon the acute FAM53C knock-down being different than the limited phenotypes in other contexts, including the knockout mice and DepMap data.

      First, we now show that we can recapitulate the strong G1 arrest resulting from the FAM53C knock-down using two independent siRNAs in RPE-1 cells, supporting the specificity of the effects.

      Second, the G1 arrest that results from the FAM53C knock-down is also observed in cells with inactive p53, suggesting it is not due to a non-specific stress response due to “toxic” siRNAs. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype.

      Third, we have performed experiments in other human cells, including cancer cell lines. As would be expected for cancer cells, the G1 arrest is less pronounced but is still significant, indicating that the G1 arrest is not unique to RPE-1 cells.

      Fourth, it is not unexpected that compensatory mechanisms would be activated upon loss of FAM53C during development or in cancer – which may explain the lack of phenotypes in vivo or upon long-term knockout. This has been true for many cell cycle regulators, either because of compensation by other family members that have overlapping functions, or by a larger scale rewiring of signaling pathways. 

      (2) Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity): 

      Summary: 

      Taylar Hammond and colleagues identified new regulators of the G1/S transition of the cell cycle.

      They did so by screening publicly available data from the Cancer Dependency Map, and identified FAM53C as a positive regulator of the G1/S transition. Using biochemical assays they then show that FAM53C interacts with the DYRK1A kinase to inhibit its function. DYRK1A in turn is known to induce degradation of cyclin D, leading the authors to propose a model in which DYRK1A-dependent cyclin D degradation is inhibited by FAM53C to permit S-phase entry. Finally the authors assess the effect of FAM53C deletion in a cortical organoid model, and in Fam53c knockout mice. Whereas proliferation of the organoids is indeed inhibited, mice show virtually no phenotype.

      Major comments: 

      The authors show convincing evidence that FAM53C loss can reduce S-phase entry in cell cultures, and that it can bind to DYRK1A. However, FAM53C has multiple other binding partners and I am not entirely convinced that negative regulation of DYRK1A is the predominant mechanism to explain its effects on S-phase entry. Some of the claims that are made based on the biochemical assays, and on the physiological effects of FAM53C are overstated. In addition, some choices made in methodology and data representation need further attention.

      (1) The authors do note that P21 levels increase upon FAM53C knockdown. They show convincing evidence that this is not a P53-dependent response. But the claim that "p21 upregulation alone cannot explain the G1 arrest in FAM53C-deficient cells" (lines 138-139) is misleading. A p53-independent p21 response could still be highly relevant. The authors could test if FAM53C knockdown inhibits proliferation after p21 knockdown or p21 deletion in RPE1 cells.

      The Reviewer raises a great point. Our initial statement needed to be clarified and also needed more experimental support. We have performed experiments where we knocked down FAM53C and p21 individually, as well as in combination, in RPE-1 cells. These experiments show that p21 knock-down is not sufficient to negate the cell cycle arrest resulting from the FAM53C knockdown in RPE-1 cells (Figure 4B,C and Figure S4C,D).

      We now extended these experiments to conditions where we inhibited DYRK1A, and we also compared these data to experiments in p53-null RPE-1 cells. Altogether, these experiments point to activation of p53 downstream of DYRK1A activation upon FAM53C knock-down, and indicate that p21 is not the only critical p53 target in the cell cycle arrest observed in FAM53C knock-down cells (Figure 4 and Figure S4).

      (2) The authors do not convincingly show that FAM53C acts as a DYRK1A inhibitor in cells. Figures 4B+C and S4B+C show extremely faint P-CycD1 bands, and tiny differences in ratios. The P values are hovering around 0.05, so n=3 is clearly underpowered here. Total CycD1 levels also correlate with FAM53C levels, which seems to affect the ratios more than the tiny pCycD1 bands. Why is there still a pCycD1 band visible in 4B in the GFP + BTZ + DYRK1Ai condition? And if I look at the data points I honestly don't understand how the authors can conclude from S4C that knockdown of FAM53C causes (DYRK1A-dependent) increases in pCycD1 (relative to total CycD1). In figure 5C, no blot scans are even shown, and again the differences look tiny. So the authors should either find a way to make these assays more robust, or alter their claims appropriately.

      We appreciate these comments from the Reviewer and have significantly revised the manuscript to address them.

      The analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We removed previous panel 4B from the revised manuscript. For panels 4E and S4B (now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      The representative Western blot images for 5C-D (now 5F-G) in the original submission are shown in Figure 5E; we apologize if this was not clear. The differences are small, which we acknowledge in the revised manuscript. Note that several factors can affect Cyclin D levels in cells, including the growth rate and the stage of the cell cycle. Our FACS analysis shows that normal organoids have ~63% of cells in G1 and ~13% in S phase; the overall lower proportion of S-phase cells in organoids may make the immunoblot difference appear smaller, with fewer cycling cells resulting in decreased Cyclin D phosphorylation.

      Nevertheless, the Reviewer brings up a good point and comments from this Reviewer and the others made us re-think how to best interpret our results. As discussed above, we re-read carefully the Meyer paper and think that FAM53C’s role and DYRK1A activity in cells may be understood when considering levels of both CycD and p21 at the same time in a continuum. While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is likely that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      (3) The experiments to test if DYRK1A inhibition could rescue the G1 arrest observed upon FAM53C knockdown are not entirely convincing either. It would be much more convincing if they also perform cell counting experiments as they have done in Figures 1F and 1G, to complement the flow cytometry assays. I suggest that the authors do these cell counting experiments in RPE1 +/- P53 cells as well as HCT116 cells. In addition, did the authors test if P21 is induced by DYRK1Ai in HCT116 cells? 

      We repeated the experiments with the DYRK1A inhibitor and counted the cells. In p53-null RPE1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells.

      (4) The data in Figure 5C and 5D are identical, although they are supposed to represent either pCycD1 ratios or p21 levels. This is a problem because at least one of the two cannot be true. Please provide the proper data and show (representative) images of both data types.

      We apologize for these duplicated panels in the original submission. We now replaced the wrong panel with the correct data (Fig. 5F,G). 

      (5) Line 246: "Fam53c knockout mice display developmental and behavioral defects." I don't agree with this claim. The mutant mice are born at almost the expected Mendelian ratios, the body weight development is not consistently altered. But more importantly, no differences in adult survival or microscopic pathology were seen. The authors put strong emphasis on the IMPC behavioral analysis, but they should be more cautious. The IMPC mouse cohorts are tested for many other phenotypes related to behavior and neurological symptoms and apparently none of these other traits were changed in the IMPC Fam53c-/- cohort. Thus, the decreased exploration in a new environment could very well be a chance finding. The authors need to take away claims about developmental and behavioral defects from the abstract, results and discussion sections; the data are just too weak to justify this.

      We agree with the Reviewer that, although we observed significant p-values, this original statement may not be appropriate in the biological sense. We made sure in the revised manuscript to carefully present these data.

      Minor comments: 

      (6) Can the authors provide a rationale for each of the proteins they chose to generate the list of the 38 proteins in the DepMap analysis? I looked at the list and it seems to me that they do not all have described functions in the G1/S transition. The analysis may thus be biased. 

      To address this point, we updated Table S1 (2nd tab) to provide a better rationale for the 38 factors chosen. Our focus was on the canonical RB pathway and we included RB binding proteins whose function had suggested they may also be playing a role in the G1/S transition. We do agree that there is some bias in this selection (e.g., there are more RB binding factors described) but we hope the Reviewer will agree with us that this list and the subsequent analysis identified expected factors, including FAM53C. Future studies using this approach and others will certainly identify new regulators of cell cycle progression.

      (7) Figure 1B is confusing to me. Are these just some (arbitrarily) chosen examples? Consider leaving this heatmap out altogether, or explain in more detail.

      We agree with the Reviewer that this panel was not necessarily useful and possibly in the wrong place, and we removed it from the manuscript. We replaced it with a cartoon of top hits in the screen.

      (8) The y-axes in Figures 2C, 2D, 2E, and 4D are misleading because they do not start at 0. Please let the axis start at 0, or make axis breaks. 

      We re-graphed these panels.

      (9) Line 229: " Consequences ... brain development." This subheader is misleading, because the in vitro cortical organoid system is a rather simplistic model for brain development, and far away from physiological brain development. Please alter the header. 

      We changed the header to “Consequences of FAM53C inactivation in human cortical organoids in culture”.

      (10) Figure S5F: the gating strategy is not clear to me. In particular, how do the authors know the difference between subG1 and G1 DAPI signals? Do they interpret the subG1 as apoptotic cells? If yes, why are there so many? Are the culturing or harvesting conditions of these organoids suboptimal? Perhaps the authors could consider doing IF stainings on EdU or BrdU on paraffin sections of organoids to obtain cleaner data?

      Thank you for your feedback. The subG1 population in the original Figure S5F represents cells that died during the dissociation step of the organoids for FACS analysis. To address this point, we performed live & dead staining to exclude dead cells and provide clearer data. We refined the gating strategy for better clarity in the new S5F panel.

      (11) Figure S6A; the labeling seems incorrect. I would think that red is heterozygous here, and grey mutant. 

      We fixed this mistake, thank you. 

      Reviewer #1 (Significance): 

      The finding that the poorly studied gene FAM53C controls the G1/S transition in cell lines is novel and interesting for the cell cycle field. However, the lack of phenotypes in Fam53c-/- mice makes this finding less interesting for a broader audience. Furthermore, the mechanisms are incompletely dissected. The importance of a p53-independent induction of p21 is not ruled out. And while the direct inhibitory interaction between FAM53C and DYRK1A is convincing (and also reported by others; PMID: 37802655), the authors do not (yet) convincingly show that DYRK1A inhibition can rescue a cell proliferation defect in FAM53C-deficient cells.

      Altogether, this study can be of interest to basic researchers in the cell cycle field. 

      I am a cell biologist studying cell cycle fate decisions, and adaptation of cancer cells & stem cells to (drug-induced) stress. My technical expertise aligns well with the work presented throughout this paper, although I am not familiar with biolayer interferometry. 

      Reviewer #2 (Evidence, reproducibility and clarity): 

      Summary 

      In this study Hammond et al. investigated the role of Dual-specificity Tyrosine Phosphorylation regulated Kinase 1A (DYRK1) in G1/S transition. By exploiting Dependency Map portal, they identified a previously unexplored protein FAM53C as potential regulator of G1/S transition. Using RNAi, they confirmed that depletion of FAM53C suppressed proliferation of human RPE1 cells and that this phenotype was dependent on the presence of the RB protein. In addition, they noted increased level of CDKN1A transcript and p21 protein that could explain G1 arrest of FAM53C-depleted cells but surprisingly, they did not observe activation of other p53 target genes. Proteomic analysis identified DYRK1 as one of the main interactors of FAM53C and the interaction was confirmed in vitro. Further, they showed that purified FAM53C blocked the ability of DYRK1 to phosphorylate cyclin D in vitro although the activity of DYRK1 was likely not inhibited (judging from the modification of FAM53C itself). Instead, it seems more likely that FAM53C competes with cyclin D in this assay. Authors claim that the G1 arrest caused by depletion of FAM53C was rescued by inhibition of DYRK1 but this was true only in cells lacking functional p53. This is quite confusing as DYRK1 inhibition reduced the fraction of G1 cells in p53 wild type cells as well as in p53 knock-outs, suggesting that FAM53C may not be required for regulation of DYRK1 function. Instead of focusing on the impact of FAM53C on cell cycle progression, authors moved towards investigating its potential (and perhaps more complex) roles in differentiation of IPSCs into cortical organoids and in mice. They observed a lower level of proliferating cells in the organoids but if that reflects an increased activity of DYRK1 or if it is just an off target effect of the genetic manipulation remains unclear. Even less clear is the phenotype in FAM53C knock-out mice. Authors did not observe any significant changes in survival nor in organ development but they noted some behavioral differences. Whether and how these are connected to the rate of cellular proliferation was not explored. In the summary, the study identified previously unknown role of FAM53C in proliferation but failed to explain the mechanism and its physiological relevance at the level of tissues and organism. Although some of the data might be of interest, in current form the data is too preliminary to justify publication.

      Major points 

      (1) Whole study is based on one siRNA to Fam53C and its specificity was not validated. Level of the knock down was shown only in the first figure and not in the other experiments. The observed phenotypes in the cell cycle progression may be affected by variable knock-down efficiency and/or potential off target effects. 

      We thank the Reviewer for raising this important point. First, we need to clarify that our experiments were performed with a pool of siRNAs (not one siRNA). Second, commercial antibodies against FAM53C are not of the best quality and it has been challenging to detect FAM53C using these antibodies in our hands – the results are often variable. In addition, to better address the Reviewer’s point and control for the phenotypes we have observed, we performed two additional series of experiments: first, we have confirmed G1 arrest in RPE-1 cells with individual siRNAs, providing more confidence for the specificity of this arrest (Fig. S1B); second, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (Fig. S1E,F and Fig. 4F).

      (2) Experiments focusing on the cell cycle progression were done in a single cell line RPE1 that showed a strong sensitivity to FAM53C depletion. In contrast, phenotypes in IPSCs and in mice were only mild suggesting that there might be large differences across various cell types in the expression and function of FAM53C. Therefore, it is important to reproduce the observations in other cell types. 

      As mentioned above, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (three cancer cell lines) (Fig. S1E,F and Fig. 4F).

      (3) Authors state that FAM53C is a direct inhibitor of DYRK1A kinase activity (Line 203), however this model is not supported by the data in Fig 4A. FAM53C seems to be a good substrate of DYRK1 even at high concentrations when phosphorylation of cyclin D is reduced. It rather suggests that DYRK1 is not inhibited by FAM53C but perhaps FAM53C competes with cyclin D. Further, authors should address if the phosphorylation of cyclin D is responsible for the observed cell cycle phenotype. Is this Cyclin D-Thr286 phosphorylation, or are there other sites involved?

      We revised the text of the manuscript to include the possibility that FAM53C could act as a competitive substrate and/or an inhibitor.

      We removed most of the Cyclin D phosphorylation/stability data from the revised manuscript. As the Reviewers pointed out, some of these data were statistically significant but the biological effects were small. As discussed above in our response to Reviewer #1, the analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knockdown, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We note, however, that we used specific Thr286 phospho-antibodies, which have been used extensively in the field. Our data in Figure 1 with palbociclib place FAM53C upstream of Cyclin D/CDK4,6. We performed Cyclin D overexpression experiments but RPE-1 cells did not tolerate high expression of Cyclin D1 (T286A mutant) and we have not been able to conduct more ‘genetic’ studies. 

      (4) At many places, information on statistical tests is missing and SDs are not shown in the plots. For instance, what statistics was used in Fig 4C? Impact of FAM53C on cyclin D phosphorylation does not seem to be significant. In the same experiment, does DYRK1 inhibitor prevent modification of cyclin D? 

      As discussed above, we removed some of these data and re-focused the manuscript on p53-p21 as a second pathway activated by loss of FAM53C.

      (5) Validation of SM13797 compound in terms of specificity to DYRK1 was not performed. 

      This is an important point. We had cited an abstract from the company (Biosplice) but we agree that providing data is critical. We have now revised the manuscript with a new analysis of the compound’s specificity using kinase assays. These data are shown in Fig. S3F-H.

      (6) A fraction of cells in G1 is a very easy readout but it does not measure progression through the G1 phase. Extension of the S phase or G2 delay would indirectly also result in reduction of the G1 fraction. Instead, authors could measure the dynamics of entry to S phase in cells released from a G1 block or from mitotic shake off. 

      The Reviewer made a good point. As discussed in our response to Reviewer #1, with p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide.

      Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells. These data indicate that G1 entry by flow cytometry will not always translate into proliferation.

      Other points:

      (7) Fig. 2C, 2D, 2E graphs should begin with 0 

      We remade these graphs.

      (8) Fig. 5D shows that the difference in p21 levels is not significant in FAM53C-KO cells but difference is mentioned in the text. 

      We replaced the panel by the correct panel; we apologize for this error.

      (9) Fig. 6D comparison of datasets of extremely different sizes does not seem to be appropriate

      We agree and revised the text. We hope that the Reviewer will agree with us that it is worth showing these data, which are clearly preliminary but provide evidence of a possible role for FAM53C in the brain.

      (10) Could there be alternative splicing in mice generating a partially functional protein without exon 4? Did authors confirm that the animal model does not express FAM53C? 

      We performed RNA sequencing of mouse embryonic fibroblasts derived from control and mutant mice. We clearly identified fewer reads in exon 4 in the knockout cells, and no other obvious change in the transcript (data not shown). However, immunoblot with mouse cells for FAM53C never worked well in our hands. We made sure to add this caveat to the revised manuscript.

      Reviewer #2 (Significance): 

      Main problem of this study is that the advanced experimental models in IPSCs and mice did not confirm the observations in the cell lines and thus the whole manuscript does not hold together. Although I acknowledge the effort the authors invested in these experiments, the data do not contribute to the main conclusion of the paper that FAM53C/DYRK1 regulates G1/S transition. 

      Reviewer #3 (Evidence, reproducibility and clarity):

      This paper identifies FAM53C as a novel regulator of cell cycle progression, particularly at the G1/S transition, by inhibiting DYRK1A. Using data from the Cancer Dependency Map, the authors suggest that FAM53C acts upstream of the Cyclin D-CDK4/6-RB axis by inhibiting DYRK1A.  Specifically, their experiments suggest that FAM53C Knockdown induces G1 arrest in cells, reducing proliferation without triggering apoptosis. DYRK1A Inhibition rescues G1 arrest in P53KO cells, suggesting FAM53C normally suppresses DYRK1A activity. Mass Spectrometry and biochemical assays confirm that FAM53C directly interacts with and inhibits DYRK1A. FAM53C Knockout in Human Cortical Organoids and Mice leads to cell cycle defects, growth impairments, and behavioral changes, reinforcing its biological importance. 

      Strength of the paper: 

      The study introduces a novel cell cycle control signalling module upstream of CDK4/6 in G1/S regulation which could have significant impact. The identification of FAM53C using a depmap correlation analysis is a nice example of the power of this dataset. The experiments are carried out mostly in a convincing manner and support the conclusions of the manuscript. 

      Critique: 

      (1) The experiments rely heavily on siRNA transfections without the appropriate controls. There are so many cases of off-target effects of siRNA in the literature, and specifically for a strong phenotype on S-phase as described here, I would expect to see solid results by additional experiments. This is especially important since the ko mice do not show any significant developmental cell cycle phenotypes. Moreover, FAM53C does not show a strong fitness effect in the depmap dataset, suggesting that it is largely non-essential in most cancer cell lines. For this paper to reach publication in a high-standard journal, I would expect that the authors show a rescue of the S-phase phenotype using an siRNA-resistant cDNA, and show similar S-phase defects using an acute knock out approach with lentiviral gRNA/Cas9 delivery. 

      We thank the Reviewer for this comment. Please refer to the initial response to the three Reviewers, where we discuss our use of single siRNAs and our results in multiple cell lines. Briefly, we can recapitulate the G1 arrest upon FAM53C knock-down using two independent siRNAs in RPE-1 cells. We also observe the same G1 arrest in p53 knockout cells, suggesting it is not due to a non-specific stress response. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype. Human cancer cell lines also arrest in G1 upon FAM53C knock-down, not just RPE-1 cells. Finally, we hope the Reviewer will agree with us that compensatory mechanisms are very common in the cell cycle – which may explain the lack of phenotypes in vivo or upon long-term knockout of FAM53C.

      (2) The S-phase phenotype following FAM53C should be demonstrated in a larger variety of TP53WT and mutant cell lines. Given that this paper introduces a new G1/S control element, I think this is important for credibility. Ideally, this should be done with acute gRNA/Cas9 gene deletion using a lentiviral delivery system; but if the siRNA rescue experiments work and validate an on-target effect, siRNA would be an appropriate alternative. 

      We now show data with three cancer cell lines (U2OS, A549, and HCT-116 – Fig. S1E,F and Fig. 4F), in addition to our results in RPE-1 cells and in human cortical organoids. We note that the knock-down experiments are complemented by overexpression data (Fig. 1G-I), by genetic data (our original DepMap screen), and our biochemical data (showing direct binding of FAM53C to DYRK1A).

      (3) The western blot images shown in the MS appear heavily over-processed and saturated (See for example S4B, 4A, B, and E). Perhaps the authors should provide the original un-processed data of the entire gels? 

      For several of our panels (e.g., 4E and S4B, now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      Data in 4A are also not a western blot but a radiograph.

      For immunoblots, we will provide all the source data with uncropped blots with the final submission.

      (4) A critical experiment for the proposed mechanism is the rescue of the FAM53C S-phase reduction using DYRK1A inhibition shown in Figure 4. The legend here states that the data were extracted from BrdU incorporation assays, but in Figure S4D only the PI histograms are shown, and the S-phase population is not quantified. The authors should show the BrdU scatterplot and quantify the phenotype using the S-phase population in these plots. G1 measurements from PI histograms are not precise enough to allow for conclusions. Also, why are the intensities of the PI peaks so variable in these plots? Compare, for example, the HCT116 upper and lower panels where the siRNA appears to have caused an increase in ploidy. 

      We apologize for the confusion and have fixed these errors. For most of the analyses, we used PI to measure G1 and S-phase entry. We added relevant flow cytometry plots to supplemental figures (Fig. S1G, H, I, as well as Fig. S4E and S4K, and Fig. S5F).

      (5) There's an apparent contradiction in how RB deletion rescues the G1 arrest (Figure 2) while p21 seems to maintain the arrest even when DYRK1A is inhibited. Is p21 not induced when FAM53C is depleted in RB ko cells? This should be measured and discussed. 

      This comment and comments from the two other Reviewers made us reconsider our model. We carefully re-read the Meyer paper and think that DYRK1A activity may be understood when considering levels of both CycD and p21 at the same time in a continuum (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is obvious that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      Reviewer #3 (Significance): 

      In conclusion, I believe that this MS could potentially be important for the cell cycle field and also provide a new target pathway that could be relevant for cancer therapy. However, the paper has quite a few gaps and inconsistencies that need to be addressed with further experiments. My main worry is that the acute depletion phenotypes appear so strong, while the gene is nonessential in mice and shows only a minor fitness effect in the depmap screens. More convincing controls are necessary to rule out experimental artefacts that misguide the interpretation of the results.

      We appreciate this comment and hope that the Reviewer will agree it is still important to share our data with the field, even if the phenotypes in mice are modest.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This valuable study examines how mammals descend effectively and securely along vertical substrates. The conclusions from comparative analyses based on behavioral data and morphological measurements collected from 21 species across a wide range of taxa are convincing, making the work of interest to all biologists studying animal locomotion.

      We would like to greatly thank the two reviewers for their time in reviewing this work, and for their valuable comments and suggestions that will help to improve this manuscript.

      Overall, we agree with the weaknesses raised, which are mainly areas for consideration in future studies: to study more species, and in a natural habitat context.

      We will nevertheless add a few modifications to improve the manuscript, notably by making certain figures more readable, and adding definitions and bibliography in the main text concerning gait characteristics.

      We also provide brief comments on each point of weakness raised by the reviewers below, in blue.

      Reviewer #1 (Public review):

      Summary:

      This unique study reports original and extensive behavioral data collected by the authors on 21 living mammal taxa in zoo conditions (primates, tree shrew, rodents, carnivorans, and marsupials) on how descent along a vertical substrate can be done effectively and securely using gait variables. Ten morphological variables reflecting head size and limb proportions are examined in relationship to vertical descent strategies and then applied to reconstruct modes of vertical descent in fossil mammals.

      Strengths:

      This is a broad and data-rich comparative study, which requires a good understanding of the mammal groups being compared and how they are interrelated, the kinematic variables that underlie the locomotion used by the animals during vertical descent, and the morphological variables that are associated with vertical descent styles. Thankfully, the study presents data in a cogent way with clear hypotheses at the beginning, followed by results and a discussion that addresses each of those hypotheses using the relevant behavioral and morphological variables, always keeping in mind the relationships of the mammal groups under investigation. As pointed out in the study, there is a clear phylogenetic signal associated with vertical descent style. Strepsirrhine primates much prefer descending tail first, platyrrhine primates descend sideways when given a choice, whereas all other mammals (with the exception of the raccoon) descend head first. Not surprisingly, all mammals descending a vertical substrate do so in a more deliberate way, by reducing speed, and by keeping the limbs in contact for a longer period (i.e., higher duty factors).

      Weaknesses:

      The different gait patterns used by mammals during vertical descent are a bit more difficult to interpret. It is somewhat paradoxical that asymmetrical gaits such as bounds, half bounds, and gallops are more common during descent since they are associated with higher speeds and lower duty factors. Also, the arguments about the limb support polygons provided by DSDC vs. LSDC gaits apply for horizontal substrates, but perhaps not as much for vertical substrates.

      We analyzed gait patterns using methods commonly found in the literature and discussed our results accordingly. However, the study of limb support polygons was indeed developed specifically for studying locomotion on horizontal supports, and may not be applicable for studying vertical locomotion, which is in fact a type of locomotion shared by all arboreal species. In the future, it would be interesting to consider new methods for analyzing vertical gaits.

      The importance of body mass cannot be overemphasized as it affects all aspects of an animal's biology. In this case, larger mammals with larger heads avoid descending head-first. Variation in trunk/tail and limb proportions also covaries with different vertical descent strategies. For example, a lower intermembral index is associated with tail-first descent. That said, the authors are quick to acknowledge that the five lemur species of their sample are driving this correlation. There is a wide range of intermembral indices among primates, and this simple measure of forelimb over hindlimb has vital functional implications for locomotion: primates with relatively long hindlimbs tend to emphasize leaping, primates with more even limb proportions are typically pronograde quadrupeds, and primates with relatively long forelimbs tend to emphasize suspensory locomotion and brachiation. Equally important is the fact that the intermembral index has been shown to increase with body mass in many primate families as a way to keep functional equivalence for (ascending) climbing behavior (see Jungers, 1985). Therefore, the manner in which a primate descends a vertical substrate may just be a by-product of limb proportions that evolved for different locomotor purposes. Clearly, more vertical descent data within a wider array of primate intermembral indices would clarify these relationships. Similarly, vertical descent data for other primate groups with longer tails, such as arboreal cercopithecoids, and particularly atelines with very long and prehensile tails, should provide more insights into the relationship between longer tail length and tail-first descent observed in the five lemurs. The relatively longer hallux of lemurs correlates with tail-first descent, whereas the more evenly grasping autopods of platyrrhines allow for all four limbs to be used for sideways descent. In that context, the pygmy loris offers a striking contrast. 
      Here is a small primate equipped with four pincer-like, highly grasping autopods and a tail reduced to a short stub. Interestingly, this primate is unique within the sample in showing the strongest preference for head-first descent, just like other non-primate mammals. Again, a wider sample of primates should go a long way in clarifying the morphological and behavioral relationships reported in this study.

      We agree with this statement. In the future, we plan to study other species, particularly large-bodied ones with varied intermembral indexes.

      Reconstruction of the ancient lifestyles, including preferred locomotor behaviors, is a formidable task that requires careful documentation of strong form-function relationships from extant species that can be used as analogs to infer behavior in extinct species. The fossil record offers challenges of its own, as complete and undistorted skulls and postcranial skeletons are rare occurrences. When more complete remains are available, the entire evidence should be considered to reconstruct the adaptive profile of a fossil species rather than a single ("magic") trait.

      We completely agree with this, and we would like to emphasize that our intention here was simply to conduct a modest inference test, the purpose of which is to provide food for thought for future studies, and whose results should be considered in light of a comprehensive evolutionary model.

      Reviewer #2 (Public review):

      Summary:

      This paper contains kinematic analyses of a large comparative sample of small to medium-sized arboreal mammals (n = 21 species) traveling on near-vertical arboreal supports of varying diameter. This data is paired with morphological measures from the extant sample to reconstruct potential behaviors in a selection of fossil euarchontoglires. This research is valuable to anyone working in mammal locomotion and primate evolution.

      Strengths:

      The experimental data collection methods align with best research practices in this field and are presented with enough detail to allow for reproducibility of the study as well as comparison with similar datasets. The four predictions in the introduction are well aligned with the design of the study to allow for hypothesis testing. Behaviors are well described and documented, and Figure 1 does an excellent job in conveying the variety of locomotor behaviors observed in this sample. I think the authors took an interesting and unique angle by considering the influence of encephalization quotient on descent and the experience of forward pitch in animals with very large heads.

      Weaknesses:

      The authors acknowledge the challenges that are inherent with working with captive animals in enclosures and how that might influence observed behaviors compared to these species' wild counterparts. The number of individuals per species in this sample is low; however, this is consistent with the majority of experimental papers in this area of research because of the difficulties in attaining larger sample sizes.

      Yes, that is indeed the main cost/benefit trade-off with this type of study. Working with captive animals allows for large comparative studies, but there is a risk of variations in locomotor behavior among individuals in the natural environment, as well as few individuals per species in the dataset. That is why we plan and encourage colleagues to conduct studies in the natural environment to compare with these results. However, this type of study is very time-consuming and requires focusing on a single species at a time, which limits the comparative aspect.

      Figure 2 is difficult to interpret because of the large amount of information it is trying to convey.

      We agree that this figure is dense. One possible solution would be to combine species by phylogenetic groups to reduce the amount of information, as we did with Fig. 3 on the dataset relating to gaits. However, we believe that this would be unfortunate in the case of speed and duty factor because we would have to provide the complete figure in SI anyway, as the species-level information is valuable. We therefore prefer to keep this comprehensive figure here and we will enlarge the data points to improve their visibility, and provide the figure with a sufficiently high resolution to allow zooming in on the details.

      Reviewer #1 (Recommendations for the authors):

      As indicated in the first section above, this is a strong comparative study that addresses important questions, relative to the evolution of arboreal locomotion in primates and close mammal relatives. My recommendations should be taken in the context of improving a manuscript that is already generally acceptable.

      (1) The terms symmetrical and asymmetrical gaits should be briefly defined in the main text (not just in the Methods section) by citing work done by Hildebrand and other relevant studies. To that effect, the statement on lines 96-97 about the convergence of symmetrical gaits is unclear. What does "Symmetrical gaits have evolved convergently in rodents, scandentians, carnivorans, and marsupials" mean? Symmetrical gaits such as the walk, run, trot, etc., are pretty much the norm in most mammals and were likely found in metatherians and basal eutherians. This needs clarification. On line 239, the term "ambling" is used in the context of related asymmetrical gaits. To be clear, the amble is a type of running gait involving no whole-body aerial phase and is therefore a symmetrical gait (see Schmitt et al., 2006).

      We have added a definition of the terms symmetrical and asymmetrical gaits, with references, in the introduction, as follows: “Symmetrical gaits are defined as locomotor patterns in which the footfalls of a girdle (a pair of fore- or hindlimbs) are evenly spaced in time, with the right and left limbs of a pair of limbs being approximately 50% out of phase with each other (Hildebrand, 1966, 1967). Symmetrical gaits can be further divided into two types: diagonal-sequence gaits, in which a hindlimb footfall is followed by that of the contralateral forelimb, and lateral-sequence gaits, in which a hindlimb footfall is followed by that of the ipsilateral forelimb (Hildebrand, 1967; Shapiro and Raichlen, 2005; Cartmill et al., 2007b). In contrast, asymmetrical gaits are characterized by unevenly spaced footfalls within a girdle, with the right and left limbs moving in near synchrony (Hildebrand, 1977).” Now found in lines 87-94.
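
      The definitions quoted above can be sketched as a small classifier operating on touchdown times within one stride. This is only an illustration of the definitions, not the procedure used in the manuscript: the function names, the limb labels (LH, RH, LF, RF) and the tolerance around the 50% phase lag are all assumptions introduced here.

```python
def limb_phase(lead_touchdown, other_touchdown, stride_duration):
    """Fraction of the stride (0 to 1) by which `other` lands after `lead`."""
    return ((other_touchdown - lead_touchdown) % stride_duration) / stride_duration


def classify_gait(touchdowns, stride_duration, tolerance=0.1):
    """Classify one stride from limb touchdown times (in seconds).

    `touchdowns` maps 'LH', 'RH', 'LF', 'RF' to touchdown times.
    `tolerance` is a hypothetical threshold for "approximately 50%".
    """
    hind_lag = limb_phase(touchdowns["LH"], touchdowns["RH"], stride_duration)
    fore_lag = limb_phase(touchdowns["LF"], touchdowns["RF"], stride_duration)

    # Symmetrical: left and right limbs of each pair are about 50% out of phase.
    symmetrical = all(abs(lag - 0.5) <= tolerance for lag in (hind_lag, fore_lag))
    if not symmetrical:
        return "asymmetrical"

    # Which forelimb lands first after the left hindlimb touches down?
    ipsilateral = limb_phase(touchdowns["LH"], touchdowns["LF"], stride_duration)
    contralateral = limb_phase(touchdowns["LH"], touchdowns["RF"], stride_duration)
    return "lateral-sequence" if ipsilateral < contralateral else "diagonal-sequence"


# Hypothetical strides of 1 s each (illustrative values, not measured data).
lateral = {"LH": 0.0, "LF": 0.25, "RH": 0.5, "RF": 0.75}
diagonal = {"LH": 0.0, "RF": 0.25, "RH": 0.5, "LF": 0.75}
bound = {"LH": 0.0, "RH": 0.05, "LF": 0.5, "RF": 0.55}
labels = [classify_gait(s, 1.0) for s in (lateral, diagonal, bound)]
```

      The left hindlimb is used as the reference footfall purely for convenience; starting from the right hindlimb would give the same label for a symmetrical stride.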

      We corrected the sentence to read: “Symmetrical gaits are also common in rodents, scandentians, etc..” Now found in line 107.

      Thank you for pointing this out. We indeed did not use the right term to mention related asymmetrical gaits with increased duty factors. We removed the term “ambling” and the associated reference here. Now found in line 256.

      (2) Correlations are used in the paper to examine how brain mass scales with body mass. It is correct to assume that a correlation significantly different from 0 is indicative of allometry (in this case, positive). That said, lines are used in Figure S2 that go through the bivariate scatter plot. The vast majority of scaling studies rely on regression techniques to calculate and compare slopes, which are different statistically from correlations. In this case, a slope not significantly different from 1.0 would support the hypothesis of isometry based on geometric similarity (as brain mass and body mass are two volumes). The authors could refer to the work of Bob Martin and the 1985 edited book by Jungers and contributions therein. These studies should also be cited in the paper.

      Thank you for recommending this better-suited method. We replaced the correlations with major axis orthogonal regressions, as recommended by Martin and Barbour 1989. Over all species, we found a positive slope (0.36) that is significantly different from 1, indicating a negative allometry (we realized we were mistaken about the allometry terminology, initially reporting a “positive allometry” instead of a positive correlation).

      We corrected in the manuscript in the Results and Methods sections, and cited Martin and Barbour 1989 such as:

      “To ensure that the EQs of the different species studied are comparable and meaningful, we tested the allometry between the brain and body masses in our dataset following [84] and found a significant and positive slope for all species (major axis orthogonal regression on log transformed values: slope = 0.36, r<sup>2</sup> = 0.92, p = 5.0 × 10<sup>-12</sup>), indicating a negative allometry (r = 0.97, df = 19, p = 2.0 × 10<sup>-13</sup>), and similar allometric coefficients when restricting the analysis to phylogenetic groups (Fig. S2).” Now found in lines 289-298.

      - “To control that brain allometry is homogeneous among all phylogenetic groups, to be able to compare EQ between species, we computed major axis orthogonal regressions, following the recommendation of Martin and Barbour [84], between the Log transformed brain and body masses, over all species and by phylogenetic group using the sma package in R (Fig. S2).” Now found in lines 336-338.

      We also changed Figure S2 in Supplementary Information accordingly.
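
      For readers unfamiliar with the method, the slope of a major axis (orthogonal) regression can be computed directly from the variances and covariance of the log-transformed masses. The sketch below is a plain-Python illustration, not the R routine used for the manuscript; the function name and the four body and brain masses are hypothetical, and no significance test or confidence interval for the slope is computed.

```python
import math


def major_axis_fit(x, y):
    """Return (slope, intercept) of the major axis line through (x, y).

    The major axis minimises perpendicular distances to the line, so it
    treats both variables as measured with error.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    # Closed-form slope of the first principal axis of the covariance matrix.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical masses in grams, generated to follow brain = 0.1 * body**0.36.
body_mass = [10.0, 100.0, 1000.0, 10000.0]
brain_mass = [0.1 * m ** 0.36 for m in body_mass]

log_body = [math.log10(m) for m in body_mass]
log_brain = [math.log10(m) for m in brain_mass]
slope, intercept = major_axis_fit(log_body, log_brain)

# Both masses scale as volumes, so a slope of 1 would indicate isometry;
# a slope below 1 corresponds to negative allometry.
is_negative_allometry = slope < 1.0
```

      With real data the points scatter around the line, so the slope would need a confidence interval before being compared with 1.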

      (3) Trunk length is used as the denominator for many of the indices used in the study. In this way, trunk length is considered to be a proxy for body size. There should be a demonstration that trunk length scales isometrically with body mass in all of the mammals compared. If not the case, some of the indices may not be directly comparable.

      We did not use trunk length as a proxy for body mass, but to compute geometric body proportions in order to test whether intrinsic body proportions could be related to vertical descent behaviors, namely the length of the tail and of the fore- and hindlimbs relative to the animal. We chose those indices to quantify the capability of limbs to act as levers or counterweights to rotate the animals for this specific question of vertical descent behavior. We therefore do not think that body mass allometry with respect to trunk length is relevant to compare these indices across species here. Also, we don’t expect that trunk length (which is a single dimension) would scale isometrically with body mass, which scales more as a volume.
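
      The dimensional argument in the last sentence can be written as a short scaling relation. This is a sketch under two assumptions not stated in the response, namely geometric similarity across species and roughly constant body density; L denotes trunk length, V body volume, M body mass, and c a constant.

```latex
L \propto V^{1/3}, \qquad V \propto M
\;\Longrightarrow\; L \propto M^{1/3}
\;\Longleftrightarrow\; \log L = \tfrac{1}{3}\,\log M + c
```

      Under these assumptions, geometric similarity predicts a log-log slope of 1/3 for trunk length against body mass, not a slope of 1.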

      (4) Given the numerous comparisons done in this study, a Bonferroni correction method should be considered to mitigate type I error (accepting a false positive).

      We had already corrected all our statistical tests using the Benjamini-Hochberg method to control for false positives; see the SuppTables Excel file for the complete results of the statistical analyses. We chose this method over the Bonferroni correction because the more modern and balanced Benjamini-Hochberg procedure is better suited for analyses involving a large number of hypotheses.
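
      The difference between the two corrections can be illustrated with a minimal sketch. The p-values below are hypothetical and the function names are introduced here for illustration; the adjusted values actually used are those reported in the SuppTables file.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(p_values):
    """Return Bonferroni adjusted p-values (each multiplied by the test count)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
alpha = 0.05
bh_significant = sum(p < alpha for p in benjamini_hochberg(raw))
bonf_significant = sum(p < alpha for p in bonferroni(raw))
```

      Benjamini-Hochberg controls the expected proportion of false positives among the significant results, whereas Bonferroni controls the probability of any false positive, which is why it becomes more conservative as the number of tests grows.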

      (5) The terms "arm" and "leg" used in the main text and Table 1 are anatomically incorrect. Instead, the terms "forelimb" and "hindlimb" should be used as they include the length sum of the stylopod, zeugopod, and autopod.

      Indeed, thank you for pointing that out. We have corrected this error within the manuscript as well as in Figures 4 and S3.

      (6) On p. 14, the authors make the statement that the postcranial anatomy of Adapis and Notharctus remains undescribed. The authors should consult the work of Dagosto, Covert, Godinot and others.

      We did not state that the postcranial remains of Adapis and Notharctus have not been described. However, we were unfortunately unable to find published illustrations of the known postcranial elements that could be reliably used in this study. To avoid any misunderstanding, we rephrased the sentence as follows: “However, we could not find suitable illustrations of the known postcranial elements of these species in the literature that could be reliably incorporated into this study. Thus, we only included their reconstructed body mass and EQ,..”. Now found in lines 393-397.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 65/69 - Perchalski et al. 2021 is a single-author publication, so no et al. or w/ colleagues.

      Indeed. This has been corrected in the manuscript, now found in lines 65 and 70.

      (2) Lines 96-98 - Is it appropriate to say that the use of symmetrical gaits are examples of convergent evolution? There's less burden of evidence to state that these are shared behaviors, rather than suggesting they independently evolved across all those groups.

      We agree with this and corrected the sentence to read: “Symmetrical gaits are also common in rodents, scandentians, etc..” Now found in line 107.

      (3) Line 198 - I am confused by how to interpret (-16,36 %) compared to how other numbers are presented in the rest of the paragraph.

      To avoid confusion, we rephrased this sentence as follows: “In contrast, primates did not significantly reduce their speed compared to ascents when descending sideways or tail-first (Fig. 2A, SuppTables B).” Now found in lines 207-209.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, the authors aim to understand how Rhino, a chromatin protein essential for small RNA production in fruit flies, is initially recruited to specific regions of the genome. They propose that asymmetric arginine methylation of histones, particularly mediated by the enzyme DART4, plays a key role in defining the first genomic sites of Rhino localization. Using a combination of inducible expression systems, chromatin immunoprecipitation, and genetic knockdowns, the authors identify a new class of Rhino-bound loci, termed DART4 clusters, that may represent nascent or transitional piRNA clusters.

      Strengths:

      One of the main strengths of this work lies in its comprehensive use of genomic data to reveal a correlation between ADMA histones and Rhino enrichment at the border of known piRNA clusters. The use of both cultured cells and ovaries adds robustness to this observation. The knockdown of DART4 supports a role for H3R17me2a in shaping Rhino binding at a subset of genomic regions.

      Weaknesses:

      However, Rhino binding at, and piRNA production from, canonical piRNA clusters appears largely unaffected by DART4 depletion, and spreading of Rhino from ADMA-rich boundaries was not directly demonstrated. Therefore, while the correlation is clearly documented, further investigation would be needed to determine the functional requirement of these histone marks in piRNA cluster specification.

      The study identifies piRNA cluster-like regions called DART4 clusters. While the model proposes that DART4 clusters represent evolutionary precursors of mature piRNA clusters, the functional output of these clusters remains limited. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing.

      In summary, the authors present a well-executed study that raises intriguing hypotheses about the early chromatin context of piRNA cluster formation. The work will be of interest to researchers studying genome regulation, small RNA pathways, and the chromatin mechanisms of transposon control. It provides useful resources and new candidate loci for follow-up studies, while also highlighting the need for further functional validation to fully support the proposed model.

      We sincerely thank Reviewer #1 for the thoughtful and constructive summary of our work. We appreciate the reviewer’s recognition that our study provides a comprehensive analysis of the relationship between ADMA-histones and Rhino localization, and that it raises intriguing hypotheses about the early chromatin context of piRNA cluster formation.

      We fully agree with the reviewer that our data primarily demonstrate correlation between ADMA-histones and Rhino localization, rather than direct causation. In response, we have carefully revised the text throughout the manuscript to avoid overstatements implying causality (details provided below).

      We also acknowledge the reviewer’s important point that the functional requirement of ADMA-histones for piRNA cluster specification remains to be further established. We have now added a discussion of our experimental limitations (page 18).

      Overall, we have revised the manuscript to present our findings more cautiously and transparently, emphasizing that our data reveal a correlation between ADMA-histone marks and the initial localization of Rhino, rather than proving a direct mechanistic requirement. We thank the reviewer again for highlighting these important distinctions.

      Reviewer #2 (Public review):

      This study seeks to understand how the Rhino factor knows how to localize to specific transposon loci and to specific piRNA clusters to direct the correct formation of specialized heterochromatin that promotes piRNA biogenesis in the fly germline. In particular, these dual-strand piRNA clusters with names like 42AB, 38C, 80F, and 102F generate the bulk of ovarian piRNAs in the nurse cells of the fly ovary, but the evolutionary significance of these dual-strand piRNA clusters remains mysterious since triple null mutants of these dual-strand piRNA clusters still allow fly ovaries to develop and remain fertile. Nevertheless, mutants of Rhino and its interactors Deadlock, Cutoff, Kipferl and Moonshiner, etc., cause more piRNA loss beyond these dual-strand clusters and exhibit the phenotype of major female infertility, so the impact of proper assembly of Rhino, the RDC, Kipferl etc. onto proper piRNA chromatin is an important and interesting biological question that is not fully understood.

      This study tries to first test ectopic expression of Rhino via engineering a Dox-inducible Rhino transgene in the OSC line that only expresses the primary Piwi pathway that reflects the natural single pathway expression in the follicle cells and is quite distinct from the nurse cell germline piRNA pathway that is promoted by Rhino, Moonshiner, etc. The authors present some compelling evidence that this ectopic Rhino expression in OSCs may reveal how Rhino can initiate de novo binding via ADMA histone marks, a feat that would be much more challenging to demonstrate in the germline where this epigenetically naïve state cannot be modeled since germ cell collapse would likely ensue. In the OSC, the authors have tested the knockdown of four of the 11 known Drosophila PRMTs (DARTs), and comparing to ectopic Rhino foci that they observe in HP1a knockdown (KD), they conclude DART1 and DART4 are the prime factors to study further in looking for disruption of ADMA histone marks. The authors also test KD of DART8 and CG17726 in OSCs, but in the fly, the authors test germline KD (GLKD) of DART4 only. They do not explain why the other DARTs are not tested in GLKD; the UAS-RNAi resources in Drosophila strain repositories should be very complete and have reagents for these knockdowns.

      The authors only characterize some particular ADMA marks of H3R17me2a as showing strong decrease after DART4 GLKD, and then they see some small subset of piRNA clusters go down in piRNA production as shown in Figure 6B and Figure 6F and Supplementary Figure 7. This small subset of DART4-dependent piRNA clusters does lose Rhino and Kipferl recruitment, which is an interesting result.

      However, the biggest issue with this study is the mystery that the set of the most prominent dual-strand piRNA clusters, 42AB, 38C, 80F, and 102F, are the prime genomic loci subjected to Rhino regulation, and they do not show any change in piRNA production in the GLKD of DART4. The authors bury this surprising negative result in Supplementary Figure 5E, but this is also evident in no decrease (actually an n.s. increase) in Rhino association in Figure 5D. Since these main piRNA clusters involve the RDC, Kipferl, Moonshiner, etc., and they do not change in ADMA status or lose piRNAs after DART4 GLKD, this poses a problem with the model in Figure 7C. In this study, there is only a GLKD of DART4 and no GLKD of the other DARTs in fly ovaries.

      One way the authors rationalize this peculiar exception is the argument that DART4 is only acting on evolutionarily "young" piRNA clusters like the bx, CG14629, and CG31612, but the lack of any change on the majority of other piRNA clusters in Figure 6F leaves open the unsatisfying concern that there is much functional redundancy remaining with other DARTs not being tested by GLKD in the fly that would have a bigger impact on the other main dual-strand piRNA clusters being regulated by Rhino and ADMA-histone marks.

      Also, the current data does not provide convincing enough support for the model in Figure 7C and the paper title of ADMA-histones being the key determinant in the fly ovary for Rhino recognition of the dual-strand piRNA clusters. Although much of this study's data is well constructed and presented, there remains a large gap that no other DARTs were tested in GLKD that would show a big loss of piRNAs from the main dual-strand piRNA clusters of 42AB, 38C, 80F, and 102F, where Rhino has prominent spreading in these regions.

      As the manuscript currently stands, I do not think the authors present enough data to conclude that "ADMA-histones [As a Major new histone mark class] does play a crucial role in the initial recognition of dual-strand piRNA cluster regions by Rhino" because the data here mainly just show a small subset of evolutionarily young piRNA clusters have a strong effect from GLKD of DART4. The authors could extensively revise the study to be much more specific in the title and conclusion that they have uncovered this very unique niche of a small subset of DART4-dependent piRNA clusters, but this niche finding may dampen the impact and significance of this study since other major dual-strand piRNA clusters do not change during DART4 GLKD, and the authors do not show data for GLKD of any other DARTs. The niche finding of just a small subset of DART4-dependent piRNA clusters might make another specialized genetics forum a more appropriate venue.

      We are deeply grateful to Reviewer #2 for the detailed and insightful review that carefully situates our study in the broader context of Rhino-mediated piRNA cluster regulation. We appreciate the reviewer’s recognition that our inducible Rhino expression system in OSCs provides a valuable model to explore de novo Rhino recruitment under a simplified chromatin environment.

      At the same time, we agree that the current data mainly support a role for DART4 in regulating a subset of evolutionarily young piRNA clusters, and do not demonstrate a requirement for ADMA-histones at the major dual-strand piRNA clusters such as 42AB or 38C. We have therefore revised the title and main conclusions to more accurately reflect the scope of our findings.

      We agree with the reviewer that functional redundancy among DARTs may explain why major dual-strand piRNA clusters are unaffected by DART4 GLKD. Indeed, we attempted GLKD of DART1, whose knockdown causes collapse of Rhino foci in OSCs. For DART1 GLKD, two approaches were possible:

      (1) Crossing the BDSC UAS-RNAi line (ID: 36891) with nos-GAL4.

      (2) Crossing the VDRC UAS-RNAi line (ID: 110391) with nos-GAL4 and UAS-Dcr2.

      The first approach was not feasible because the UAS-RNAi line always arrived dead (DOA) and could not be maintained in our laboratory. The second approach did not yield effective and stable knockdown (as follows).

      DART8 and CG17726 did not alter Rhino foci in OSC knockdown experiments; therefore, we did not attempt germline knockdown (GLKD) of these DARTs in the ovary. We agree with the reviewer’s opinion that there are piRNA source loci where Rhino localization depends on DART1, and that simultaneous depletion of multiple DARTs may indeed reveal additional positive results because ADMA-histones such as H3R8me2a may be completely eliminated by the knockdown of multiple DARTs. At the same time, we note that many evolutionarily conserved piRNA clusters show a loss of ADMA accumulation compared with evolutionarily young piRNA clusters, with levels that are comparable to the background input in ChIP-seq reads. Therefore, conserved clusters such as 42AB and 38C may no longer be regulated by ADMA. Even if multiple DARTs function redundantly to regulate ADMA, it may be difficult to disrupt Rhino localization at such conserved piRNA clusters by depletion of DARTs. While disruption of Rhino localization at conserved clusters like 42AB and 38C may be challenging, we cannot exclude the possibility that DART depletion affects Rhino binding at less conserved piRNA clusters, where ADMA modification remains detectable. We added clarifications in the Discussion to acknowledge the potential redundancy with other DARTs and to note that further knockdown experiments in the germline will be necessary to test this model comprehensively (page 18).

      We appreciate the reviewer’s critical feedback, which has helped us refine the message and strengthen the interpretative balance of the paper.

      Reviewer #1 (Recommendations for the authors):

      In multiple places, the link between ADMA histones and Rhino recruitment is presented in terms that imply causality. Please revise these statements to reflect that, in most cases, the evidence supports correlation rather than direct functional necessity. Similarly, statements suggesting that ADMA histones promote Rhino spreading should be revised unless supported by direct evidence.

      We sincerely thank the reviewer for the insightful comments. We recognize that these suggestions are crucial for improving the manuscript, and we have revised it accordingly to address the concerns. The specific revisions we made are detailed below.

      (1) Page 1, line 14: The original sentence “in establishing the sites” was changed to “may establish the potential sites.”

      (2) Page 4, lines 11-12: The original sentence “genomic regions where Rhino binds at the ends and propagates in the areas in a DART4-dependent manner, but not stably anchored” was changed to “genomic regions that have ADMA-histones at their ends and exhibit broad Rhino spreading across their internal regions in a DART4-dependent manner”.

      (3) Page 4, lines 12-15: The original sentence “Kipferl is present at the regions but not sufficient to stabilize Rhino-genomic binding after Rhino propagates.” was changed to “In contrast to authentic piRNA clusters, Kipferl was lost together with Rhino upon DART4 depletion in these regions, suggesting that Kipferl by itself is not sufficient to stabilize Rhino binding; rather, their localization depends on DART4.”

      (4) Page 4, lines 17-18: The original sentence “are considered to be primitive clusters” was changed to “might be nascent dual-strand piRNA source loci”.

      (5) Page 8, line 7: The original sentence “Involvement of ADMA-histones in the genomic localization of Rhino was implicated.” was changed to “Correlation of ADMA-histones in the genomic localization of Rhino was implicated.”

      (6) Page 8, lines 19-21: The original sentence “These results suggest that ADMA-histones, together with H3K9me3, contribute significantly and specifically to the recruitment of Rhino to the ends of dual-strand clusters in OSCs.” was changed to “These results raise the possibility that ADMA-histones, together with H3K9me3, may contribute specifically to the recruitment of Rhino to the ends of dual-strand clusters in OSCs.”

      (7) Page 10, lines 11-13: The original sentence “These results suggest that DART1 and DART4 are involved in Rhino recruitment at distinct genomic sites through the decreases in ADMA-histones in each of their KD conditions (H4R3me2a and H3R17me2a, respectively).” was changed to “These results suggest that DART1 and DART4 could contribute to Rhino recruitment at distinct genomic sites through the decreases in ADMA-histones in each of their KD conditions (H4R3me2a and H3R17me2a, respectively).”

      (8) Page 13, line 2: The original sentence “Genomic regions where Rhino spreads in a DART4-dependent manner, but not stably anchored, produce some piRNAs” was changed to “Genomic regions where Rhino binds broadly in a DART4-dependent manner, but not stably anchored, produce some piRNAs”.

      (9) Page 13, lines 21-22: The original sentence “These results support the hypothesis that ADMA-histones are involved in the genomic binding of Rhino both before and after Rhino spreading, resulting in stable genome binding.” was changed to “These results raise the possibility that a subset of Rhino localized to genomic regions correlating with ADMA-histones may serve as origins of spreading.”

      (10) Page 16, lines 6-8: The original sentence “In this study, we took advantage of cultured OSCs for our analysis and found that chromatin marks (i.e., ADMA-histones) play a crucial role in the loading of Rhino onto the genome.” was changed to “In this study, we took advantage of cultured OSCs for our analysis and found that chromatin marks (i.e., bivalent nucleosomes containing H3K9me3 and ADMA-histones) appear to contribute to the initial loading of Rhino onto the genome.”

      (11) Page 16, line 12: The original sentence “We propose that the process of piRNA cluster formation begins with the initial loading of Rhino onto bivalent nucleosomes containing H3K9me3 and ADMA-histones (Fig. 7C). In OSCs, the absence of Kipferl and other necessary factors means that Rhino loading into the genome does not proceed to the next step.” was removed.

      Major points

      (1)  Clarify the limited colocalization between Rhino and H3K9me3 in OSCs. The observation that FLAG-Rhino foci show minimal overlap with H3K9me3 in OSCs appears inconsistent with the proposed model by the authors in the discussion, in which Rhino is initially recruited to bivalent nucleosomes bearing both H3K9me3 and ADMA marks. This discrepancy should be addressed. 

      We thank the reviewer for the insightful comments. Indeed, ChIP-seq shows that Rhino partially overlaps with H3K9me3 (Fig. 1F), but immunofluorescence did not reveal any detectable overlap (Fig. 1A). We interpret this discrepancy as arising from the fact that immunofluorescence primarily visualizes H3K9me3 foci that are localized as broad domains in the genome, such as those at centromeres, pericentromeres, or telomeres (named chromocenters), whereas the sharp and interspersed H3K9me3 signals along chromosome arms are difficult to detect by immunofluorescence. We have added this explanation to the revised text (page 6).

      (2)  Please indicate whether the FLAG-Rhino used in OSCs has been tested for functionality in vivo, for example, by rescuing Rhino mutant phenotypes. This is particularly relevant given that no spreading is observed with this construct.

      We thank the reviewer for raising this important point. We have not directly tested the functionality of the FLAG-Rhino construct used in OSCs in living Drosophila; i.e., it has not been used to rescue Rhino mutant phenotypes in flies. We acknowledge that FLAG-Rhino has not previously been expressed in OSCs, and that its localization pattern in OSCs differs from that observed in ovaries, where Rhino is endogenously expressed. However, several lines of evidence suggest that the addition of the N-terminal FLAG tag is unlikely to compromise Rhino function:

      (1) In previous studies, N-terminally tagged Rhino (e.g., 3xFLAG-V5-Precision-GFP-Rhino) was expressed in a living Drosophila ovary and was shown to localize properly to piRNA clusters, indicating that the tag does not prevent Rhino from binding its genomic targets (Baumgartner et al., 2022; eLife. Fig. 3 supplement 1G).

      (2) In Drosophila S2 cells, a FLAG-tagged tandem Rhino chromodomain construct was shown to bind H3K9me3/H3K27me3 bivalent chromatin, demonstrating that the FLAG tag does not impair this fundamental chromatin interaction (Akkouche et al., 2025; Nat Struct Mol Biol. Fig. 4b).

      (3) GFP-tagged Rhino has been demonstrated to rescue the transposon derepression phenotype of Rhino mutant flies, further supporting that the addition of tags does not abolish its in vivo function (Parhad et al., 2017; Dev Cell. Fig. 1D).

      Therefore, we interpret the partial localization of FLAG-Rhino in OSCs as reflecting the specific chromatin environment and regulatory context of OSCs rather than functional impairment due to the FLAG tag.

      (3) Given the low levels of piRNA production and the absence of measurable effects on transposon expression or fertility upon DART4 knockdown, the rationale for classifying these regions as piRNA clusters should be clearly stated. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing. The authors should also consider and discuss the possibility that some of these differences may reflect background-specific genomic variation rather than DART4-dependent regulation per se.

      We thank the reviewer for the insightful comments. As noted, DART4 knockdown did not measurably affect transposon expression or fertility. piRNAs generated from DART4-associated clusters associate with Piwi but are insufficient for target repression. Although loss of DART4 largely eliminated piRNAs from these clusters, the cluster-derived transcripts themselves were unchanged. To clarify this point, we now refer to these regions as DART4-dependent piRNA-source loci (DART4 piSLs) in the revised text. We also acknowledge that some observed differences may reflect strain-specific genomic variation and have added this caveat on page 16.

      (4)  The authors should describe the genomic context of DART4 clusters in more detail. Specifically, it would be helpful to indicate whether these regions overlap with known transposable elements, gene bodies, or intergenic regions, and to report the typical size range of the clusters. Are any of the piRNAs produced from these clusters predicted to target known transcripts? 

      We thank the reviewer for the insightful comments. The overlap of DART4 piSL with transposable elements, gene bodies, and intergenic regions is shown in the right panel of Supplementary Fig. 6E (denoted as “Rhino reduced regions in DART4 GLKD” in the figure). The typical size range of these clusters is presented in Supplementary Fig. 6G. The annotation of piRNA reads derived from these piSL is shown in the right panel of Supplementary Fig. 6F, indicating that most of them appear to target host genes. The specific genes and transposons matched by the piRNAs produced from DART4 piSL are listed in Supplementary Table 8.

      (5)  While correlations between Rhino and ADMA histone marks (especially H3R8me2a, H3R17me2a, H4R3me2a) are robust, many ADMA-enriched regions do not recruit Rhino. Please discuss this observation and consider the possible involvement of additional factors.

      We thank the reviewer for the insightful comments. As pointed out, not all ADMA-enriched regions recruit Rhino; rather, Rhino is recruited only at sites where ADMAs overlap with H3K9me3. Furthermore, the combination of H3K9me3 and ADMAs alone does not fully account for the specificity of Rhino recruitment, suggesting the involvement of additional co-factors (for example, other ADMA marks such as H3R42me2a, or chromatin-interacting proteins). In addition, since histone modifications, including arginine methylation, may be secondary consequences of modifications on other proteins rather than primary regulatory events, it is possible that DART1/4 contribute to Rhino recruitment not only through histone methylation but also via arginine methylation of non-histone chromatin-interacting factors. However, methylation of HP1a does not appear to be involved (Supplementary Fig. 3G). We have added new sentences about these points in the Discussion section (page 18).

      (6) The manuscript states that Kipferl is present at DART4 clusters but does not stabilize Rhino binding. Please specify which experimental results support this conclusion and explain.

      We apologize for the lack of clarity regarding Kipferl data. Supplementary Fig. 7A and 7B show that Kipferl localizes at major DART4 piSL. This Kipferl localization is lost together with Rhino upon DART4 GLKD, indicating that Rhino localization at DART4 piSL depends on DART4 rather than on Kipferl. From these results, we infer that, unlike at authentic piRNA clusters, Kipferl may not be sufficient to stabilize the association of Rhino with the genome at DART4 piSL. We have added this interpretation on page 14.

      Minor points

      (1) Figure 1D: Please specify which piRNA clusters are included in the metaplot - all clusters, or only the major producers? 

      We thank the reviewer for the question. The metaplot was not generated from a predefined list of “all” piRNA clusters or only the “major producers.” Instead, it was constructed from Rhino ChIP–seq peaks (“Rhino domains”) that are ≥1.5 kb in length. These Rhino domains mainly correspond to the subregions within major dual-strand clusters (e.g., 42AB, 38C) as well as additional clusters such as 80F, 102F, and eyeless, among others. We have provided the full list of domains and their corresponding piRNA clusters (with genomic coordinates) in Supplementary Table 9 and added an explanation to the Fig. 1D legend.

      (2) Supplemental Figure 5E is referred to as 5D in the main text.

      We corrected the figure citations on pages 11-12: the reference to Supplementary Fig. 5E has been changed to 5D, and the reference to Supplementary Fig. 5F has been changed to 5E.

      (3) Supplemental Figure 7C: The color legend does not match the pie chart, which may confuse readers.

      We thank the reviewer for the helpful comment. We are afraid we were not entirely sure what specific aspect of the legend was confusing, but to avoid any possible misunderstanding, we revised Supplemental Fig. 7C so that the color boxes in the legend now exactly match the corresponding colors in the pie chart. We hope this modification improves clarity.

      (4) Since the manuscript focuses on the roles of DART1 and DART4, including their expression profiles in OSCs and ovaries would help contextualize the observed phenotypes. Please consider adding this information if available.

      We thank the reviewer for the suggestion. We have now included a scatter plot comparing RNA-seq expression in OSCs and ovaries (Supplementary Fig. 3H). In these datasets, DART1 is strongly expressed in both tissues, whereas DART4 shows no detectable reads. Notably, ref. 28 reports strong expression of both DART1 and DART4 in ovaries by western blot and northern blot. In our own qPCR analysis in OSCs, DART4 expression is about 3% of DART1, which, although low, may still be sufficient for functional roles such as modification of H3R17me2a (Fig. 3C, Supplementary Fig. 3F and 3I). We have added these new data and additional explanation in the revised manuscript (page 11).

      (5) Several of the genome browser snapshots, particularly scale and genome coordinates, are difficult to read. 

      We apologize for the difficulty in reading several of the genome browser snapshots in the original submission. We have re-generated the relevant figures using IGV, which provides clearer visualization of scale and genome coordinates. The previous images have been replaced with the improved versions in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors need to elaborate on what this sentence means, as it is very unclear what they are describing about Rhino residency: "The results show that Rhino in OSCs tends to reside in the genome where Rhino binds locally in the ovary (Fig. 1C)." 

      We apologize for the lack of clarity in the original sentence. The text has been revised as follows:

      “Rhino expressed in OSCs bound predominantly to genomic sites exhibiting sharp and interspersed Rhino localization patterns in the ovary, while showing little localization within broad Rhino domains, including major piRNA clusters.”

      In addition, to clarify the behavior of Rhino at broad domains, we have added the phrase “the terminal regions of broad domains, such as major piRNA clusters” to the subsequent sentence.

      (2) The red correlation line is very confusing in Figure 5F. What sort of line does this mean in this scatter plot? 

      We apologize for the lack of clarity regarding the red line in Fig. 5F. The red line represents the least-squares linear regression fit to the data points, calculated using the lm() function in R, and was added with abline() to illustrate the correlation between ctrl GLKD and DART4 GLKD values. In the revised figure, we have clarified this in the legend by specifying that it is a regression line.
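
      For readers unfamiliar with the fit described above, the following minimal Python sketch shows what an ordinary least-squares regression line computes. It is an illustration only, not the authors' code (they report using lm() and abline() in R); the function name and the example values are hypothetical.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values standing in for ctrl GLKD (x) and DART4 GLKD (y) signals.
ctrl = [0.0, 1.0, 2.0, 3.0]
dart4 = [1.0, 3.0, 5.0, 7.0]
intercept, slope = least_squares_line(ctrl, dart4)
print(intercept, slope)
```

      In a scatter plot of this kind, a slope below 1 would indicate that values tend to be lower in the knockdown than in the control, although the fit by itself says nothing about statistical significance.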

      (3) There is no confirmation of the successful knockdown of the various DARTs in the OSCs.

      We thank the reviewer for the comment. The knockdown efficiency of the various DARTs in OSCs was confirmed by RT–qPCR. The data are now shown in Supplementary Fig. 3J. 

      (4) What is the purpose of an unnumbered "Method Figure" in the supplementary data file? Why not just give it a number and mention it properly in the text? 

      We thank the reviewer for the suggestion. We have now assigned a number to the previously unnumbered "Method Figure" and have included it as Supplementary Fig. 9.

      The figure is now properly cited in the Methods section.

      (5) For Figure 5A, those fly strain numbers in the labels are better reserved in the Methods, and a more appropriate label is to describe the GAL4 driver and the UAS-RNAi construct by their conventional names.

      We thank the reviewer for the suggestion. The labels in Fig. 5A have been updated to use the conventional names of the GAL4 drivers and UAS-RNAi constructs. Specifically, they now read Ctrl GLKD (nos-GAL4 > UAS-emp) and DART4 GLKD (nos-GAL4 > UAS-DART4). The original fly strain numbers are listed in the Methods section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study presents the potentially interesting concept that LRRK2 regulates cellular BMP levels and their release via extracellular vesicles, with GCase activity further modulating this process in mutant LRRK2-expressing cells. However, the evidence supporting the conclusions remains incomplete, and certain statistical analyses are inadequate. This work would be of interest to cell biologists working on Parkinson's disease.

      Reviewer #1 (Public review):

      Summary:

      Even though mutations in LRRK2 and GBA1 (which encodes the protein GCase) increase the risk of developing Parkinson's disease (PD), the specific mechanisms driving neurodegeneration remain unclear. Given their known roles in lysosomal function, the authors investigate how LRRK2 and GCase activity influence the exocytosis of the lysosomal lipid BMP via extracellular vesicles (EVs). They use fibroblasts carrying the PD-associated LRRK2-R1441G mutation and pharmacologically modulate LRRK2 and GCase activity.

      Strengths:

      The authors examine both proteins at endogenous levels, using MEFs instead of cancer cells. The study's scope is potentially interesting and could yield relevant insights into PD disease mechanisms.

      Weaknesses:

      Many of the authors' conclusions are overstated and not sufficiently supported by the data. Several statistical errors undermine their claims. Pharmacological treatment is very long, leading to potential off-target effects. Additionally, the authors should be more rigorous when using EV markers.

      We thank the reviewer for these valuable observations. In the revised manuscript, we have addressed each of these points as follows:

      (1) Conclusions and data support – We carefully revised our text throughout the manuscript to ensure that all conclusions are better supported by the presented data. For instance, we now explicitly state that while pharmacological modulation supports the regulatory role of LRRK2 activity in EV-mediated BMP release, we have softened our conclusions concerning the contribution of GCase in this model (see revised Results and Discussion sections).

      (2) Statistical analyses – We reanalyzed experiments involving more than two groups and replaced simple t-tests with non-parametric Kruskal-Wallis tests followed by Dunn’s post hoc comparisons. This approach, described in the updated figure legends (e.g., Figure 2D-F and H-J), provides a more rigorous statistical framework that accounts for small sample sizes and variability typical of EV quantifications.

      (3) Pharmacological treatment duration – Prolonged MLi-2 treatments have been extensively used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have applied long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202). In our study, 48-hour incubations were necessary to sustain full LRRK2 inhibition throughout the extracellular vesicle (EV) collection period. EV biogenesis, BMP biosynthesis, and packaging into EVs are time-dependent processes; therefore, extended incubation and collection periods (48 h) were required to allow downstream effects of LRRK2 inhibition on BMP production and release to manifest, and to obtain sufficient EV material for biochemical and lipidomic analyses. This experimental design also reflects our and others’ previous observations in humans and non-human primates, where urinary BMP changes are associated with chronic or subchronic LRRK2 inhibitor treatment (Baptista MAS, Merchant K, et al. Sci Transl Med. 2020, 12:eaav0820; Jennings D, et al. Sci Transl Med. 2022, 14:eabj2658; Maloney MT, et al. Mol Neurodegener. 2025, 20:89). Importantly, under these conditions, we did not observe significant changes in cell viability or morphology, indicating that the treatment was well tolerated. We have clarified this rationale in the revised Methods section to emphasize that the prolonged incubation reflects the experimental design for EV isolation rather than a requirement for achieving LRRK2 inhibition.

      (4) EV markers – We and others have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022). Moreover, LAMP proteins have been reported to be more enriched in EVs of endolysosomal origin (Mathieu et al., 2021). To further strengthen this point, we performed new experiments using a CD63-pHluorin sensor combined with TIRF microscopy, which allowed real-time visualization of CD63-positive exosome release. These new data (now presented in Figure 7, Panels G-I; Videos 1 and 2) confirm increased CD63-positive EV release in LRRK2 mutant fibroblasts, which was reversed by LRRK2 inhibition with MLi-2. The CD63-positive compartment was also largely BMP-positive (new Figure 7D, F, G), reinforcing our conclusions and providing additional rigor in EV marker validation.
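
      As a hedged illustration of the statistical approach named in point (2) above, the following Python sketch computes the Kruskal-Wallis H statistic with the standard tie correction. It is not the authors' analysis code: the function names and example values are hypothetical, Dunn's post hoc step is omitted, and the closed-form p-value applies only to three groups (two degrees of freedom) under the chi-square approximation, which is itself rough for very small samples.

```python
from itertools import chain
from math import exp


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n), where t is each tie-group size.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


def p_value_three_groups(h):
    """Chi-square survival function for df = 2 only (valid for three groups)."""
    return exp(-h / 2)


# Hypothetical EV measurements for three conditions (arbitrary units).
wt = [1.0, 2.0, 3.0]
mutant = [4.0, 5.0, 6.0]
mutant_treated = [7.0, 8.0, 9.0]
h = kruskal_wallis_h([wt, mutant, mutant_treated])
print(h, p_value_three_groups(h))
```

      Because the test works on ranks rather than raw values, it does not assume normality, which is the usual motivation for preferring it over repeated t-tests when comparing more than two small groups.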

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors used MEFs expressing the R1441G mutant of leucine-rich repeat kinase 2 (LRRK2), a mutant associated with the early onset of Parkinson's disease. They report that in these cells LAMP2 fluorescence is higher but BMP fluorescence is lower, MVE size is reduced, and that MVEs contain less ILVs. They also report that LAMP2-positive EVs are increased in mutant cells in a process sensitive to LRRK2 kinase inhibition but are further increased by glucocerebrosidase (GCase) inhibition, and that total di-22:6-BMP and total di-18:1-BMP are increased in mutant LRRK2 MEFs compared to WT cells by mass spectrometry. They also report that LRRK2 kinase inhibition partially restores cellular BMP levels, and that GCase inhibition further increases BMP levels, and that in EVs from the LRRK2 mutant, LRRK2 inhibition decreases BMP while GCase inhibition has the opposite effect. Moreover, they report that the BMP increase is not due to increased BMP synthesis, although the authors observe that CLN5 is increased in LRRK2 mutant cells. Finally, they report that GW4869 decreases EV release and exosomal BMP, while bafilomycin A1 increases EV release. They conclude that LRRK2 regulates BMP levels (in cells) and release (via EVs). They also conclude that the process is modulated by GCase in LRRK2 mutant cells, and that these studies may contribute to the use of BMP-positive EVs as a biomarker for Parkinson's disease and associated treatments.

      Strengths:

      This is an interesting paper, which provides novel insights into the biogenesis of exosomes with exciting biomedical potential. However, I have comments that authors need to address to clarify some aspects of their study.

      Weaknesses:

      (1) The intensity of LAMP2 staining is increased significantly in cells expressing the R1441G mutant of LRRK2 when compared to WT cells (Figure 1C). Yet mutant cells contain significantly smaller MVEs with fewer ILVs, and the MVE surface area is reduced (Figure 1D-F). This is quite surprising since LAMP2 is a major component of the limiting membrane of late endosomes. Are other proteins of endo-lysosomes (eg, LAMP1, CD63, RAB7) or markers (lysotracker) also decreased (see also below)?

      As referenced in our original manuscript, several previous studies have reported endolysosomal morphological and homeostatic defects in cells harboring pathogenic LRRK2 mutations. LAMP2 can be upregulated as part of a lysosomal biogenesis or stress response (e.g., via MiT/TFE transcription factors such as TFEB; Sardiello et al., Science 2009, 325:473-477), whereas ILV biogenesis is primarily controlled by ESCRT- and SMPD3-dependent pathways that are regulated independently of MiT/TFE-driven transcriptional programs. Indeed, Stuffers et al. (Traffic 2009, 10:925-937) demonstrated that depletion of key ESCRT subunits markedly inhibited ILV formation while concomitantly increasing LAMP2 expression, highlighting the mechanistic dissociation between LAMP2 abundance and ILV number. In our study, we observed a similar pattern in R1441G LRRK2 MEFs, in which elevated LAMP2 staining and protein levels occurred despite a reduction in MVE size and ILV number. We interpret this as a compensatory lysosomal biogenesis response.

      Our revised manuscript now includes new immunofluorescence data for BMP, LAMP1 and CD63 (New Figure 7, Panels A-F) together with biochemical analysis of CD63 protein levels (New Supplemental Figure 4, Panel B) in human skin fibroblasts derived from healthy donors and LRRK2 G2019S PD patients. Quantitative analysis of these experiments revealed no statistically significant differences in total cellular levels of either LAMP1 or CD63 between groups. However, we observed a consistent decrease in BMP immunostaining intensity (New Figure 7, Panel A and B), in agreement with our findings in mouse fibroblasts. We therefore propose that the elevated LAMP2 expression observed in the engineered MEF clone expressing R1441G may reflect a cell type-specific effect, potentially linked to differential penetrance of LRRK2 signaling on the lysosomal biogenesis response. We have updated the Results and Discussion section of the manuscript to incorporate and clarify these findings.

      (2) LRRK2 has been reported to interact with endolysosomal membranes. Does the R1441G mutant bind LAMP2- and/or BMP-positive membranes? 

      We agree that LRRK2 has been reported to associate dynamically with endolysosomal membranes, particularly under conditions of endolysosomal stress or damage (Eguchi T, et al. PNAS 2018, 115:E9115-E9124; Bonet-Ponce L, et al. Sci Adv. 2020, 6:eabb2454; Wang X, et al. Elife. 2023, 12:e87255).

      Nevertheless, to explore whether LRRK2 associates with BMP-positive endolysosomes, we performed subcellular fractionation followed by biochemical analysis of endolysosomal fractions, since our available LRRK2 antibodies did not provide reliable immunofluorescence signals. These experiments were carried out using human skin fibroblasts derived from both healthy controls and Parkinson’s disease patients carrying the LRRK2-G2019S mutation. In both control and mutant fibroblasts, a pool of LRRK2 was detected in fractions positive for the BMP synthase CLN5 and the endolysosomal marker CD63 (New Supplementary Figure 4, Panel A), supporting the localization of LRRK2 to endolysosomal membranes that are likely BMP-enriched. Our manuscript’s Results and Methods sections have been updated accordingly.

      Does the mutant affect endolysosomes?

      As referenced in our original manuscript, several studies have reported that pathogenic LRRK2 mutations can lead to endolysosomal defects. Consistent with these reports, we also observed morphological alterations in endolysosomes of cells expressing mutant LRRK2, including reduced MVE size and fewer ILVs, as shown in Figure 1D–F. These observations are in agreement with previously described phenotypes associated with pathogenic LRRK2 variants. Furthermore, in mutant LRRK2 MEFs, and now in human-derived fibroblasts (see new Figure 7, Panel A and B), we observed a decrease in BMP immunostaining signal.

      (3) Immunofluorescence data indicate that BMP is decreased in mutant LRRK2-expressing cells compared to WT (Figure 1A-B), but mass spec data indicate that di-22:6-BMP and di-18:1-BMP are increased (Figure 3). Authors conclude that the BMP pool detected by mass spec in mutant cells is less antibody-accessible than that present in wt cells, or that the anti-BMP antibody is less specific and that it detects other analytes. This is an awkward conclusion, since the IF signal with the antibody is lower (not higher): why would the antibody be less specific? Could it be that the antibody does not see all BMP isoforms equally well? Moreover, the observations that mutant cells contain smaller MVEs (Figure 1D-F) with fewer ILVs are consistent with the IF data and reduced BMP amounts. This needs to be clarified.

      As previously reported by us (Lu et al., J Cell Biol 2022;221:e202105060) and others (Berg AL, et al. Cancer Lett. 2023, 557:216090), discrepancies can occur between BMP levels detected by immunofluorescence and those quantified by mass spectrometry. This is because immunostaining reflects the pool of antibody-accessible BMP, whereas lipidomics measures the total cellular content of all BMP molecular species, irrespective of their distribution or accessibility.

      We agree that the anti-BMP antibody may not detect all BMP isoforms equally well. Differences in acyl chain composition (such as the degree of saturation or chain length) can alter the stereochemistry of BMP and, consequently, epitope accessibility to antibody binding.

      In addition, in a personal communication with Monther Abu-Remaileh (Stanford University), we were informed that the antibody may also cross-react with other lipid species in endolysosomes. Nevertheless, since there is no formal evidence supporting this, we have removed the sentence in the Discussion section stating “Alternatively, the antibody may also detect non-BMP analytes” to avoid any potential misinterpretations. In its place, we have added a short statement noting that “not all BMP isoforms may be detected equally well”.

      Mass spectrometry data are only shown for two BMP species (di-22:6, di-18:1). What are the major BMP isoforms in WT cells? The authors should show the complete analysis for all BMP species if they wish to draw quantitative conclusions about the amounts of BMP in wt and mutant cells. Finally, BMP and PG are isobaric lipids. Fragmentation of BMPs or PGs results in characteristic fingerprints, but the presence of each daughter ion is not absolutely specific for either lipid. This should be clarified, e.g., were BMP and PG separated before mass spec analysis? Was PG affected? The authors should also compare the BMP data with mass spec data obtained with a control lipid, e.g., PC.

      Regarding BMP isoforms, our targeted UPLC-MS/MS analyses revealed that 2,2′-di-22:6-BMP (sn2/sn2′) and 2,2′-di-18:1-BMP (sn2/sn2′) are the predominant BMP isoforms in MEF cells, consistent with previous reports showing docosahexaenoyl (22:6; DHA) and oleoyl (18:1) BMP as the most abundant isoforms. Across diverse mammalian cells and tissues, BMP typically exhibits a fatty acid composition dominated by oleoyl, with polyunsaturated fatty acids (particularly DHA) also contributing substantially. Enrichment of DHA-containing BMP species has been observed in multiple systems, including rat uterine stromal cells, PC12 cells, THP-1 and RAW macrophages, as well as in rat and human liver. This consistent presence of oleoyl- and docosahexaenoyl-containing BMP species across tissues indicates that these acyl chains are conserved features influencing the lipid’s structural and functional characteristics (Kobayashi et al. J Biol Chem, 2002; Hullin-Matsuda et al. Prostaglandins Leukot Essent Fatty Acids, 2009; Thompson et al. Int J Toxicol. 2012; Delton-Vandenbroucke et al. J Lipid Res, 2019).

      Nevertheless, we have included a Table (Panel H in updated Supplemental Figure 1) showing other BMP species that were also detected in our lipidomics analysis. Overall, dioleoyl (18:1)- and di-docosahexaenoyl (22:6)-BMP species were the most abundant in MEF cells, whereas di-arachidonoyl (20:4)- and di-linoleoyl (18:2)-BMP isoforms were present at lower levels. Consistently, R1441G LRRK2 MEFs displayed higher levels of dioleoyl- and di-docosahexaenoyl-BMP compared with WT cells, and these elevations were reduced following LRRK2 kinase inhibition with MLi-2. Data from three independent representative experiments are shown, and the manuscript has been revised accordingly to include these results.

      Regarding the separation of BMP and PG species, we confirm that BMP and PG were chromatographically resolved prior to MS/MS detection using a validated UPLC-MS/MS method developed by Nextcea, Inc. PG exhibits a substantially longer LC retention time than BMP, ensuring complete baseline separation. This approach (established by Nextcea nearly two decades ago and later validated through a multi-year collaboration with the U.S. FDA to clinically qualify di-22:6-BMP as a biomarker) prevents any ambiguity arising from the isobaric nature of BMP and PG species. No changes in PG levels were detected under any experimental conditions.

      Finally, we employed isotope-labeled BMP as an internal standard to ensure robust normalization across samples. These additional details and references cited above have been included in the revised Methods and References sections to further clarify the analytical rigor of our lipidomics workflow.

      (4) It is quite surprising that the amounts of labeled BMP continue to increase for up to 24h after a short 25min pulse with heavy BMP precursors (Figure 4B).

      In these isotope-labeling experiments, it is important to note (as described in our original manuscript) that two distinct pools of metabolically labeled BMP species were detected: semi-labeled BMP (with only one heavy isotope-labeled fatty acyl chain) and fully-labeled BMP (with both fatty acyl chains labeled). We consider the fully-labeled BMP pool to provide the most reliable readout for BMP turnover, as it showed a rapid decline after a 1h chase (decreasing by more than 50% within 8 h in all conditions), reaching its lowest levels at the end of the 48-h chase period.

      The apparent increase in semi-labeled BMP species over time may be explained by continued incorporation of labeled precursors following the initial pulse. Specifically, once existing semi-labeled and fully-labeled BMP molecules are degraded by PLA2G15 (Nyame K, et al. Nature 2025, 642:474-483), the resulting isotope-labeled lysophosphatidylglycerol (LPG) and fatty acids could be recycled and re-enter a new round of BMP biosynthesis, leading to a gradual accumulation of semi-labeled BMP such as di-18:1-BMP. Why would this reasoning not also apply to the fully-labeled species? Once the pulse is completed, newly incorporated non-labeled fatty acyl chains present in the cellular pool can compete with labeled ones during subsequent rounds of lipid remodeling or synthesis. As a result, the probability of generating semi-labeled BMP molecules becomes higher than that of forming fully-labeled species. Consistent with this, our data show an increase in only semi-labeled BMP species (but not in fully-labeled ones) up to 24 hours after the pulse. We have added a clarification regarding this point in the revised manuscript.
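
The competition argument above can be illustrated with a back-of-the-envelope sketch. It rests on a simplifying assumption that is not part of the measured data: each of the two acyl chains of a newly made BMP molecule is drawn independently from a fatty acyl pool in which a fraction `p_labeled` carries the heavy isotope. The function name and the example fractions are hypothetical.

```python
def bmp_label_fractions(p_labeled):
    """Expected fractions of BMP with 0, 1, or 2 labeled acyl chains.

    Assumption (illustrative only): the two acyl chains are incorporated
    independently from a pool whose labeled fraction is p_labeled.
    """
    if not 0.0 <= p_labeled <= 1.0:
        raise ValueError("p_labeled must lie between 0 and 1")
    q = 1.0 - p_labeled
    return {
        "unlabeled": q * q,           # both chains unlabeled
        "semi": 2.0 * p_labeled * q,  # exactly one labeled chain
        "fully": p_labeled * p_labeled,  # both chains labeled
    }


# Hypothetical labeled fractions of the acyl pool, falling as the chase proceeds.
for p in (0.8, 0.5, 0.2, 0.05):
    f = bmp_label_fractions(p)
    print(f"p={p:.2f}  semi={f['semi']:.3f}  fully={f['fully']:.3f}")
```

Under this assumption, semi-labeled BMP becomes the more probable product as soon as the labeled fraction of the pool drops below two thirds, and the gap widens as unlabeled acyl chains dilute the pool during the chase.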

      (5) It is argued that upregulation of CLN5 may be due to an overall upregulation of lysosomal enzymes, as LAMP2 levels were also increased (Figure 2A, C, E). Again, this is not consistent with the observed decrease in MVE size and number (Figure 1D-F). As mentioned above, other independent markers of endo-lysosomes should be analyzed (e.g., LAMP1, CD63, RAB7), and/or other lysosomal enzymes (e.g., cathepsin D).

      Our revised manuscript now includes new immunofluorescence data for BMP, LAMP1 and CD63 (New Figure 7, Panels A-F) together with biochemical analysis of CD63 protein levels (New Supplemental Figure 4, Panel B) in human skin fibroblasts derived from healthy controls and LRRK2 G2019S PD patients. Quantitative analysis of these experiments revealed no statistically significant differences in total cellular levels of either LAMP1 or CD63 between groups. However, our results consistently show increased CLN5 protein levels in both mouse and human fibroblast cell lines harboring pathogenic LRRK2 mutations. Upregulation of CLN5 may reflect a compensatory effect from loss of BMP via EV exocytosis. As discussed above, the elevated LAMP2 signal observed in the engineered MEF clone expressing R1441G could represent a cell type-specific effect, potentially linked to differential penetrance of LRRK2 signaling on the lysosomal biogenesis response. Our Results and Discussion sections have been updated accordingly.

      (6) The authors report that the increase in BMP is not due to an increase in BMP synthesis (Figure 4), although they observe a significant increase in CLN5 (Figure 5A) in LRRK2 mutant cells. Some clarification is needed.

      In our original manuscript, we proposed that although CLN5 protein levels are increased in R1441G LRRK2 MEFs, the absence of significant changes in BMP synthesis rates (Figure 4B, C) may reflect either limited substrate availability or that CLN5 is already operating near its maximal enzymatic capacity. Our new subcellular fractionation data (new Figure 7, Panel A) further indicate that, despite a relative increase in total CLN5 levels in G2019S LRRK2 human fibroblasts, the amount of CLN5 associated with endolysosomes remains comparable between mutant LRRK2 and control cells. This suggests that a considerable fraction of upregulated CLN5 may not localize to endolysosomes, potentially accumulating in the endoplasmic reticulum due to enhanced translation or impaired trafficking. Unfortunately, the available anti-CLN5 antibody did not yield reliable immunofluorescence signals, preventing us from directly confirming this possibility. Nevertheless, in light of our new data (new Supplemental Figure 4A), we have included a clarification in the revised manuscript discussing this possibility as well.

      (7) Authors observe that both LAMP2 and BMP are decreased in EVs by GW4869 and increased by bafilomycin (Figure 6). Given my comments above on Figure 1, it would also be nice to illustrate/quantify the effects of these compounds on cells by immunofluorescence.

      We appreciate the reviewer’s suggestion. We have previously published immunofluorescence data showing increased BMP accumulation in endolysosomes following treatment with bafilomycin A1 (Lu A, et al. J Cell Biol. 2009, 184:863-879). However, in the present study, our lipidomics analyses revealed a decrease in both di-22:6-BMP and di-18:1-BMP species in cells treated with this compound. As discussed above, this apparent discrepancy likely reflects methodological differences between immunofluorescence, which detects only antibody-accessible BMP pools, and lipidomics, which quantifies total cellular BMP content.

      Moreover, in a recent study (Andreu Z, et al. Nanotheranostics 2023, 7:1-21), BMP levels were analyzed by immunofluorescence in cells treated with spiroepoxide, a potent and selective irreversible inhibitor of nSMase (different from GW4869) known to block EV release. Spiroepoxide-treated cells showed decreased BMP immunostaining, a result that, again, does not align with mass spectrometry data revealing increased cellular BMP levels upon GW4869 treatment. Notably, in that study, spiroepoxide was used instead of GW4869 because the intrinsic autofluorescence of GW4869 could potentially interfere with the immunofluorescence BMP signal.

      We therefore consider lipidomics measurements to provide a more reliable and quantitative representation of BMP dynamics under these conditions.

      Reviewer #1 (Recommendations for the authors):

      Major concerns:

      (1) 48 h for MLi2 treatment seems too long. LRRK2 kinase activity is inhibited with much shorter incubation times. The longer the incubation, the more likely off-target effects are. The authors should repeat these experiments with 1-2 h of MLi2.

      We thank the reviewer for this valuable comment. We acknowledge that MLi-2 is a potent and selective LRRK2 kinase inhibitor that achieves near-complete target engagement within a few hours of treatment. However, prolonged exposure has been widely used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have employed long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202).

      In our study, 48-hour incubations were necessary to sustain full LRRK2 inhibition throughout the extracellular vesicle (EV) collection period. EV biogenesis, BMP biosynthesis, and packaging into EVs are time-dependent processes; therefore, extended incubation and collection periods (48 h) were required to allow downstream effects of LRRK2 inhibition on BMP production and release to manifest, and to obtain sufficient EV material for biochemical and lipidomic analyses. This experimental design also reflects our and others’ previous observations in humans and non-human primates, where urinary BMP changes are associated with chronic or subchronic LRRK2 inhibitor treatment (Baptista MAS, Merchant K, et al. Sci Transl Med. 2020, 12:eaav0820; Jennings D, et al. Sci Transl Med. 2022, 14:eabj2658; Maloney MT, et al. Mol Neurodegener. 2025, 20:89). Importantly, under these conditions, we did not observe significant changes in cell viability or morphology, supporting that the treatment was well tolerated.

      We have clarified this rationale in the revised Methods section to emphasize that the prolonged incubation reflects the experimental design for EV isolation rather than a requirement for achieving LRRK2 inhibition.

      (2) Is there a reason why the authors don't include CD81, CD63, and Syntenin-1 in their study as an EV marker? Using solely Flotilin-1 does not seem to be enough to justify their claims.

      We actually used not only Flotillin-1 but also LAMP2 as EV markers in our study. While both Flotillin-1 and LAMP2 detection on EVs may vary depending on the cell type, we and others have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022). In particular, one of these studies reported that “LAMP1-positive subpopulations of EVs represent MVB/lysosome-derived exosomes, which also contain syntenin-1.” Therefore, our choice of EV markers (LAMP2 and Flotillin-1) is consistent with those previously and reliably used to characterize small EVs.

      Nevertheless, to further address the reviewer’s concern, we performed additional experiments using a CD63-based fluorescence sensor (CD63-pHluorin), which, combined with TIRF microscopy, enables real-time visualization of CD63-positive exosome release. These experiments were conducted in control and LRRK2-mutant fibroblasts, and the data are presented in new Figure 7 (Panels G-I; Videos 1 and 2). We have also included all relevant references and clarified this point in the revised manuscript.

      (3) Indeed, to quantify the amount of certain proteins in EVs, the authors should normalize them by CD63 or CD81.

      Protein normalization in isolated EV fractions is indeed challenging. Although tetraspanins such as CD63 and CD81 are commonly enriched in EVs, their abundance can vary considerably across EV subpopulations, cell types, and experimental conditions, making them unreliable as universal normalization markers (Théry et al., J Extracell Vesicles, 2018; Margolis & Sadovsky, Nat Rev Mol Cell Biol, 2019). Current guidelines from the International Society for Extracellular Vesicles (ISEV), as described in the Minimal Information for Studies of Extracellular Vesicles 2018 (MISEV2018; Théry C, et al. J Extracell Vesicles. 2018, 7:1535750) and updated in MISEV2024 (Welsh JA, et al. J Extracell Vesicles. 2024, 13:e12404), recommend reporting multiple EV markers rather than relying on a single protein for normalization. They also suggest ensuring comparable experimental conditions by using the same number of cells at the start of the experiment and normalizing EV data to cell number or whole-cell lysate protein content at the end of the experiment, among other approaches.

      In our study, we normalized EV data to whole-cell lysate (WCL) protein content, as this approach accounts for differences in EV production due to variations in cell number or treatment conditions and is commonly used in the field (Kowal et al., PNAS, 2016; Mathieu et al., Nat Commun, 2021). We also included Flotillin-1 and LAMP2 as EV markers, both of which have been validated as molecular markers of small EV subpopulations.

      (4) Hyper normalization in WB quantification in Figure 2E-G is statistically incorrect, as it assumes that one group (in this case, R1441G ctrl) has no variability at all, which is not biologically possible. The authors should repeat the quantification without hypernormalizing one of their groups. This issue is prevalent across the whole manuscript.

      We understand the concern regarding “hyper-normalization” (i.e., expressing all values relative to one condition set to 1), which may mask variability in the reference group. However, it is standard practice in immunoblotting analysis to express data relative to a control condition for comparison, as variations in membrane transfer, exposure time, and signal development can differ across blots. In our case, the data are expressed as relative levels (arbitrary units) rather than absolute quantitative values. To facilitate comparison between datasets and account for inter-experimental variation, we continued to express values relative to the mutant LRRK2 MEF condition.
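
To make the point under discussion concrete, the sketch below shows what expressing each blot relative to a reference condition does to the spread of the reference group. The band intensities are made-up numbers chosen for illustration, not data from this study, and the variable names are hypothetical.

```python
import statistics

# Hypothetical raw band intensities (arbitrary units) from three independent
# blots: (reference condition, treated condition). Not data from the study.
blots = [(100.0, 50.0), (200.0, 120.0), (150.0, 60.0)]

# Express every lane relative to the reference lane of the same blot.
ref_norm = [ref / ref for ref, _ in blots]
treated_norm = [treated / ref for ref, treated in blots]

print("reference:", ref_norm, "SD =", statistics.stdev(ref_norm))
print("treated:  ", treated_norm, "SD =", statistics.stdev(treated_norm))
```

After this transformation the reference group equals 1 on every blot, so its standard deviation is zero by construction, while blot-to-blot variability remains visible only in the other conditions.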

      On the other hand, in lipidomics experiments, despite using the same number of seeded cells and identical extraction and analysis protocols, minor biological and technical variability was observed across independent replicates. This variability is inherent to the experimental system and is now explicitly represented in the new table included in Supplemental Figure 1F, which compiles three independent representative lipidomics experiments showing quantitative BMP levels across different conditions.

      (5) The authors perform a t-test in Figure 2E-G when comparing more than 2 groups, which is wrong. The authors should use a two-way ANOVA as they are comparing genotype and treatment.

      We appreciate the reviewer’s comment and agree with this observation. The MLi-2 and CBE experiments were performed independently and in separate experimental runs; therefore, we have reanalyzed these datasets separately rather than combining them in a two-way ANOVA. To properly compare more than two groups within each dataset, we have now applied a Kruskal-Wallis test followed by an uncorrected Dunn’s post hoc test (Figure 2 D-F and H-J). This non-parametric approach is more appropriate for our data structure, as EV experiments are usually subject to high variability and immunoblot quantifications involving small sample sizes (n≈6) do not always meet the assumptions of normality or equal variance. The Kruskal-Wallis test does not assume normality or equal variances, making it more robust for small, variable biological datasets. The statistical analyses and figure legend have been updated in the revised manuscript accordingly.
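
For readers unfamiliar with this procedure, the following is a minimal, dependency-free sketch of a Kruskal-Wallis test followed by uncorrected Dunn's pairwise comparisons. It is an illustration of the general technique, not the software used for the manuscript; the function names, group labels, and values are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """groups: dict mapping group name -> list of measurements.

    Returns (H, pairwise), where pairwise maps (a, b) to
    (Dunn z statistic, uncorrected two-sided p-value).
    """
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Mean rank and size of each group, read back from the pooled ranking.
    mean_rank, size, start = {}, {}, 0
    for name in names:
        size[name] = len(groups[name])
        mean_rank[name] = sum(ranks[start:start + size[name]]) / size[name]
        start += size[name]

    # Tie term: sum of (t^3 - t) over every set of t identical values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_sum = sum(t ** 3 - t for t in counts.values())

    # Kruskal-Wallis H with the standard tie correction.
    h = 12 / (n_total * (n_total + 1)) * sum(
        size[g] * mean_rank[g] ** 2 for g in names
    ) - 3 * (n_total + 1)
    h /= 1 - tie_sum / (n_total ** 3 - n_total)

    # Dunn's z for each pair, with no multiple-comparison adjustment.
    base_var = n_total * (n_total + 1) / 12 - tie_sum / (12 * (n_total - 1))
    pairwise = {}
    for a, b in combinations(names, 2):
        se = math.sqrt(base_var * (1 / size[a] + 1 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        pairwise[(a, b)] = (z, math.erfc(abs(z) / math.sqrt(2)))
    return h, pairwise


# Hypothetical relative intensities (arbitrary units), three values per group.
groups = {
    "WT": [0.8, 1.0, 1.1],
    "R1441G": [2.1, 2.4, 1.9],
    "R1441G + MLi-2": [1.3, 1.5, 1.2],
}
h, pairwise = kruskal_dunn(groups)
# With three groups (2 degrees of freedom) the chi-square tail is exp(-H/2).
p_h = math.exp(-h / 2)
print(f"H = {h:.3f}, p = {p_h:.4f}")
for pair, (z, p) in pairwise.items():
    print(pair, f"z = {z:.3f}, uncorrected p = {p:.4f}")
```

Because the test operates on ranks, it makes no assumption about normality or equal variances; "uncorrected" means that each pairwise p-value is reported as is, without adjustment for the number of comparisons.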

      In addition, since our CBE treatments yielded statistically non-significant data, we have softened our conclusions throughout the manuscript concerning the contribution of GCase activity to EV-mediated BMP release modulation.

      (6) There is a very strong reduction in flotillin-1 in R1441G cells vs WT (Figure 2G) in the EV fraction. That reduction is further exacerbated with MLi2, which likely means it is not kinase activity dependent. Can the authors comment on that?

      We agree with the reviewer that Flotillin-1 showed a different behavior compared with LAMP2 in these experiments. As recommended by the MISEV guidelines (Théry C, et al. J Extracell Vesicles. 2018, 7:1535750; Welsh JA, et al. J Extracell Vesicles. 2024, 13:e12404), it is important to analyze more than one EV-associated protein marker. We examined LAMP2, which, together with LAMP1, has been reported to be specifically enriched in EVs of endolysosomal origin (exosomes; Mathieu et al., Nat Commun. 2021, 12:4389). In contrast, Flotillin-1 is also associated with small EVs but may represent a distinct EV subpopulation from those positive for LAMP proteins (Kowal J, et al. PNAS 2016, 113:E968-E977).

      Nevertheless, the biochemical analysis of isolated EV fractions was complemented by our lipidomics data and, in the revised version, by TIRF microscopy analysis of exosome release in control and G2019S LRRK2 human fibroblasts (new Figure 7, Panels G-I; Videos 1 and 2). In this analysis, we confirmed increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G). Collectively, these findings further support the regulatory role of LRRK2 activity in EV-mediated BMP secretion.

      (7) In Figure 2C, the authors should express that the LAMP2-EV and flotillin-1 EV fractions from the WB are highly exposed. As presently presented, it is slightly misleading.

      We thank the reviewer for this comment. In EV preparations, the amount of protein recovered is typically very low. Therefore, although we loaded all the EV protein obtained from each sample, the immunoblots for LAMP2 and Flotillin-1 in EV fractions required longer exposure times to visualize clear signals across all conditions. We have now indicated in the corresponding figure legend that these EV blots are long-exposure blots to facilitate signal detection and avoid any potential misunderstanding.

      (8) If Figure 2C and D are from two different experiments, they should not be plotted together in Figure 2E-G. You cannot compare the effect of MLi2 vs CBE if done in completely different experiments.

      We appreciate the reviewer’s comment and agree with this observation. The MLi-2 and CBE experiments were performed independently and in separate experimental runs; therefore, we have reanalyzed these datasets separately rather than combining them in a two-way ANOVA. To properly compare more than two groups within each dataset, we have now applied a Kruskal-Wallis test followed by an uncorrected Dunn’s post hoc test (Figure 2 D-F and H-J). This non-parametric approach is more appropriate for our data structure, as EV experiments are usually subject to high variability and immunoblot quantifications involving small sample sizes (n≈6) do not always meet the assumptions of normality or equal variance. The Kruskal-Wallis test does not assume normality or equal variances, making it more robust for small, variable biological datasets. The revised statistical analyses and figure legends have been updated accordingly in the manuscript.

      (9) The authors state that "For the R1441G MEF cells, MLi-2 decreased EV concentration while CBE increased EV particles per ml, in agreement with the effects observed in our biochemical analysis." As Figure S1D shows no statistical significance, the authors don't have sufficient evidence to make this claim.

      We apologize for this overstatement. We have revised the text to clarify that, although the differences did not reach statistical significance, a consistent trend toward decreased EV concentration upon MLi-2 treatment and increased EV release following CBE treatment was observed in R1441G MEF cells.

      (10) "Altogether, given that BMP is specifically enriched in ILVs (which become exosomes upon release), the data presented above support our biochemical analysis (Figure 2C, D, F) and suggest a role for LRRK2 and GCase in modulating BMP release in association with LAMP2-positive exosomes from MEF cells." As Figure 3E shows no statistical difference of BMP on EVs upon CBE treatment, this sentence is not accurate and should be reframed. Furthermore, the authors claim an increase in EV-LAMP2 in R1441G cells compared to WT, however, the amount of BMP in EVs of R1441G cells vs WT is unchanged with a non-significant reduction. This contradiction does not support the authors' conclusions and really puts into question their whole model.

      We thank the reviewer for this observation. After reanalyzing our biochemical data from isolated EV fractions (see new Panels D-F and H-J) using an improved statistical approach, we found that although EV-associated LAMP2 levels were consistently elevated in untreated R1441G LRRK2 MEFs compared to WT cells, CBE treatment only produced a non-significant trend toward increased EV-associated LAMP2 compared to untreated R1441G LRRK2 cells. Accordingly, we have revised the sentence to read as follows:

      “Altogether, given that BMP is specifically enriched in ILVs (which become exosomes upon release), the data presented above support our biochemical analysis (Figure 2C, E, G, I) and suggest that LRRK2 activity regulates BMP release in association with LAMP2-positive exosomes, whereas GCase activity appears to have a more variable effect under the tested conditions.”

      We also agree with the reviewer that, in our MEF model, the amount of BMP in EVs of R1441G cells vs WT is unchanged with a non-significant reduction. However, pharmacological modulation supports our conclusion that BMP release is modulated by LRRK2 activity. Specifically, treatment with the LRRK2 inhibitor MLi-2 decreased EV-associated BMP and LAMP2 levels in R1441G LRRK2 MEFs, and our new data (new Figure 7, Panels G-I; Videos 1 and 2) show increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G).

      In light of the reviewer’s comment about CBE treatment, we have softened our conclusions throughout the manuscript concerning the contribution of GCase activity in this model.

      (11) In Figure 5, 16 h of MLi2 treatment is too long and can lead to off-target effects. I would advise reducing it to 1-4 h.

      Prolonged MLi-2 treatments have been extensively used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have applied long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202). Moreover, the data presented in Figure 5 demonstrate a reduction in CLN5 protein levels in both MEFs and human fibroblasts following MLi-2 treatment, confirming the specificity of the observed effects in LRRK2 mutant cells.

      (12) "Our data suggest that BMP is exocytosed in association with EVs and that LRRK2 and GCase activities modulate BMP secretion." Again, cells carrying the R1441G mutation have the same amount of BMP in EVs as WT cells. This sentence is not factually accurate. Accordingly, CBE did not change the amount of BMP in EVs.

      We thank the reviewer for this observation and agree that, in our MEF model, the amount of BMP in EVs from R1441G LRRK2 cells is comparable to that observed in WT cells. However, pharmacological modulation supports our conclusion that BMP release is modulated by LRRK2 activity. Specifically, treatment with the LRRK2 inhibitor MLi-2 decreased EV-associated BMP levels in R1441G LRRK2 MEFs, and our new data (new Figure 7G-I; Videos 1 and 2) show increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G). These findings further support the regulatory role of LRRK2 activity in EV-mediated BMP secretion. In addition, in light of the reviewer’s comment about CBE treatment, we have softened our conclusions throughout the paper concerning the contribution of GCase activity in this model.

      (13) Figure 6; EV release should have been monitored by more accurate markers such as CD63 and CD81.

      We thank the reviewer for this comment. We and others (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022) have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions. In particular, one of these studies (Mathieu et al., Nat Commun. 2021), in which bafilomycin A1 was also used (to boost exosome release), reported that “LAMP1-positive subpopulations of EVs represent MVB/lysosome-derived exosomes, which also contain syntenin-1.” Altogether, our choice of EV markers (LAMP2 and Flotillin-1) is consistent with those previously and accurately used to characterize EVs. We have now included all relevant references in the revised manuscript to further clarify this point.

      (14) Figure 6 suggests that exosomal BMP is controlled by EV release. I would think that is rather obvious.

      We agree that the finding that exosomal BMP release is influenced by EV secretion may appear “obvious.” However, our intention in Figure 6 was to provide direct experimental evidence confirming this relationship using pharmacological modulators of EV release. Specifically, inhibition of EV secretion with GW4869 reduced exosomal BMP levels, whereas stimulation with bafilomycin A1 increased them. These data were important to establish a causal link between EV trafficking and BMP export, thereby validating our model and supporting the interpretation that LRRK2 regulates BMP homeostasis through EV-mediated exocytosis, which is further modulated, to some extent, by GCase activity. 

      Minor concerns:

      (1) Figure 1: Change colors to be color blind friendly.

      We thank the reviewer for this helpful suggestion. We have adjusted the colors in Figure 1 to be color-blind friendly. In addition, we have applied the same color-blind friendly palette to the new immunofluorescence data presented in new Figure 7, Panel A and D.

      (2) More consistency on "Xmin" vs "X min" would be appreciated.

      We thank the reviewer for this observation. We have revised the manuscript to ensure consistent formatting of time indications throughout the text and figures, using the standardized format “X min.”

      Reviewer #2 (Recommendations for the authors):

      (1)  Figure 2C-D. Were equal amounts of protein loaded in each lane?

      Equal protein amounts were loaded in lanes corresponding to whole-cell lysate (WCL) fractions and normalized based on α-Tubulin levels.

      For the extracellular vesicle (EV) fractions, all protein recovered from EV pellets after isolation was loaded. In all EV-related experiments, we seeded the same number of EV-producing cells per condition, and the resulting EV-derived data (from both immunoblotting and lipidomics analyses) were normalized to the corresponding whole-cell lysate (WCL) protein content to ensure comparability across conditions.

      All these technical details have been included in the Materials section of our revised manuscript.

      (2) The authors refer to the papers of Medoh et al (ref 43) and Singh et al. (44) for the key role of CLN5 in the BMP biosynthetic pathway. However, Medoh et al reported that CLN5 is the lysosomal BMP synthase. In contrast, Singh et al. reported that PLD3 and PLD4 mediate the synthesis of SS-BMP, and did not find any role for CLN5. 

      To avoid any confusion or misinterpretation of our findings regarding CLN5 and given that we do not analyze PLD3 or PLD4 in our study, we have decided to replace the reference to Singh et al. with Bulfon D. et al. (Nat. Commun. 2024, 15:9937) instead. This last work, conducted by an independent group distinct from the one that originally described CLN5, also validated CLN5 as the sole BMP synthase in cells.

      Also, authors mention that bafilomycin A1 (B-A1) dramatically boosts EV exocytosis, referring to Kowal et al., 2016 (ref 35) and Lu et al., 2018 (ref 45). However, this is not shown in Kowal et al.

      We thank the reviewer for pointing out this mistake. We apologize for the incorrect citation and have now corrected the reference. The statement regarding the effect of bafilomycin A1 on EV exocytosis now appropriately refers to Mathieu et al., 2021 and Lu et al., 2018.

      (3) Page 7, it is stated that "No statistically significant differences in intracellular BMP levels were observed in WT LRRK2 MEFs upon LRRK2 or GCase inhibition(Supplemental Figure 1D, E)". The authors probably mean "Supplemental Figure 1F, G"

      We thank the reviewer for noting this error. We have corrected the text to refer to panels F and G of Supplemental Figure 1, which correspond to the relevant data. We have also revised the reference to panel I of Supplemental Figure 1 accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) I have to admit that it took a few hours of intense work to understand this paper and to even figure out where the authors were coming from. The problem setting, nomenclature, and simulation methods presented in this paper do not conform to the notation common in the field, are often contradictory, and are usually hard to understand. Most importantly, the problem that the paper is trying to solve seems to me to be quite specific to the particular memory study in question, and is very different from the normal setting of model-comparative RSA that I (and I think other readers) may be more familiar with.

      We have revised the paper for clarity at all levels: motivation, application, and parameterization. We clarify that there is a large unmet need for using RSA in a trial-wise manner, and that this approach indeed offers benefits to any team interested in decoding trial-wise representational information linked to behavioral responses, and as such is not a problem specific to a single memory study.

      (2) The definition of "classical RSA" that the authors are using is very narrow. The group around Niko Kriegeskorte has developed RSA over the last 10 years, addressing many of the perceived limitations of the technique. For example, cross-validated distance measures (Walther et al. 2016; Nili et al. 2014; Diedrichsen et al. 2021) effectively deal with an uneven number of trials per condition and unequal amounts of measurement noise across trials. Different RDM comparators (Diedrichsen et al. 2021) and statistical methods for generalization across stimuli (Schütt et al. 2023) have been developed, addressing shortcomings in sensitivity. Finally, both a Bayesian variant of RSA (Pattern component modelling, (Diedrichsen, Yokoi, and Arbuckle 2018) and an encoding model (Naselaris et al. 2011) can effectively deal with continuous variables or features across time points or trials in a framework that is very related to RSA (Diedrichsen and Kriegeskorte 2017). The author may not consider these newer developments to be classical, but they are in common use and certainly provide the solution to the problems raised in this paper in the setting of model-comparative RSA in which there is more than one repetition per stimulus.

      We appreciate the summary of relevant literature and have included a revised Introduction to address this bounty of relevant work. While much is owed to these authors, new developments from a diverse array of researchers outside of a single group can aid in new research questions, and should always have a place in our research landscape. We owe much to the work of Kriegeskorte’s group, and in fact, Schütt et al., 2023 served as a very relevant touchpoint in the Discussion and helped to highlight specific needs not addressed by the assessment of the “representational geometry” of an entire presented stimulus set. Principal amongst these needs is the application of trial-wise representational information that can be related to trial-wise behavioral responses and thus used to address specific questions on brain-behavior relationships. We invite the Reviewer to consider the utility of this shift with the following revisions to the Introduction.

      Page 3. “Recently, methodological advancements have addressed many known limitations in cRSA. For example, cross-validated distance measures (e.g., Euclidean distance) have improved the reliability of representational dissimilarities in the presence of noise and trial imbalance (Walther et al., 2016; Nili et al., 2014; Diedrichsen et al., 2021). Bayesian approaches such as pattern component modeling (Diedrichsen, Yokoi, & Arbuckle, 2018) have extended representational approaches to accommodate continuous stimulus features or temporal variation. Further, model comparison RSA strategies (Diedrichsen et al., 2021) and generalization techniques across stimuli (Schütt et al., 2023) have improved sensitivity and inference. Nevertheless, a common feature shared across most of these improvements is that they require stimulus repetition to examine the representational structure. This requirement limits their ability to probe brain-behavior questions at the level of individual events”.

      Page 8. “While several extensions of RSA have addressed key limitations in noise sensitivity, stimulus variance, and modeling (e.g., Diedrichsen et al., 2021; Schütt et al., 2023), our tRSA approach introduces a new methodological step by estimating representational strength at the trial level. This accounts for the multi-level variance structure in the data, affords generalizability beyond the fixed stimulus set, and allows one to test stimulus- or trial-level modulations of neural representations in a straightforward way”.

      Page 44. “Despite such prevalent appreciation for the neurocognitive relevance of stimulus properties, cRSA often does not account for the fact that the same stimulus (e.g., “basketball”) is seen by multiple subjects and produces statistically dependent data, an issue addressed by Schütt et al., 2023, who developed cross validation and bootstrap methods that explicitly model dependence across both subjects and stimulus conditions”.

      (3) The stated problem of the paper is to estimate "representational strength" in different regions or conditions. With this, the authors define the correlation of the brain RDM with a model RDM. This metric conflates a number of factors, namely the variances of the stimulus-specific patterns, the variance of the noise, the true differences between different dissimilarities, and the match between the assumed model and the data-generating model. It took me a long time to figure out that the authors are trying to solve a quite different problem in a quite different setting from the model-comparative approach to RSA that I would consider "classical" (Diedrichsen et al. 2021; Diedrichsen and Kriegeskorte 2017). In this approach, one is trying to test whether local activity patterns are better explained by representation model A or model B, and to estimate the degree to which the representation can be fully explained. In this framework, it is common practice to measure each stimulus at least 2 times, to be able to estimate the variance of noise patterns and the variance of signal patterns directly. Using this setting, I would define "representational strength" very differently from the authors. Assume (using LaTeX notation) that the activity patterns $y_{j,n}$ for stimulus j, measurement n, are composed of a true stimulus-related pattern ($u_j$) and a trial-specific noise pattern ($e_{j,n}$). As a measure of the strength of representation (or pattern), I would use an unbiased estimate of the variance of the true stimulus-specific patterns across voxels and stimuli ($\sigma^2_{u}$). This estimator can be obtained by correlating patterns of the same stimuli across repeated measures, or equivalently, by averaging the cross-validated Euclidean distances (or with spatial prewhitening, Mahalanobis distances) across all stimulus pairs.
In contrast, the current paper addresses a specific problem in a quite specific experimental design in which there is only one repetition per stimulus. This means that the authors have no direct way of distinguishing true stimulus patterns from noise processes. The trick that the authors apply here is to assume that the brain data comes from the assumed model RDM (a somewhat sketchy assumption IMO) and that everything that reduces this correlation must be measurement noise. I can now see why tRSA does make some sense for this particular question in this memory study. However, in the more common model-comparative RSA setting, having only one repetition per stimulus in the experiment would be quite a fatal design flaw. Thus, the paper would do better if the authors could spell the specific problem addressed by their method right in the beginning, rather than trying to set up tRSA as a general alternative to "classical RSA".
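For concreteness, the repeated-measures setting described in this comment can be written out as a short sketch. The symbols $P$ (number of voxels) and $J$ (number of stimuli), the zero-mean and independence assumptions, and the restriction to two measurements per stimulus are illustrative assumptions added here; they are not taken from the Reviewer's text or from the manuscript.

```latex
% Assumed generative model: signal plus measurement-specific noise
y_{j,n} = u_j + e_{j,n}, \qquad
\mathbb{E}[u_j] = 0, \quad \mathbb{E}[e_{j,n}] = 0, \quad
e_{j,1} \perp e_{j,2} \perp u_j .

% Cross-measurement products remove the noise term in expectation
\mathbb{E}\!\left[\tfrac{1}{P}\, y_{j,1}^{\top} y_{j,2}\right]
  = \tfrac{1}{P}\,\mathbb{E}\!\left[u_j^{\top} u_j\right] = \sigma^2_{u},
\qquad
\hat{\sigma}^2_{u} = \frac{1}{JP} \sum_{j=1}^{J} y_{j,1}^{\top} y_{j,2} .

% Cross-validated squared Euclidean distance between stimuli j and k
\hat{d}_{jk} = \tfrac{1}{P}\,(y_{j,1} - y_{k,1})^{\top} (y_{j,2} - y_{k,2}),
\qquad
\mathbb{E}\!\left[\hat{d}_{jk} \mid u_j, u_k\right]
  = \tfrac{1}{P}\,\lVert u_j - u_k \rVert^2 .
```

Under the additional assumption that $u_j$ and $u_k$ are independent, the expected cross-validated distance averaged over stimulus pairs equals $2\sigma^2_{u}$, which is one way to read the equivalence the comment refers to. With a single measurement per stimulus, neither cross-measurement product is available.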

      At a general level, our approach rests on the premise that there is meaningful information present in a single presentation of a given stimulus. This assumption may have less utility when the research goals are more focused on estimating the fidelity of signal patterns for RSA, as in designs with multiple repetitions. But it is an exaggeration to state that such a trial-wise approach cannot address the difference between “true” stimulus patterns and noise. This trial-wise approach has explicit utility in relating trial-wise brain information to trial-wise behavior, across multiple cognitions (not only memory studies, as applied here). We have added substantial text to the Introduction distinguishing cRSA, which is widely employed, often in cases with a single repetition per stimulus, and model comparative methods that employ multiple repetitions. We clarify that we do not consider tRSA an alternative to the model comparative approach, and discuss that operational definitions of representational strength are constrained by the study design.

      Page 3. “In this paper, we present an advancement termed trial-level RSA, or tRSA, which addresses these limitations in cRSA (not model comparison approaches) and may be utilized in paradigms with or without repeated stimuli”.

      Page 4. “Representational geometry usually refers to the structure of similarities among repeated presentations of the same stimulus in the neural data (as captured in the brain RSM) and is often estimated utilizing a model comparison approach, whereas representational strength is a derived measure that quantifies how strongly this geometry aligns with a hypothesized model RSM. In other words, geometry characterizes the pattern space itself, while representational strength reflects the degree of correspondence between that space and the theoretical model under test”.

      Finally, we clarified that in our simulation methods we assume a true underlying activity pattern and a random error pattern. The model RSM is computed based on the true pattern, whereas the brain RSM comes from the noisy pattern, not the model RSM itself.

      Page 9. “Then, we generated two sets of noise patterns, which were controlled by parameters σ<sub>A</sub> and σ<sub>B</sub> , respectively, one for each condition”.

      (4) The notation in the paper is often conflicting and should be clarified. The actual true and measured activity patterns should receive a unique notation that is distinct from the variances of these patterns across voxels. I assume that $\sigma_{ijk}$ is the noise variance (not the standard deviation)? Normally, variances are denoted with $\sigma^2$. Also, if these are variances, they cannot come from a normal distribution as indicated on page 10. Finally, multi-level models are usually defined at the level of means (i.e., patterns) rather than at the level of variances (as they seem to be done here).

      We have added notation for the true and measured activity patterns to differentiate them from our notation for variance. We agree that multilevel models are usually defined at the level of means rather than at the level of variances, and we include a figure (Fig 1D) that describes the model in terms of the means. We clarify that the σ ($\sigma$) terms used in the manuscript were not variances or standard deviations themselves; rather, they were meant to denote components of the actual (multilevel) variance parameter. Each component was sampled from a normal distribution, and the components were summed to form the final variance parameter for each trial. We have changed our notation for each component to the lowercase letter s to minimize confusion. We have also made our R code publicly available on our lab GitHub, which should provide more clarity on the exact simulation process.

      (5) In the first set of simulations, the authors sampled both model and brain RSM by drawing each cell (similarity) of the matrix from an independent bivariate normal distribution. As the authors note themselves, this way of producing RSMs violates the constraint that correlation matrices need to be positive semi-definite. Likely more seriously, it also ignores the fact that the different elements of the upper triangular part of a correlation matrix are not independent from each other (Diedrichsen et al. 2021). Therefore, it is not clear that this simulation is close enough to reality to provide any valuable insight and should be removed from the paper, along with the extensive discussion about why this simulation setting is plainly wrong (page 21). This would shorten and clarify the paper.

      We have added justification of the mixed-effects model given the potential assumption violations. We caution readers to investigate the robustness of their models, and to employ permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the Appendix. Finally, we agree that the first simulation setting does not possess several properties of realistic RDMs/RSMs; however, we believe that there is utility in understanding the mathematical properties of correlations (an essential component of RSA) in a straightforward simulation where the ground truth is known. We have therefore moved this simulation to Appendix 1.

      (6) If I understand the second simulation setting correctly, the true pattern for each stimulus was generated as an NxP matrix of i.i.d. standard normal variables. Thus, there is no condition-specific pattern at all, only condition-specific noise/signal variances. It is not clear how the tRSA would be biased if there were a condition-specific pattern (which, in reality, there usually is). Because of the i.i.d. assumption of the true signal, the correlations between all stimulus pairs within conditions are close to zero (and only differ from it by the fact that you are using a finite number of voxels). If you added a condition-specific pattern, the across-condition RSA would lead to much higher "representational strength" estimates than a within-condition RSA, with obvious problems and biases.

      The Reviewer is correct that the voxel values in the true pattern are drawn from i.i.d. standard normal distributions. We take the Reviewer’s suggestion of “condition-specific pattern” to mean that there could be a condition-voxel interaction in two non-mutually exclusive ways. The first is additive, essentially some common underlying multi-voxel pattern like [6, 34, -52, …, 8] for all condition A trials, and a different such pattern for condition B trials, etc. The second is multiplicative, essentially a vector of scaling factors [x1.5, x0.5, x0.8, …, x2.7] for all condition A trials, and a different such vector for condition B trials, etc. Both possibilities could indeed affect tRSA as much as they would cRSA.

      Importantly, if such a strong condition-specific pattern is expected, one can build a condition-specific model RDM using one-hot coding of conditions (see example figure; src: https://www.newbi4fmri.com/tutorial-9-mvpa-rsa), either to capture this interesting phenomenon or to remove it as a confounding factor. This practice has been applied in multiple regression cRSA approaches (e.g., Cichy et al., 2013) and can also be applied to tRSA.
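A condition-specific model RDM of the kind described above can be sketched in a few lines. This is a minimal illustration, assuming each trial carries a single condition label that is coded as an indicator (one-hot) vector; the function names and the example labels are hypothetical and are not taken from the manuscript's code.

```python
def one_hot(labels):
    """Code each condition label as an indicator (one-hot) vector."""
    conditions = sorted(set(labels))
    return [[1 if lab == c else 0 for c in conditions] for lab in labels]


def condition_model_rdm(labels):
    """Model RDM with 0 for same-condition pairs and 1 for different-condition pairs."""
    codes = one_hot(labels)
    n = len(codes)
    rdm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # The dot product of two one-hot codes is 1 if the conditions match, else 0
            same = sum(a * b for a, b in zip(codes[i], codes[j]))
            rdm[i][j] = 1 - same
    return rdm


labels = ["A", "A", "B", "B", "A"]  # hypothetical trial conditions
rdm = condition_model_rdm(labels)
for row in rdm:
    print(row)
```

Such a matrix could then enter a multiple regression alongside the model RDM of interest, either as a predictor in its own right or as a nuisance term.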

      (7) The trial-level brain RDM to model Spearman correlations was analyzed using a mixed effects model. However, given the symmetry of the RDM, the correlations coming from different rows of the matrix are not independent, which is an assumption of the mixed effect model. This does not seem to induce an increase in Type I errors in the conditions studied, but there is no clear justification for this procedure, which needs to be justified.

      We appreciate this important warning, and now caution readers to investigate the robustness of their models, and consider employing permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the supplement.

      Page 46. “While linear mixed-effects modeling offers a powerful framework for analyzing representational similarity data, it is critical that researchers carefully construct and validate their models. The multilevel structure of RSA data introduces potential dependencies across subjects, stimuli, and trials, which can violate assumptions of independence if not properly modeled. In the present study, we used a model that included random intercepts for both subjects and stimuli, which accounts for variance at these levels and improves the generalizability of fixed-effect estimates. Still, there is a potential for systematic dependence across trials within a subject. To ensure that the model assumptions were satisfied, we conducted a series of diagnostic checks on an exemplar ROI (right LOC; middle occipital gyrus) in the Object Perception dataset, including visual inspection of residual distributions and autocorrelation (Appendix 3, Figure 13). These diagnostics supported the assumptions of normality, homoscedasticity, and conditional independence of residuals. In addition, we conducted permutation-based inference, similar to prior improvements to cRSA (Nili et al., 2014), using a nested model comparison to test whether the mean similarity in this ROI was significantly greater than zero. The observed likelihood ratio test statistic fell in the extreme tail of the null distribution (Appendix 3, Figure 14), providing strong nonparametric evidence for the reliability of the observed effect. We emphasize that this type of model checking and permutation testing is not merely confirmatory but can help validate key assumptions in RSA modeling, especially when applying mixed-effects models to neural similarity data. Researchers are encouraged to adopt similar procedures to ensure the robustness and interpretability of their findings”.

      Exemplar Permutation Testing

      To test whether the mean representational strength in the ROI right LOC (middle occipital gyrus) was significantly greater than zero, we used a permutation-based likelihood ratio test implemented via the permlmer function. This test compares two nested linear mixed-effects models fit using the lmer function from the lme4 package, both including random intercepts for Participant and Stimulus ID to account for between-subject and between-item variability.

      The null model excluded a fixed intercept term, effectively constraining the mean similarity to zero after accounting for random effects:

      ROI ~ 0 + (1 | Participant) + (1 | Stimulus)

      The full model included the same random effects structure but allowed the intercept to be freely estimated:

      ROI ~ 1 + (1 | Participant) + (1 | Stimulus)

      By comparing the fit of these two models, we directly tested whether the average similarity in this ROI was significantly different from zero. Permutation testing (1,000 permutations) was used to generate a nonparametric p-value, providing inference without relying on normality assumptions. The full model, which estimated a nonzero mean similarity in the right LOC (middle occipital gyrus), showed a significantly better fit to the data than the null model that fixed the mean at zero (χ²(1) = 17.60, p = 2.72 × 10⁻⁵). The permutation-based p-value obtained from permlmer confirmed this effect as statistically significant (p = 0.0099), indicating that the mean similarity in this ROI was reliably greater than zero. These results support the conclusion that the right LOC contains representational structure consistent with the HMAXc2 RSM. A density plot of the permuted likelihood ratio tests is plotted along with the observed likelihood ratio test in Appendix 3 Figure 14.
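The logic of a permutation test for a mean greater than zero can be illustrated with a deliberately simplified sketch. It is not the permlmer likelihood-ratio procedure described above: it uses a one-sided sign-flip test on hypothetical trial-level values, ignores the random effects of Participant and Stimulus, and the function name `sign_flip_p_value` and the simulated data are assumptions made for illustration only.

```python
import random


def sign_flip_p_value(values, n_perm=1000, seed=0):
    """One-sided permutation p-value for H0: mean = 0 against mean > 0."""
    rng = random.Random(seed)
    observed = sum(values) / len(values)
    exceed = 0
    for _ in range(n_perm):
        # Under H0 (values symmetric about zero) each sign is equally likely
        flipped = [v if rng.random() < 0.5 else -v for v in values]
        if sum(flipped) / len(flipped) >= observed:
            exceed += 1
    # Count the observed statistic itself so the p-value is never exactly zero
    return (exceed + 1) / (n_perm + 1)


data_rng = random.Random(1)
# Hypothetical trial-level similarity values with a small positive mean
similarities = [data_rng.gauss(0.05, 0.1) for _ in range(200)]
p = sign_flip_p_value(similarities)
print(p)
```

With this construction the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations bounds the resolution of the test.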

      (8) For the empirical data, it is not clear to me to what degree the "representational strength" of cRSA and tRSA is actually comparable. In cRSA, the Spearman correlation assesses whether the distances in the data RSM are ranked in the same order as in the model. For tRSA, the comparison is made for every row of the RSM, which introduces a larger degree of flexibility (possibly explaining the higher correlations in the first simulation). Thus, could the gains presented in Figure 7D not simply arise from the fact that you are testing different questions? A clearer theoretical analysis of the difference between the average row-wise Spearman correlation and the matrix-wise Spearman correlation is urgently needed. The behavior will likely vary with the structure of the true model RDM/RSM.

      We agree that a comparison between the mean row-wise Spearman correlations and the matrix-wise Spearman correlation is needed. We believe that simulations are the best approach for this comparison, since they are much more robust than the empirical dataset and have the advantage that the true pattern and noise levels are known. We have expanded our comparison of mean tRSA values and matrix-wise Spearman correlations on page 42.

      Page 42. “Although tRSA and cRSA both aim to quantify representational strength, they differ in how they operationalize this concept. cRSA summarizes the correspondence between RSMs as a single measure, such as the matrix-wise Spearman correlation. In contrast, tRSA computes such correspondence for each trial, enabling estimates at the level of individual observations. This flexibility allows trial-level variability to be modeled directly, but also introduces subtle differences in what is being measured. Nonetheless, our simulations showed that, although numerical differences occasionally emerged—particularly when comparing between-condition tRSA estimates to within-condition cRSA estimates—the magnitude of divergence was small and did not affect the outcome of downstream statistical tests”.
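The two operationalizations contrasted above can be compared on toy data. The sketch below is an assumption-laden illustration rather than the manuscript's simulation code: it draws a hypothetical i.i.d. "true" pattern per stimulus, adds Gaussian noise, builds the model RSM from the true patterns and the brain RSM from the noisy ones, and then computes one matrix-wise Spearman correlation (cRSA-style) and one Spearman correlation per row with the diagonal excluded (tRSA-style). All names and parameter values are illustrative.

```python
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(x):
    """Average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def rsm(patterns):
    """Stimulus-by-stimulus matrix of Pearson correlations between patterns."""
    n = len(patterns)
    return [[pearson(patterns[i], patterns[j]) for j in range(n)] for i in range(n)]


rng = random.Random(0)
n_stim, n_vox, noise_sd = 12, 50, 1.0  # hypothetical sizes and noise level
true = [[rng.gauss(0, 1) for _ in range(n_vox)] for _ in range(n_stim)]
noisy = [[v + rng.gauss(0, noise_sd) for v in row] for row in true]
model_rsm, brain_rsm = rsm(true), rsm(noisy)

# cRSA-style: a single Spearman correlation over the upper triangle
upper = [(i, j) for i in range(n_stim) for j in range(i + 1, n_stim)]
matrix_wise = spearman([model_rsm[i][j] for i, j in upper],
                       [brain_rsm[i][j] for i, j in upper])

# tRSA-style: one Spearman correlation per row, diagonal excluded
row_wise = [
    spearman([model_rsm[i][j] for j in range(n_stim) if j != i],
             [brain_rsm[i][j] for j in range(n_stim) if j != i])
    for i in range(n_stim)
]
mean_row_wise = sum(row_wise) / n_stim

print(matrix_wise, mean_row_wise)
```

The row-wise values form a per-trial vector that could be passed to a mixed-effects model, whereas the matrix-wise value is a single summary per subject and region.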

      (9) For the real data, there are a number of additional sources of bias that need to be considered for the analysis. What if there are not only condition-specific differences in noise variance, but also a condition-specific pattern? Given that the stimuli were measured in 3 different imaging runs, you cannot assume that all measurement noise is i.i.d. - stimuli from the same run will likely have a higher correlation with each other.

      We recognize the potential of condition-specific patterns and chose to constrain the analyses to those most comparable with cRSA. However, depending on their hypotheses, researchers may consider testing condition RSMs and utilizing a model comparison approach or employ the z-scored approach, as employed in the simulations above. Regarding the potential run confounds, this is always the case in RSA and why we exclude within-run comparisons. We have also added to the Discussion the suggestion to include run as a covariate in their mixed-effects models. However, we do not employ this covariate here as we preferred the most parsimonious model to compare with cRSA.

      Page 46 - 47. “Further, while analyses here were largely designed to be comparable with cRSA, researchers should consider taking advantage of the flexibility of mixed-effects models and include covariates of non-interest (run, trial order, etc.)”.

      (10) The discussion should be rewritten in light of the fact that the setting considered here is very different from the model-comparative RSA in which one usually has multiple measurements per stimulus per subject. In this setting, existing approaches such as RSA or PCM do indeed allow for the full modelling of differences in the "representational strength" - i.e., pattern variance across subjects, conditions, and stimuli.

      We agree that studies advancing designs with multiple repetitions of a given stimulus image are useful in estimating the reliability of concept representations. We would argue however that model comparison in RSA is not restricted to such data. Many extant studies do not in fact have multiple repetitions per stimulus per subject (Wang et al., 2018 https://doi.org/10.1088/1741-2552/abecc3, Gao et al, 2022 https://doi.org/10.1093/cercor/bhac058, Li et al, 2022 https://doi.org/10.1002/hbm.26195, Staples & Graves, 2020 https://doi.org/10.1162/nol_a_00018) that allow for that type of model-comparative approach. While beneficial in terms of noise estimation, having multiple presentations was not a requirement for implementing cRSA (Kriegeskorte, 2008 https://doi.org/10.3389/neuro.06.004.2008). The aim of this manuscript is to introduce the tRSA approach to the broad community of researchers whose research questions and datasets could vary vastly, including but not limited to the number of repeated presentations and the balance of trial counts across conditions.

      (11) Cross-validated distances provide a powerful tool to control for differences in measurement noise variances and possible covariances in measurement noise across trials, which has many distinct advantages and is conceptually very different from the approach taken here.

      We have added language on the value of cross-validation approaches to RSA in the Discussion:

      Page 47. “Additionally, we note that while our proposed tRSA framework provides a flexible and statistically principled approach for modeling trial-level representational strength, we acknowledge that there are alternative methods for addressing trial-level variability in RSA. In particular, the use of cross-validated distance metrics (e.g., crossnobis distance) has become increasingly popular for controlling differences in measurement noise variance and accounting for possible covariance structures across trials (Walther et al., 2016). These metrics offer several advantages, including unbiased estimation of representational dissimilarities under Gaussian noise assumptions and improved generalization to unseen data. However, cross-validated distances are conceptually distinct from the approach taken here: whereas cross-validation aims to correct for noise-related biases in representational dissimilarity matrices, our trial-level RSA method focuses on estimating and modeling the variability in representation strength across individual trials using mixed-effects modeling. Rather than proposing a replacement for cross-validated RSA, tRSA adds a complementary tool to the methodological toolkit—one that supports hypothesis-driven inference about condition effects and trial-level covariates, while leveraging the full structure of the data”.

      (12) One of the main limitations of tRSA is the assumption that the model RDM is actually the true brain RDM, which may not be the case. Thus, in theory, there could be a different model RDM, in which representational strength measures would be very different. These differences should be explained more fully, hopefully leading to a more accessible paper.

      Indeed, the chosen model RSM may not be the true RSM, but as the noise level increases the correlation between RSMs practically becomes zero. In our simulations we assume this to be true as a straightforward way to manipulate the correspondence between the brain data and the model. However, just like cRSA, tRSA is constrained by the model selections the researchers employ. We encourage researchers to have carefully considered theoretically-motivated models and, if their research questions require, consider multiple and potentially competing models. Furthermore, the trial-wise estimates produced by tRSA encourage testing competing models within the multiple regression framework. We have added this language to the Discussion.

      Page 46. “...choose their model RSMs carefully. In our simulations, we designed our model RSM to be the “true” RSM for demonstration purposes. However, researchers should consider if their models and model alternatives”.

      Pages 45-46. “While a number of studies have addressed the validity of measuring representational geometry using designs with multiple repetitions, a conceptual benefit of the tRSA approach is the reliance on a regression framework that engenders the testing of competing conceptual models of stimulus representation (e.g., taxonomic vs. encyclopedic semantic features, as in Davis et al., 2021)”.

      Reviewer #2 (Public review):

      (1)  While I generally welcome the contribution, I take some issue with the accusatory tone of the manuscript in the Introduction. The text there (using words such as 'ignored variances', 'errouneous inferences', 'one must', 'not well-suited', 'misleading') appears aimed at turning cRSA in a 'straw man' with many limitations that other researchers have not recognized but that the new proposed method supposedly resolves. This can be written in a more nuanced, constructive manner without accusing the numerous users of this popular method of ignorance.

      We apologize for the unintended accusatory tone. We have clarified the many robust approaches to RSA and have made our Introduction and Discussion more nuanced throughout (see also 3, 11 and 16).

      (2) The described limitations are also not entirely correct, in my view: for example, statistical inference in cRSA is not always done using classic parametric statistics such as t-tests (cf Figure 1): the rsatoolbox paper by Nili et al. (2014) outlines non-parametric alternatives based on permutation tests, bootstrapping and sign tests, which are commonly used in the field. Nor has RSA ever been conducted at the row/column level (here referred to by the authors as 'trial level'; cf King et al., 2018).

      We agree there are numerous methods that go beyond cRSA to address these limitations and have added discussion of them to our manuscript, as well as an example analysis implementing permutation tests on tRSA data (see response to 7). We thank the reviewer for bringing King et al., 2014 and their temporal generalization method to our attention; we have added a reference acknowledging their decoding-based temporal generalization approach.

      Page 8. “It is also important to note that some prior work has examined similarly fine-grained representations in time-resolved neuroimaging data, such as the temporal generalization method introduced by King et al. (see King & Dehaene, 2014). Their approach trains classifiers at each time point and tests them across all others, resulting in a temporal generalization matrix that reflects decoding accuracy over time. While such matrices share some structural similarity with RSMs, they do not involve correlating trial-level pattern vectors with model RSMs nor do their second-level models include trial-wise, subject-wise, and item-wise variability simultaneously”.

      (3) One of the advantages of cRSA is its simplicity. Adding linear mixed effects modeling to RSA introduces a host of additional 'analysis parameters' pertaining to the choice of the model setup (random effects, fixed effects, interactions, what error terms to use) - how should future users of tRSA navigate this?

      We appreciate the opportunity to offer more specific prescriptions for those employing a tRSA technique, and have added them to the Discussion:

      Page 46. “While linear mixed-effects modeling offers a powerful framework for analyzing representational similarity data, it is critical that researchers carefully construct and validate their models and choose their model RSMs carefully. In our simulations, we designed our model RSM to be the “true” RSM for demonstration purposes. However, researchers should always consider whether their models match the goals of their analysis, including 1) constructing a random effects structure that will converge in their dataset, 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020), and 3) considering which effects should be treated as random or fixed depending on their research question”.

      (4) Here, only a single real fMRI dataset is used with a quite complicated experimental design for the memory part; it's not clear if there is any benefit of using tRSA on a simpler real dataset. What's the benefit of tRSA in classic RSA datasets (e.g., Kriegeskorte et al., 2008), with fixed stimulus conditions and no behavior?

      To clarify, our empirical approach uses two different tasks: an Object Perception task more akin to the classic RSA datasets employing passive viewing, and a Conceptual Retrieval task that more directly addresses the benefits of the trialwise approach. We felt that our Object Perception dataset is a simpler empirical fMRI dataset without explicit task conditions or a dichotomous behavioral outcome, whereas the Retrieval dataset is more involved (though old/new recognition is the most common form of memory retrieval testing) and  dependent on behavioral outcomes. However, we recognize the utility of replication from other research groups and do invite researchers to utilize tRSA on their datasets.

      (5) The cells of an RDM/RSM reflect pairwise comparisons between response patterns (typically a brain but can be any system; cf Sucholutsky et al., 2023). Because the response patterns are repeatedly compared, the cells of this matrix are not independent of one another. Does this raise issues with the validity of the linear mixed effects model? Does it assume the observations are linearly independent?

      We recognize the potential danger of not meeting model assumptions. Though our simulation results and model checks suggest this is not a fatal flaw in the model design, we caution readers to investigate the robustness of their models, and to consider employing permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the Appendix. See response to R1.

      (6) The manuscript assumes the reader is familiar with technical statistical terms such as Type I/II error, sensitivity, specificity, homoscedasticity assumptions, as well as linear mixed models (fixed effects, random effects, etc). I am concerned that this jargon makes the paper difficult to understand for a broad readership or even researchers currently using cRSA that might be interested in trying tRSA.

      We agree that this jargon may make the paper difficult to understand. We have expanded or added definitions of these terms throughout the Methods and Results sections.

      Page 12. “Given data generated with 𝑠<sub>𝑐𝑜𝑛𝑑,𝐴</sub> = 𝑠<sub>𝑐𝑜𝑛𝑑,B</sub>, the correct inference should be a failure to reject the null hypothesis of b<sub>B-𝐴</sub> = 0; any significant (𝑝 < 0.05) result in either direction was considered a false positive (spurious effect, or Type I error). Given data generated with 𝑠<sub>𝑐𝑜𝑛𝑑,𝐴</sub> < 𝑠<sub>𝑐𝑜𝑛𝑑,B</sub>, the inference was considered correct if it rejected the null hypothesis of b<sub>B-𝐴</sub> = 0 and yielded the expected sign of the estimated contrast (b<sub>B-𝐴</sub> < 0). A significant result with the reverse sign of the estimated contrast (b<sub>B-𝐴</sub> > 0) was considered a Type I error, and a nonsignificant (𝑝 ≥ 0.05) result was considered a false negative (failure to detect a true effect, or Type II error)”.

      Page 2. “Compared to cRSA, the multi-level framework of tRSA was both more theoretically appropriate and significantly more sensitive to (better able to detect) true effects”.

      Page 25. “The performance of cRSA and tRSA were quantified with their specificity (better avoids false positives, 1 - Type I error rate) and sensitivity (better avoids false negatives, 1 - Type II error rate)”.
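
      The classification rules and the two summary rates quoted above can be sketched as follows. This is a minimal illustration, assuming a significance threshold of 0.05 and an expected negative contrast; the function names and the example outcome lists are hypothetical and are not taken from the manuscript's code.

```python
def classify_outcome(p, b, effect_present, alpha=0.05, expected_sign=-1):
    """Label one simulated test result following the rules quoted above."""
    significant = p < alpha
    if not effect_present:
        # No true effect: any significant result is a false positive.
        return "type_I" if significant else "correct"
    if not significant:
        # True effect missed.
        return "type_II"
    # Significant: correct only if the contrast has the expected sign.
    return "correct" if b * expected_sign > 0 else "type_I"

def specificity_sensitivity(null_outcomes, effect_outcomes):
    """Specificity = 1 - Type I rate (null data); sensitivity = 1 - Type II rate (effect data)."""
    type_i_rate = null_outcomes.count("type_I") / len(null_outcomes)
    type_ii_rate = effect_outcomes.count("type_II") / len(effect_outcomes)
    return 1 - type_i_rate, 1 - type_ii_rate

# Hypothetical outcomes from four null and four effect simulations.
spec, sens = specificity_sensitivity(
    ["correct", "correct", "type_I", "correct"],
    ["correct", "type_II", "correct", "correct"],
)
```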

      Page 6. “One of the fundamental assumptions of general linear models (step 4 of cRSA; see Figure 1D) is homoscedasticity or homogeneity of variance — that is, all residuals should have equal variance”.

      Page 11. “Specifically, a linear mixed-effects model with a fixed effect of condition (which estimates the average effect across the entire sample, capturing the overall effect of interest) and random effects of both subjects and stimuli (which model variation in responses due to differences between individual subjects and items, allowing generalization beyond the sample) were fitted to tRSA estimates via the `lme4 1.1-35.3` package in R (Bates et al., 2015), and p-values were estimated using Satterthwaite’s method via the `lmerTest 3.1-3` package (Kuznetsova et al., 2017)”.

      (7) I could not find any statement on data availability or code availability. Given that the manuscript reuses prior data and proposes a new method, making data and code/tutorials openly available would greatly enhance the potential impact and utility for the community.

      We thank the reviewer for raising our oversight here. We have added our code and data availability statements.

      Page 9. “Data is available upon request to the corresponding author and our simulations and example tRSA code is available at https://github.com/electricdinolab”.

      Reviewer #1 (Recommendations for the authors):

      (13) Page 4: The limitations of cRSA seem to be based on the assumption that within each different experimental condition, there are different stimuli, which get combined into the condition. The framework of RSA, however, does not dictate whether you calculate a condition x condition RDM or a larger and more complete stimulus x stimulus RDM. Indeed, in practice we often do the latter? Or are you assuming that each stimulus is only shown once overall? It would be useful at this point to spell out these implicit assumptions.

      We agree that stimulus x stimulus RDMs can be constructed and are often used. However, as we mentioned in the Introduction, researchers are often interested in the difference between two (or more) conditions, such as “remembered” vs. “forgotten” (Davis et al., https://doi.org/10.1093/cercor/bhaa269) or “high cognitive load” vs. “low cognitive load” (Beynel et al., https://doi.org/10.1523/JNEUROSCI.0531-20.2020). In those cases, the most common practice with cRSA is to construct condition-specific RDMs, compute cRSA scores separately for each condition, and then compare the scores at the group level. The number of times each stimulus gets presented does not prevent one from creating a model RDM that has the same rows and columns as the brain RDM, either in the same condition (“high load”) or across different conditions.

      (14) Page 5: The difference between condition-level and stimulus-level is not clear. Indeed, this definition seems to be a function of the exact experimental design and is certainly up for interpretation. For example, if I conduct a study looking at the activity patterns for 4 different hand actions, each repeated multiple times, are these actions considered stimuli or conditions?

      We have added clarifying language about what is considered stimuli vs conditions. Indeed, this will depend on the specific research questions being employed and will affect how researchers construct their models. In this specific example, one would most likely consider each different hand action a condition, treating them as fixed effects rather than random effects, given their very limited number and the lack of need to generalize findings to the broader “hand actions” category.

      Page 5. “Critically, the distinction between condition-level and stimulus level is not always clear as researchers may manipulate stimulus-level features themselves. In these cases, what researchers ultimately consider condition-level and stimulus-level will depend on their specific research questions. For example, researchers intending to study generalized object representation may consider object category a stimulus-level feature, while researchers interested in if/how object representation varies by category may consider the same category variable condition-level”.

      (15) Page 5: The fact that different numbers of trials / different levels of measurement noise / noise-covariance of different conditions biases non-cross-validated distances is well known and repeatedly expressed in the literature. We have shown that cross-validation of distances effectively removes such biases - of course, it does not remove the increased estimation variability of these distances (for a formal analysis of estimation noise on condition patterns and variance of the cross-nobis estimator, see (Diedrichsen et al. 2021)).

      We thank the reviewer for drawing our attention to this literature and have added discussions of these methods.

      (16). Page 5: "Most studies present subjects with a fixed set of stimuli, which are supposedly samples representative of some broader category". This may be the case for a certain type of RSA experiments in the visual domain, but it would be unfair to say that this is a feature of RSA studies in general. In most studies I have been involved in, we use a "stimulus" x "stimulus" RDM.

      We have edited this sentence to avoid the “most” characterization. We also added substantial text to the Introduction and Discussion distinguishing cRSA, which is nonetheless widely employed, especially in cases with a single repetition per stimulus (Macklin et al., 2023; Liu et al., 2024), from the model-comparative method, and explicitly stating that we do not consider tRSA an alternative to the model-comparative approach.

      (17). Page 5: I agree that "stimuli" should ideally be considered a random effect if "stimuli" can be thought of as sampled from a larger population and one wants to make inferences about that larger population. Sometimes stimuli/conditions are more appropriately considered a fixed effect (for example, when studying the response to stimulation of the 5 fingers of the right hand). Techniques to consider stimuli/conditions as a random effect have been published by the group of Niko Kriegeskorte (Schütt et al. 2023).

      Indeed, in some cases what may be thought of as “stimuli” would be more appropriately entered into the model as a fixed effect; such questions are increasingly relevant given the focus on item-wise stimulus properties (Bainbridge et al., Westfall & Yarkoni). We have added text on this issue to the Discussion and caution researchers to employ models that most directly answer their research questions.

      Page 46. “However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question. An effect is fixed when the levels represent the specific conditions of theoretical interest (e.g., task condition) and the goal is to estimate and interpret those differences directly. In contrast, an effect is random when the levels are sampled from a broader population (e.g., subjects) and the goal is to account for their variability while generalizing beyond the sample tested. Note that the same variable (e.g., stimuli) may be considered fixed or random depending on the research questions”.

      (18) Page 6: It is correct that the "classical" RSA depends on a categorical assignment of different trials to different stimuli/conditions, such that a stimulus x stimulus RDM can be computed. However, both Pattern Component Modelling (PCM) and Encoding models are ideally set up to deal with variables that vary continuously on a trial-by-trial or moment-by-moment basis. tRSA should be compared to these approaches, or - as it should be clarified - that the problem setting is actually quite a different one.

      We agree that PCM and encoding models offer a flexible approach and handle continuous trial-by-trial variables. We have clarified the problem setting in cRSA is distinct on page 6, and we have added the robustness of encoding models and their limitations to the Discussion.

      Page 6. “While other approaches such as Pattern Component Modeling (PCM) (Diedrichsen et al., 2018) and encoding models (Naselaris et al., 2011) are well-suited to analyzing variables that vary continuously on a trial-by-trial or moment-by-moment basis, these frameworks address different inferential goals. Specifically, PCM and encoding models focus on estimating variance components or predicting activation from features, while cRSA is designed to evaluate representational geometry. Thus, cRSA as well as our proposed approach address a problem setting distinct from PCM and encoding models”.

      (19) Page 8: "Then, we generated two noise patterns, which were controlled by parameters 𝜎𝐴 and 𝜎𝐵, respectively, one for each condition." This makes little sense to me. The noise patterns should be unique to each trial - you should generate n_a + n_b noise patterns, no?

      We clarify that the “noise patterns” here are n_voxel x n_trial in size; in other words, all trial-level noise patterns are generated together and each trial has its own unique noise pattern. We have revised our description to “two sets of noise patterns” for clarity starting on page 9.

      (20) Page 9: First, I assume if this is supposed to be a hierarchical level model, the "noise parameters" here correspond to variances? Or do these \sigma values mean to signify standard deviations? The latter would make little sense. Or is it the noise pattern itself?

      As clarified in 4., the σ values are meant to denote hierarchical components of the composite standard deviation; we have updated our notation to use lower case letter s instead for clarity.

      (21) Page 10: your formula states "𝜎<sub>𝑠𝑢𝑏𝑗</sub>~ 𝙽(0, 0.5^2)". This conflicts with your previous mention that the \sigmas are noise "levels"; are they the noise patterns themselves now? Variances cannot be normally distributed, as they cannot be negative.

      As clarified in 4., the σ values are meant to denote hierarchical components of the composite standard deviation; we have updated our notation to use lower case letter s instead for clarity.

      (22) Page 13: What was the task of the subject in the Memory retrieval task? Old/new judgements relative to encoding of object perception?

      We apologize for the lack of clarity about the Memory Retrieval task and have added that information and clarified that the old/new judgements were relative to a separate encoding phase, the brain data for which has been reported elsewhere.

      Page 14. “Memory Retrieval took place one day after Memory Encoding and involved testing participants’ memory of the objects seen in the Encoding phase. Neural data during the Encoding phase has been reported elsewhere. In the main Memory Retrieval task, participants were presented with 144 labels of real-world objects, of which 114 were labels for previously seen objects and 30 were unrelated novel distractors. Participants performed old/new judgements, as well as their confidence in those judgements on a four-point scale (1 = Definitely New, 2 = Probably New, 3 = Probably Old, 4 = Definitely Old)”.

      (23) Page 13: If "Memory Retrieval consisted of three scanning runs", then some of the stimulus x stimulus correlations for the RSM must have been calculated within a run and some between runs, correct? Given that all within-run estimates share a common baseline, they share some dependence. Was there a systematic difference between the within-run and the between-run correlations?

      We have clarified in this portion of the Methods that within-run comparisons were excluded from our analyses. We also double-checked that the within-run exclusion was included in the description of the Neural RSMs.

      Page 14. “Retrieval consisted of three scanning runs, each with 38 trials, lasting approximately 9 minutes and 12 seconds (within-run comparisons were later excluded from RSA analyses)”.

      Page 18. “This was done by vectorizing the voxel-level activation values within each region and calculating their correlations using Pearson’s r, excluding all within-run comparisons.”
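
      The within-run exclusion described above can be sketched as follows: a trial-by-trial Pearson RSM is computed and every cell that pairs two trials from the same run is masked. This is a minimal sketch, not the authors' pipeline; the function name, the voxel count, and the random data are assumptions, while the three runs of 38 trials follow the quoted text.

```python
import numpy as np

def neural_rsm_between_runs(patterns, runs):
    """Pearson RSM over trials with within-run cells (and the diagonal) set to NaN.

    patterns: array of shape (n_trials, n_voxels)
    runs:     array of shape (n_trials,) giving each trial's run label
    """
    patterns = np.asarray(patterns, dtype=float)
    runs = np.asarray(runs)
    # np.corrcoef treats each row as a variable, giving trial-by-trial correlations.
    rsm = np.corrcoef(patterns)
    same_run = runs[:, None] == runs[None, :]
    rsm[same_run] = np.nan  # drop within-run comparisons
    return rsm

# Hypothetical data: 3 runs x 38 trials, 50 voxels in the region.
run_labels = np.repeat([1, 2, 3], 38)
voxel_patterns = np.random.default_rng(0).normal(size=(run_labels.size, 50))
example_rsm = neural_rsm_between_runs(voxel_patterns, run_labels)
```

      Downstream statistics then need to ignore the masked cells, for example by selecting finite values before correlating with a model RSM.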

      (24) Page 20: It is not clear why the mean estimate of "representational strength" (i.e., model-brain RSM correlations) is important at all. This comes back to Major point #2, namely that you are trying to solve a very different problem from model-comparative RSA.

      We have clarified that our approach is not an alternative to model-comparative RSA, and that depending on the task constraints researchers may choose to compare models with tRSA or other approaches requiring stimulus repetition (see 3).

      (25) Page 21: I believe the problems of simulating correlation matrices directly in the way that the authors in their first simulation did should be well known and should be moved to an appendix at best. Better yet, the authors could start with the correct simulation right away.

      We agree the paper is more concise with these simulations being moved to the appendix and more briefly discussed. We have implemented these changes (Appendix 1). However, we are not certain that this problem is unknown, and have several anecdotes of researchers inquiring about this “alternative” approach in talks with colleagues, thus we do still discuss the issues with this method.

      (26) Page 26: Is the "underlying continuous noise variable 𝜎𝑡𝑟𝑖𝑎𝑙 that was measured by 𝑣𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑 " the variance of the noise pattern or the noise pattern itself? What does it mean it was "measured" - how?

      𝜎𝑡𝑟𝑖𝑎𝑙 is a vector of standard deviations for different trials, and 𝜎𝑡𝑟𝑖𝑎𝑙 i would be used to generate the noise patterns for trial i. v_measured is a hypothetical measurement of trial-level variability, such as “memorability” or “heartbeat variability”. We have revised our description to clarify our methods.

      Reviewer #2 (Recommendations for the authors):

      (8) It would be helpful to provide more clarity earlier on in the manuscript on what is a 'trial': in my experience, a row or column of the RDM is usually referred to as 'stimulus condition', which is typically estimated on multiple trials (instances or repeats) of that stimulus condition (or exemplars from that stimulus class) being presented to the subject. Here, a 'trial' is both one measurement (i.e., single, individual presentation of a stimulus) and also an entry in the RDM, but is this the most typical scenario for cRSA? There is a section in the Discussion that discusses repetitions, but I would welcome more clarity on this from the get-go.

      We have added discussion of stimulus repetition methods and datasets to the Introduction and clarified our use of the terms.

      Page 8. “Critically, in single-presentation designs, a “trial” refers to one stimulus presentation, and corresponds to a row or column in the RSM. In studies with repeated stimuli, these rows are often called “conditions” and may reflect aggregated patterns across trials. tRSA is compatible with both cases: whether rows represent individual trials or averaged trials that create “conditions”, tRSA estimates are computed at the row level”.
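
      One plausible reading of "computed at the row level" is sketched below: each row of the brain RSM is correlated with the same row of the model RSM, leaving out the diagonal cell, so that every trial (or condition) receives its own estimate. This is an assumption made for illustration; the function name and the use of Pearson's r are hypothetical, and the manuscript's exact estimator may differ.

```python
import numpy as np

def row_level_estimates(brain_rsm, model_rsm):
    """Return one brain-model correlation per RSM row, excluding the diagonal cell."""
    brain_rsm = np.asarray(brain_rsm, dtype=float)
    model_rsm = np.asarray(model_rsm, dtype=float)
    n = brain_rsm.shape[0]
    estimates = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # a row's self-similarity carries no information
        estimates[i] = np.corrcoef(brain_rsm[i, keep], model_rsm[i, keep])[0, 1]
    return estimates

# Hypothetical symmetric RSMs for 10 trials.
rng = np.random.default_rng(0)
raw = rng.normal(size=(10, 10))
demo_model = (raw + raw.T) / 2
noise = rng.normal(scale=0.5, size=(10, 10))
demo_brain = demo_model + (noise + noise.T) / 2
demo_estimates = row_level_estimates(demo_brain, demo_model)
```

      The resulting vector has one value per row, which is the unit a mixed-effects model with subject and stimulus random effects would then take as its outcome.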

      (9) The quality of the results figures can be improved. For example, axes labels are hard to read in Figure 3A/B, panels 3C/D are hard to read in general. In Figure 7E, it's not possible to identify the 'dark red' brain regions in addition to the light red ones.

      We thank the reviewer for raising these and have edited the figures to be more readable in the manner suggested.

      (10) I would be interested to see a comparison between tRSA and cRSA in other fMRI (or other modality) datasets that have been extensively reported in the literature. These could be the original Kriegeskorte 96 stimulus monkey/fMRI datasets, commonly used open datasets in visual perception (e.g., THINGS, NSD), or the above-mentioned King et al. dataset, which has been analyzed in various papers.

      We recognize the great utility of replication from other research groups and do invite researchers to utilize tRSA on their datasets.

      (11) On P39, the authors suggest 'researchers can confidently replace their existing cRSA analysis with tRSA': Please discuss/comment on how researchers should navigate the choice of modeling parameters in tRSA's linear mixed effects setting.

      We have added discussion of the mixed-effects parameters and encourage researchers to follow best practices for their model selection.

      Page 46. “However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question”.

      (12) The final part of the Results section, demonstrating the tRSA results for the continuous memorability factor in the real fMRI data, could benefit from some substantiation/elaboration. It wasn't clear to me, for example, to what extent the observed significant association between representational strength and item memorability in this dataset is to be 'believed'; the Discussion section (p38). Was there any evidence in the original paper for this association? Or do we just assume this is likely true in the brain, based on prior literature by e.g. Bainbridge et al (who probably did not use tRSA but rather classic methods)?

      Indeed, memorability effects have been replicated in the literature, but not using the tRSA method. We have expanded our discussion to clarify the relationship of our findings and the relevant literature and methods it has employed.

      Page 38. “Critically, memorability is a robust stimulus property that is consistent across participants and paradigms (Bainbridge, 2022). Moreover, object memorability effects have been replicated using a variety of methods aside from tRSA, including univariate analyses and representational analyses of neural activity patterns where trial-level neural activity pattern estimates are correlated directly with object memorability (Slayton et al, 2025).”

      (13) The abstract could benefit from more nuance; I'm not sure if RSA can indeed be said to be 'the principal method', and whether it's about assessing 'quality' of representations (more commonly, the term 'geometry' or 'structure' is used).

      We have edited the abstract to reflect the true nuance in the current approaches.

      Abstract. Neural representation refers to the brain activity that stands in for one’s cognitive experience, and in cognitive neuroscience, a prominent method of studying neural representations is representational similarity analysis (RSA). While there are several recent advances in RSA, the classic RSA (cRSA) approach examines the structure of representations across numerous items by assessing the correspondence between two representational similarity matrices (RSMs): usually one based on a theoretical model of stimulus similarity and the other based on similarity in measured neural data.

      (14) RSA is also not necessarily about models vs. neural data; it can also be between two neural systems (e.g., monkey vs. human as in Kriegeskorte et al., 2008) or model systems (see Sucholutsky et al., 2023). This statement is also repeated in the Introduction paragraph 1 (later on, it is correctly stated that comparing brain vs. model is most likely the 'most common' approach).

      We have added these examples in our introduction to RSA.

      Page 3. “One of the central approaches for evaluating information represented in the brain is representational similarity analysis (RSA), an analytical approach that queries the representational geometry of the brain in terms of its alignment with the representational geometry of some cognitive model (Kriegeskorte et al., 2008; Kriegeskorte & Kievit, 2013), or, in some cases, compares the representational geometry of two neural systems (e.g., Kriegeskorte et al., 2008) or two model systems (Sucholutsky et al., 2023)”.

      (15) 'theoretically appropriate' is an ambiguous statement, appropriate for what theory?

      We apologize for the ambiguous wording, and have corrected the text:

      Page 11. “Critically, tRSA estimates were submitted to a mixed-effects model which is statistically appropriate for modeling the hierarchical structure of the data, where observations are nested within both subjects and stimuli (Baayen et al., 2008; Chen et al., 2021)”.

      (16) I found the statement that cRSA "cannot model representation at the level of individual trials" confusing, as it made me think, what prohibits one from creating an RDM based on single-trial responses? Later on, I understood that what the authors are trying to say here (I think) is that cRSA cannot weigh the contributions of individual rows/columns to the overall representational strength differently.

      We thank the reviewer for their clarifying language and have added it to this section of the manuscript.

      “Abstract. However, because cRSA cannot weigh the contributions of individual trials (RSM rows/columns), it is fundamentally limited in its ability to assess subject-, stimulus-, and trial-level variances that all influence representation”.

      (17) Why use "RSM" instead of "RDM"? If the pairwise comparison metric is distance-based (e.g., 1-correlation as described by the authors), RDM is more appropriate.

      We apologize for the error, and have clarified the Methods text:

      Page 3-4. First, brain activity responses to a series of N trials are compared against each other (typically using Pearson’s r) to form an N×N representational similarity matrix.

      (18) Figure 2: please write 'Correlation estimate' in the y-axis label rather than 'Estimate'.

      We have edited the label in Figure 2.

      (19) Page 6 'leaving uncertain the directionality of any findings' - I do not follow this argument. Obviously one can generate an RDM or RSM from vector v or vector -v. How does that invalidate drawing conclusions where one e.g., partials out the (dis)similarity in e.g., pleasantness ratings out of another RDM/RSM of interest?

      We agree such an approach does not invalidate the partial method; we have clarified what we mean by “directionality”.

      Page 8. “For instance, even though a univariate random variable 𝑣, such as pleasantness ratings, can be conveniently converted to an RSM using pairwise distance metrics (Weaverdyck et al., 2020), the very same RSM would also be derived from the opposite random variable −𝑣, leaving uncertain the directionality (or if representation is strongest for pleasant or unpleasant items) of any findings with the RSM (see also Bainbridge & Rissman, 2018)”.
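
      The point about directionality can be checked numerically: a similarity matrix built from pairwise absolute differences is unchanged when the ratings are negated. The sketch below assumes negative absolute difference as the similarity measure; the function name and the rating values are hypothetical.

```python
import numpy as np

def similarity_from_ratings(ratings):
    """Build an RSM from a univariate variable as negative pairwise absolute difference."""
    ratings = np.asarray(ratings, dtype=float)
    return -np.abs(ratings[:, None] - ratings[None, :])

# Hypothetical pleasantness ratings for five items.
pleasantness = np.array([1.0, 2.5, 4.0, 4.5, 3.0])
rsm_from_v = similarity_from_ratings(pleasantness)
rsm_from_neg_v = similarity_from_ratings(-pleasantness)

# The two matrices are identical, so the RSM alone cannot reveal whether
# an effect is driven by pleasant or by unpleasant items.
same_rsm = np.array_equal(rsm_from_v, rsm_from_neg_v)
```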

      (20) P7 'sampled 19900 pairs of values from a bi-variate normal distribution', but the rows/columns in an RDM are not independent samples - shouldn't this be included in the simulation? I.e., shouldn't you simulate first the n=200 vectors, and then draw samples from those, as in the next analysis?

      This section has been moved to Appendix 1 (see responses to Reviewer 1.13).

      (21) Under data acquisition, please state explicitly that the paper is re-using data from prior experiments, rather than collecting data anew for validating tRSA.

      We have clarified this in the data acquisition section.

      Page 13. “A pre-existing dataset was analyzed to evaluate tRSA. Main study findings have been reported elsewhere (S. Huang, Bogdan, et al., 2024)”.

      (22) Figure 4 could benefit from some more explanation in-text. It wasn't clear to me, for example, how to interpret the asterisks depicted in the right part of the figure.

      We clarified the meaning of the asterisks in the main text in addition to the existent text in the figure caption.

      Page 26. “see Figure 4, off-diagonal cells in blue; asterisks indicate where tRSA was statistically more sensitive than cRSA)”.

      (23) Page 38 "the outcome of tRSA's improved characterization can be seen in multiple empirical outcomes:" it seems there is one mention of 'outcomes' too many here.

      We have revised this sentence.

      Page 41. “tRSA's improved characterization can be seen in multiple empirical outcomes”.

      (24) Page 38 "model fits became the strongest" it's not clear what aspect of the reported results in the paragraph before this is referring to - the Appendix?

      Yes, the model fits are in the Appendix; we have added this in-text citation.

      Moreover, model-fits became the strongest when the models also incorporated trial-level variables such as fMRI run and reaction time (Appendix 3, Table 6).

      References

      Diedrichsen, J., Berlot, E., Mur, M., Schütt, H. H., Shahbazi, M., & Kriegeskorte, N. (2021). Comparing representational geometries using whitened unbiased-distance-matrix similarity. Neurons, Behavior, Data and Theory, 5(3). https://arxiv.org/abs/2007.02789

      Diedrichsen, J., & Kriegeskorte, N. (2017). Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. PLoS Computational Biology, 13(4), e1005508.

      Diedrichsen, J., Yokoi, A., & Arbuckle, S. A. (2018). Pattern component modeling: A flexible approach for understanding the representational structure of brain activity patterns. NeuroImage, 180, 119-133.

      Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56(2), 400-410.

      Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS Computational Biology, 10(4), e1003553.

      Schütt, H. H., Kipnis, A. D., Diedrichsen, J., & Kriegeskorte, N. (2023). Statistical inference on representational geometries. eLife, 12. https://doi.org/10.7554/eLife.82566

      Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., & Diedrichsen, J. (2016). Reliability of dissimilarity measures for multi-voxel pattern analysis. NeuroImage, 137, 188-200.

      King, M. L., Groen, I. I., Steel, A., Kravitz, D. J., & Baker, C. I. (2019). Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. NeuroImage, 197, 368-382.

      Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., ... & Bandettini, P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6), 1126-1141.

      Sucholutsky, I., Muttenthaler, L., Weller, A., Peng, A., Bobu, A., Kim, B., ... & Griffiths, T. L. (2023). Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3:

      Comments on revised version:

      This revised version is largely improved, and the responses to reviewers' comments are generally relevant. However, the response regarding pre-nodes is not satisfactory. I understand that the authors prefer to avoid further experimentation, but I think this is an important point that needs to be clarified. Exploring stages between E12 and E15 is therefore important. When carefully examining some of the figures (Fig. 1E or 2D), I think that at E15 there may well be pre-node formation prior to myelin deposition, on structures the authors considered to be heminodes. To be convincing, they should use double or triple labeling with, in addition to the nodal proteins (ankG and/or Nav pan), a good myelin marker such as anti-PLP. The rat monoclonal developed by the late Prof. Ikenaka would give sharper staining than the anti-MAG they used. (I assume the clone must still be available in Okazaki.)

      We appreciate your insightful comment regarding the possible presence of pre-nodal clusters along NM axons and your kind suggestion to use the PLP antibody (clone AA3; Yamamura et al., J Neurochem, 1991). We previously obtained this monoclonal antibody from Dr. Kenji Tanaka in Okazaki and confirmed that it works well in chicken tissues. However, since this clone recognizes both PLP and DM-20 isoforms, it labels not only myelin-forming oligodendrocytes (MFOLs) but also newly formed oligodendrocytes (NFOLs) (Yokoyama et al., J Neurochem, 2025). Therefore, it is not ideal for determining whether nodal protein clusters are formed before myelin deposition.

      Instead, we performed double immunostaining for MAG and AnkG between E12 and E15 to clarify the temporal relationship between myelin maturation and node formation. The results showed that detectable AnkG clusters along NM axons began to appear very sparsely around E13, coinciding with the emergence of MAG signals, and became more prominent with development. This temporal pattern does not match the definition of pre-nodal clusters, which are formed prior to myelination.

      Although we cannot completely rule out the possibility of undetectable pre-nodal clusters or those composed of molecules other than AnkG, our results support the view that pre-nodal clusters are unlikely to play a major role in determining the regional difference in nodal spacing along NM axons. These new data have been added as Figure 2—figure supplement 1, and the relevant sections in the Results, Discussion, and Figure legend have been revised accordingly (page 5, line 4; page 10, line 7; page 29, line 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors attempted to clarify the impact of N protein mutations on ribonucleoprotein (RNP) assembly and stability using analytical ultracentrifugation (AUC) and mass photometry (MP). These complementary approaches provide a more comprehensive understanding of the underlying processes. Both SV-AUC and MP results consistently showed enhanced RNP assembly and stability due to N protein mutations.

      The overall research design appears well planned, and the experiments were carefully executed.

      Strengths:

      SV-AUC, performed at higher concentrations (3 µM), captured the hydrodynamic properties of bulk assembled complexes, while MP provided crucial information on dissociation rates and complex lifetimes at nanomolar concentrations. Together, the methods offered detailed insights into association states and dissociation kinetics across a broad concentration range. This represents a thorough application of solution physicochemistry.

      We thank the Reviewer for this positive assessment. 

      Weaknesses:

      Unlike AUC, MP observes only a part of the solution. In MP, bound molecules are accumulated on the glass surface (not dissociated), thus the concentration in solution should change as time develops. How does such concentration change impact the result shown here?

      We agree with the Reviewer that the concentration in solution above the surface will change with time; however, the impact of surface adsorption turns out to be negligible. To show this we have added a calculation as Supplementary Methods that is based on the number of imaged adsorption events, the fraction of imaged area to total surface area, and the initial sample volume and concentration. Under our experimental conditions the reduction is less than 1%, which is well within the range of experimental concentration errors.
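
      The structure of such a depletion estimate can be sketched from the three inputs named above. This is a minimal illustration, not the calculation in the Supplementary Methods: the function name and every numerical value are hypothetical, and the sketch assumes that adsorption is uniform over the wetted surface and that each imaged event removes one particle from solution.

```python
import math

AVOGADRO = 6.02214076e23  # particles per mole

def depleted_fraction(n_events, imaged_area_m2, total_area_m2,
                      volume_l, concentration_molar):
    """Fraction of particles lost from solution to surface adsorption."""
    # Scale the imaged events up to the whole wetted surface.
    adsorbed_total = n_events * (total_area_m2 / imaged_area_m2)
    # Particles initially present in the sample volume.
    in_solution = concentration_molar * volume_l * AVOGADRO
    return adsorbed_total / in_solution

# Hypothetical values chosen only to show the arithmetic.
fraction = depleted_fraction(
    n_events=3000,
    imaged_area_m2=30e-12,                 # 3 um x 10 um field of view
    total_area_m2=math.pi * (1.5e-3) ** 2, # 3 mm diameter well bottom
    volume_l=20e-6,                        # 20 uL sample
    concentration_molar=10e-9,             # 10 nM
)
```

      Because the initial particle count scales with concentration, the same number of adsorption events depletes a much larger fraction of the sample at picomolar than at nanomolar concentrations.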

      This is in line with the observation that surface adsorption of proteins to glass is critical and needs to be prevented when working at picomolar concentrations (Zhao H, Mayer ML, Schuck P. 2014. Analysis of protein interactions with picomolar binding affinity by fluorescence-detected sedimentation velocity. Anal Chem 86:3181–3187. doi:10.1021/ac500093m), but is ordinarily negligible when working in the mid-nanomolar concentration range. The difference in the MP experiments is that surface adsorption to glass and plastic, which is usually invisible, is being imaged and quantified in MP. The negligible impact of surface adsorption on solution concentration in typical MP experiments is also in line with the results of several studies that have successfully measured dissociation constants of binding equilibria by MP (Young G et al., Science 360 (2018) 432; Wu & Piszczek, Anal Biochem 592 (2020) 113575; Soltermann et al., Angewandte Chemie 59 (2020) 10774) with samples in the 5-50 nM range and a similar experimental setup. It should be noted that in the MP experiments no surface functionalization is employed, in contrast to optical biosensors that utilize surface-immobilized ligands and polymeric matrices and thereby enhance the surface binding capacity.

      Even though this depletion effect is negligible under ordinary MP conditions, the Reviewer raises a good point and readers may have a similar question with this novel technique. For this reason, we have added in the MP section of the Methods the sentence “In either configuration, the impact of surface binding on the sample concentration is < 1% and negligible, as described in the Supplementary Methods S1.” and added the detailed calculations in the Supplement accordingly. The use of SV as a traditional, orthogonal technique and the observation of consistent results with those of MP should further dispel readers’ methodological concerns in this point.
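      The depletion estimate described in this response can be sketched as a short calculation. This is only an illustrative sketch, not the Supplementary Methods calculation itself: it assumes adsorption is uniform over the whole wetted surface, and the event count, imaged-area fraction, sample volume, and concentration used in the example are hypothetical placeholders rather than values from the manuscript.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def fractional_depletion(imaged_events, imaged_area_fraction, volume_l, conc_molar):
    """Estimate the fraction of molecules lost from solution to the surface.

    imaged_events        -- adsorption events counted in the field of view
    imaged_area_fraction -- imaged area divided by total wetted surface area
    volume_l             -- initial sample volume in litres
    conc_molar           -- initial sample concentration in mol/L
    """
    # Scale the counted events up to the whole surface (assumes uniform adsorption).
    adsorbed_total = imaged_events / imaged_area_fraction
    # Number of molecules initially present in the sample drop.
    molecules_initial = conc_molar * volume_l * AVOGADRO
    return adsorbed_total / molecules_initial


if __name__ == "__main__":
    # Hypothetical inputs: 5000 events, 0.01% of the surface imaged,
    # 20 microlitre sample at 20 nM.
    frac = fractional_depletion(5_000, 1e-4, 20e-6, 20e-9)
    print(f"fractional depletion: {frac:.3%}")
```

      With placeholder inputs of this order the estimated loss stays far below 1%, which illustrates why the effect can be negligible at nanomolar concentrations; the actual figure depends on the measured event counts and geometry.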

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors apply a variety of biophysical and computational techniques to characterize the effects of mutations in the SARS-CoV-2 N protein on the formation of ribonucleoprotein particles (RNPs). They find convergent evolution in multiple repeated independent mutations strengthening binding interfaces, compensating for other mutations that reduce RNP stability but which enhance viral replication.

      Strengths:

      The authors assay the effects of a variety of mutations found in SARS-CoV-2 variants of concern using a variety of approaches, including biophysical characterization of assembly properties of RNPs, combined with computational prediction of the effects of mutations on molecular structures and interactions. The findings of the paper contribute to our increasing understanding of the principles driving viral self-assembly, and increase the foundation for potential future design of therapeutics such as assembly inhibitors.

      Thank you for highlighting the strengths of our paper and the potential impact on future design of therapeutics.

      Weaknesses:

      For the most part, the paper is well-written, the data presented support the claims made, and the arguments are easy to follow. However, I believe that parts of the presentation could be substantially improved. I found portions of the text to be overly long and verbose and likely could be substantially edited; the use of acronyms and initialisms is pervasive, making parts of the exposition laborious to follow; and portions of the figures are too small and difficult to read/understand.

      We are glad the Reviewer concurs that the data support our conclusions and finds the arguments easy to follow. We appreciate the comment that the work was not optimally presented. To address this point, we have identified multiple opportunities to streamline the text without jeopardizing clarity. We have also rewritten the end of the Introduction.

      As recommended, we have reduced and harmonized the use of acronyms and abbreviations throughout the text to improve readability. Specifically, we have now spelled out nucleic acid (NA), intrinsically disordered regions (IDR), full-length (FL), AlphaFold (AF3), and variants of concern (VOC).

      Finally, we have improved the presentation of most figures, adding labels and new panels, and increased the label font sizes to facilitate more detailed inspections of the data.

      Reviewer #3 (Public Review):

      This manuscript investigates how mutations in the SARS-CoV-2 nucleocapsid protein (N) alter ribonucleoprotein (RNP) assembly, stability, and viral fitness. The authors focus on mutations such as P13L, G214C, and G215C, combining biophysical assays (SV-AUC, mass photometry, CD spectroscopy, EM), VLP formation, and reverse genetics. They propose that SARS-CoV-2 exploits "fuzzy complex" principles, where distributed weak interfaces in disordered regions allow both stability and plasticity, with measurable consequences for viral replication.

      Strengths:

      (1) The paper demonstrates a comprehensive integration of structural biophysics, peptide/protein assays, VLP systems, and reverse genetics.

      (2) Identification of both de novo (P13L) and stabilizing (G214C/G215C) interfaces provides a mechanistic insight into RNP formation.

      (3) Strong application of the "fuzzy complex" framework to viral assembly, showing how weak/disordered interactions support evolvability, is a significant conceptual advance in viral capsid assembly.

      (4) Overall, the study provides a mechanistic context for mutations that have arisen in major SARS-CoV-2 variants (Omicron, Delta, Lambda) and a mechanistic basis for how mutations influence phenotype via altered biomolecular interactions.

      We are grateful for these comments highlighting this work as a significant conceptual advance.

      Weaknesses:

      (1) The arrangement of N dimers around LRS helices is presented in Figure 1C, but the text concedes that "the arrangement sketched in Figure 1C is not unique" (lines 144-146) and that AF3 modeling attempts yielded "only inconsistent results" (line 149).

      The authors should therefore present the models more cautiously as hypotheses instead. Additional alternative arrangements should be included in the Supplementary Information, so the readers do not over-interpret a single schematic model.

      We agree that in the absence of high-resolution structures the RNP models are hypothetical, and have now emphasized this in the Results, following the Reviewer’s recommendation. To present alternative arrangements that satisfy the biophysical constraints upfront, we have promoted the previous Supplementary Figure 11 showing different models to the first Supplementary Figure, and expanded it with examples of different oligomers. In this way it is referenced early on in the Results and in the legend to Figure 1C. We agree this strengthens the manuscript, as one of the take-home messages is the inherent polydispersity of the RNPs.

      The fact that AF3 can only provide inconsistent results will not come as a surprise, given the substantial disordered regions of the complex, and is a drawback of AF3 rather than our structural model. We slightly emphasized this point so as to clarify that the presentation of the AF3-based RNP structure serves solely as supporting evidence that our hypothetical model is sterically reasonable.

      The new Results paragraph reads:

      “As suggested in the cartoon of Figure 1C, this supports the hypothesis of a three-dimensional arrangement with a central LRS oligomer with symmetry properties and dimensions similar to low resolution EM images of model RNPs (Carlson et al., 2022, 2020) and cryo-ET of RNPs in virions (Klein et al., 2020; Yao et al., 2020).  It should be noted, however, that the arrangement sketched in Figure 1C is not unique and other subunit orientations could be envisioned that satisfy all constraints from experimentally observed binding interfaces, including different oligomers and anti-parallel subunits as illustrated in Supplementary Figure S1. Extending previous ColabFold structural predictions that show multiple N-protein dimers self-assembled via the LRS coiled-coils (Zhao et al., 2023), we attempted the AlphaFold modeling of RNPs combining multiple N dimers with SL7 RNA ligands, mimicking our biophysical assembly model. Current AlphaFold restrictions limit the prediction to pentamers of N-protein dimers with 10 copies of SL7 RNA. While only inconsistent results were obtained – which is not surprising given the large intrinsically disordered regions exceed the predictive power of AlphaFold – some models did produce an overall RNP organization similar to Figure 1C, suggesting such an arrangement is at least sterically reasonable with regard to possible N-protein subunit orientations in an RNP (Supplementary Figure S2)”

      (2) Negative-stained EM fibrils (Figure 2A) and CD spectra (Figure 2B) are presented to argue that P13L promotes β-sheet self-association. However, the claim could benefit from more orthogonal validation of β-sheet self-association. Additional confirmation via FTIR spectra or ThT fluorescence could be used to further distinguish structured β-sheets from amorphous aggregation.

      We completely agree that the application of multiple orthogonal biophysical methods can strengthen the conclusions. In addition to EM fibrils and CD spectra (a classical gold standard technique for protein secondary structure in solution), we already have support from ColabFold modeling, as well as NMR results from the Zweckstetter lab showing the potential for β-sheet-like conformations.

      Furthermore, we believe the evidence for the absence of ‘amorphous aggregates’ is very strong, as this would be inconsistent with the long-range order required to create the visibly fibrillar morphology in EM, and amorphous aggregates would be inconsistent with the increased solution viscosity. In this context, it is also highly relevant that the β-sheet-like secondary structure recorded by CD is concentration-dependent and reversible upon dilution. The long-range spatial order of fibrils is consistent with the formation of secondary structure in solution.

      In addition, it must be kept in mind that what we see is specific to N-arm peptides carrying the P13L mutation (in EM, CD, and structural prediction) and does not occur in the other two N-arm peptides (ancestral N-arm and N-arm with deletion of 31-33), linker peptides, or C-arm peptides.

      Most importantly, as elaborated in more detail below, we do not claim that fibril formation is physiologically relevant. At the heart of this – in the context of the evolution of fuzzy complexes – is that the P13L mutation creates additional weak protein-protein interactions. Indeed, the assembly of fibrils geometrically requires at least two interfaces for each subunit. These weak interactions are at play physiologically in the context of the disordered RNP particles, and in macromolecular condensates, but not in the formation of fibrils. Therefore, while we appreciate the suggestion of FTIR spectra and ThT staining, we are afraid further emphasis on the fibril structure might confuse the reader, and therefore we would rather clarify upfront that these fibrillar assemblies are not thought to form in vivo from full-length protein, but merely demonstrate the presence of N-arm self-association interfaces in the model of truncated peptides.

      Accordingly, we have amended the Results paragraph reporting the fibrils:

      “Thus, the N-arm mutation P13L is responsible for the formation of fibrils in N-arm peptides after prolonged storage. Some of these N-arm fibrils exhibit a twisted morphology with width of ≈5 nm (Figure 2A), in some instances exhibiting patterns of strand breaks. Such fibrils are frequently encountered in proteins that can stack β-sheets, such as in amyloids (Paravastu et al., 2008). While we have not observed fibril formation in the context of full-length N, and have no evidence such fibrils are physiologically relevant, their occurrence in solutions of truncated N-arm peptide nonetheless demonstrates the introduction of ordered N-arm self-association interfaces in conformations of P13L mutants.”

      And more completely summarized experimental evidence prior to describing the ColabFold prediction results (which previously did not include mention of the NMR):

      “Finally, confirming the interpretation of the EM images and the CD data, as well as the β-structure propensity reported from NMR data (Zachrdla et al., 2022), the structural prediction of N[10-20]:P13L in ColabFold displayed oligomers with stacking β-sheets …”

      (3) In the main text, the authors alternate between emphasizing non-covalent effects ("a major effect of the cysteines already arises in reduced conditions without any covalent bonds," line 576) and highlighting "oxidized tetrameric N-proteins of N:G214C and N:G215C can be incorporated into RNPs". Therefore, the biological relevance of disulfide redox chemistry in viral assembly in vivo remains unclear. Discussing cellular redox plausibility and whether the authors' oxidizing conditions are meant as a mechanistic stress test rather than physiological mimicry could improve the interpretation of these results.

      The paper could benefit if the authors provide a summary figure or table contrasting reduced vs. oxidized conditions for G214C/G215C mutants (self-association, oligomerization state, RNP stability). Explicitly discuss whether disulfides are likely to form in infected cells.

      We thank the Reviewer for raising this most interesting point. The reason why the biological relevance of N disulfides remains unclear is simply that this is still unknown, unfortunately. Recently, Kubinski et al. have strongly argued for the formation of disulfides in infected cells, but in our view the evidence remains weak since the majority of disulfide bonds in that work presented as post-lysis artifacts, and it appears the non-covalent effects alone could explain the physiological observations. We aimed for a balanced presentation and wrote in the relevant Results section:

      “Covalent disulfide bonds in the LRS in non-reducing conditions were found to further promote LRS oligomerization. However, there is no conclusive data yet whether covalent bonds in the LRS occur in vivo, or any G215C effect is entirely non-covalent due to the significant strengthening of LRS helix oligomerization (see Discussion).”

      Despite the uncertainty regarding physiological disulfide bond formation, we believe it is useful to ask whether covalently crosslinked N dimers would aid or constrain RNP assembly in our biophysical model. We have now better explained this motivation in the Results section describing the RNP experiments:

      “Even though it is still unclear whether disulfide bonds of N cysteine mutants form in vivo, we were curious about the impact of disulfide-linked oligomers of the cysteine mutants on their RNP structure and stability in our biophysical assembly model.”

      The referenced paragraph from the Discussion reads:

      “Regarding the cysteine mutations that have been repeatedly introduced in the LRS prior to the rise of the Omicron VOCs, it is an open question whether they lead to covalent bonds in vivo or in the VLP assay. While examples of disulfide-linked viral nucleocapsid proteins have been reported (Kubinski et al., 2024; Prokudina et al., 2004; Wootton and Yoo, 2003), a methodological difficulty in their detection is artifactual disulfide bond formation post-lysis of infected cells (Kubinski et al., 2024; Wootton and Yoo, 2003).  However, our results clearly show that a major effect of the cysteines already arises in reduced conditions without any covalent bonds, through extension of the LRS helices, and concomitant redirection of the disordered N-terminal sequence. While oxidized tetrameric N-proteins of N:G214C and N:G215C can be incorporated into RNPs, the covalent bonds provided only marginally improved RNP stability.  Interestingly, the introduction of cysteines imposes preferences of RNP oligomeric states dependent on oxidation state, consistent with our MD simulations highlighting the impact of cysteine orientation of 214C versus 215C relative to the hydrophobic surface of the LRS helices. Overall, considering potentially detrimental structural constraints from covalent bonds on LRS clusters seeding RNPs, energetic penalties on RNP disassembly, as well as the required monomeric state of the LRS helix for interaction with the NSP3 Ubl domain (Bessa et al., 2022), at present it is unclear to what extent the formation of disulfide linkages between LRS helices would be beneficial or detrimental in the viral life cycle.”

      We feel that this text addresses the Reviewer’s comment, and that expanding the existing discussion further would conflict with other recommendations to shorten and focus the text.

      Finally, we have addressed the valuable suggestion of a new table summarizing the oligomeric state and self-association of the different cysteine mutants by inserting a new column in the existing Table 1 reporting all species’ oligomeric state at low micromolar concentrations. In this way they can be compared at a glance with the other mutants as well. A more detailed comparison of the concentration-dependent size-distribution is provided in Figure 4.

      (4) VLP assays (Figure 7) show little enhancement for P13L or G215C alone, whereas Figure 8 shows that P13L provides clear fitness advantages. This discrepancy is acknowledged but not reconciled with any mechanistic or systematic rationale. The authors should consider emphasizing the limitations of VLP assays and the sources of the discrepancy with respect to Figure 8.

      We thank the Reviewer for this comment, which highlights a very important point. 

      For clarification and to improve the cohesion of the manuscript we have inserted a reference to the Discussion after the presentation of the VLP results, which provides a natural transition to the following description of the reverse genetics experiments:

      “As expanded on in the Discussion, the failure to observe enhancement by P13L alone may be related to limitations of the VLP assay in sensitivity, including the restriction to a single round of infection, and protein expression levels.”

      This references a paragraph in the Discussion about the limitations of the VLP assay in general and the reasons we believe the enhancement by P13L alone was not picked up:

      “…While this assay has been widely used for rapid assessment of spike protein and N variants (Syed et al., 2021), it has limitations due to the addition of non-genomic RNA and the lack of double membrane vesicles from which gRNA emerges through the NSP3/NSP4 pore complex potentially poised for packaging (Bessa et al., 2022; Ke et al., 2024; Ni et al., 2023). It should also be recognized that the results do not directly reflect the relative efficiency of RNP assembly only, since protein expression levels, their localization, and their posttranslational modifications are not controlled for. Susceptibility for such factors might be exacerbated with mutations that modulate weak protein interactions. For example, as shown previously (Syed et al., 2024; Zhao et al., 2024), a GSK3 inhibitor inhibiting N-protein phosphorylation significantly enhances VLP formation and eliminates the advantage provided for by the N:G215C mutation relative to the ancestral N – presumably due to an increase in assembly-competent, non-phosphorylated N-protein erasing an affinity advantage. A similar process may be underlying the absent or marginal improvement in VLP readout from the cysteine LRS mutants and P13L at the achieved transfection level in the present work, and the enhanced signal from R203K/G204R and R203M (the latter being consistent with previous reports (Li et al., 2025; Syed et al., 2021)) modulating protein phosphorylation. Nonetheless, mirroring the results of the biophysical in vitro experiments, the addition of RNP-stabilizing P13L and G214C mutations on top of R203K/G204R led to a significantly larger VLP signal.

      The VLP assay may be limited in sensitivity to mutation effects due to its restriction to a single round of infection. To avoid this and other potential limitations of the VLP assay for the study of viral packaging, for the key mutation N:P13L we carried out reverse genetics experiments. These showed the sole N:P13L mutation significantly increases viral fitness (Figure 8).”

      (5) Figures 5 and 6 are dense, and the several overlays make it hard to read. The authors should consider picking the most extreme results to make a point in the main Figure 5 and move the other overlays to the Supplementary. Additionally, annotating MP peaks directly with "2×, 4×, 6× subunits" can help non-experts.

      We completely agree with the Reviewer – these figures were very dense. To mitigate this problem without requiring the reader to switch back and forth to the supplement, we subdivided the panels of Figure 5 and showed only a subset of curves in each. In this way the data are easier to read while still being readily comparable. It is a large figure, but it contains the key data for the present work and is therefore worthwhile to have in one place. For the MP histogram data we have also inserted the suggested peak labels. Similarly, we have split Figure 6A into two panels for clarity.

      (6) The paper has several names and shorthand notations for the mutants, making it hard to keep up. The authors could include a table that contains mutation keys, with each shorthand (Ancestral, Nο/No, Nλ, etc.) mapped onto exact N mutations (P13L, Δ31-33, R203K/G204R, G214C/G215C, etc.). They could then use the same glyphs (Latin vs Greek) consistently in text and figure labels.

      Yes, we agree this is a problem and we apologize for the confusion. However, it is not possible to refer exclusively to either Latin or Greek terminology, which we feel would be even more detrimental to readability (the former being exhaustively lengthy and the latter being imprecise). Instead, we have used a rational system: if the complete set of mutations of a variant is present, then its Greek letter is used as an abbreviation; otherwise we use Latin amino acid/position indicators for individual mutations or combinations thereof. Unfortunately, we previously failed to mention this explicitly, and we are most grateful to the Reviewer for pointing this out.

      We have now rectified this by including upfront the sentence:

      “We will adopt a nomenclature where the complete set of defining mutations of a variant will be referred to by its Greek letter, i.e., N:P13L/R203K/G204R/G214C is N<sub>λ</sub>, and analogously the set of Omicron mutations N:P13L/Δ31-33/R203K/G204R is referred to as N<sub>ο</sub>; see Table 1”

      This will define the two shorthands N<sub>λ</sub> and N<sub>ο</sub> used. Furthermore, as suggested and pointed to in the text, Table 1 does provide the keys to mutation and variants, including the information in which variant any of the other mutations studied here occur.

      (7) The EM fibrils (Figure 2A) and CD spectra (Figure 2B) were collected at mM peptide concentrations. These are far above physiological levels and may encourage non-specific aggregation. Similarly, the authors mention "ultra-weak binding energies that require mM concentrations to significantly populate oligomers". On the other hand, the experiments with full-length protein were performed at concentrations closer to biologically relevant concentrations in the micromolar range. While I appreciate the need to work at high concentrations to detect weak interactions, this raises questions about physiological relevance.

      This is indeed an important point to clarify. We agree that much lower nucleocapsid protein concentrations are present in the cytosol on average, and these were used in our RNP assembly experiments. However, there are at least two important physiologically relevant cases where high local N concentrations do occur:

      (1) Once assembled in RNPs, the disordered N-terminal extensions are locally at a very high concentration within the volume they can explore while tethered to the NTD. A back-of-the-envelope calculation assuming 12 N-protein subunits confining 12 N-terminal extensions to the volume of a single RNP (≈14x14x14 nm<sup>3</sup> by cryoEM; Klein et al 2020) leads to an effective concentration of 7.4 mM. Obviously the N-arm peptides are not completely free and there will be constraints that would hinder or promote encounter complex probability, but interfaces with mM Kd are clearly strong enough to populate Narm-Narm contacts extending from N-protein in the RNP.

      Additionally, any interaction where N-proteins are brought in close proximity could allow weak N-arm interactions to provide additional stability. Besides the RNP, we demonstrate this in our Results for nucleic-acid liganded N tetramers (Figure 4B), but this might similarly occur in complexes with NSP3 or host proteins. Generally, it is quite common that small additional binding energies play important roles in the modulation of multivalent protein complexes.

      (2) Within the macromolecular condensate the local concentration will be substantially higher than on average within the infected cell.  While we do not know its precise concentration, it is well-established that the sum of many ultra-weak interactions is driving the formation of this dense liquid phase. In our previous eLife paper (Nguyen et al., 2024) we have shown LLPS is suppressed with the R203K/G204R mutation, but it is ‘rescued’ with the additional P13L/del31-33 mutation of the Omicron variant showing strong LLPS. Similarly, LLPS is suppressed by the LRS mutant L222P, but rescued in conjunction with P13L. This is another biologically relevant scenario where weak interactions are critical.
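      The back-of-the-envelope estimate in point (1) can be reproduced with a few lines of arithmetic. This is a minimal sketch under the assumptions stated in the text (12 N-arm chains confined to a cube of about 14 nm edge length); the function name and the treatment of the RNP volume as an exact cube are illustrative choices, not part of the manuscript.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def local_concentration_mM(n_chains, edge_nm):
    """Effective molar concentration (in mM) of n_chains confined to a cube.

    n_chains -- number of tethered chains sharing the volume
    edge_nm  -- edge length of the confining cube in nanometres
    """
    edge_m = edge_nm * 1e-9
    volume_l = edge_m ** 3 * 1e3       # 1 m^3 = 1000 L
    moles = n_chains / AVOGADRO
    return moles / volume_l * 1e3      # mol/L -> mmol/L


if __name__ == "__main__":
    # 12 N-arm extensions within a ~14 x 14 x 14 nm^3 RNP volume.
    print(f"{local_concentration_mM(12, 14.0):.1f} mM")
```

      With an edge of exactly 14 nm this gives roughly 7.3 mM, in line with the ≈7.4 mM quoted in the text given that the RNP dimension is itself approximate. Because the result scales with the inverse cube of the edge length, small changes in the assumed dimension shift the estimate noticeably, but it remains in the millimolar range.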

      We have emphasized these points in the revised manuscript as described below.

      Specifically:

      (a) Could some of the fibril/β-sheet features attributed to P13L (Figure 2A-C) reflect non-specific aggregation at high concentrations rather than bona fide self-association motifs that could play out in biologically relevant scenarios?

      We understand this concern from the experience with proteins that often have limited solubility and tendencies to aggregate, sometimes accompanied by unfolding and driven by hydrophobic interactions, or clustering on the path to LLPS. However, we are struggling to reconcile the picture of non-specific aggregation with the context of our P13L N-arm peptides. The term ‘non-specific aggregation’ implies the idea of amorphous aggregates, which we would contend is inconsistent with the observed geometry of fibrils, which exhibit long-range order. In addition, non-specific aggregation does not lead to increased solution viscosity, which we describe, but fibril formation does. Another connotation of ‘aggregates’ is irreversibility.  However, we find the beta-sheet-like conformation seen at 1 mM becomes significantly more disordered when the same sample is diluted to 0.4 mM peptide. This is consistent with a reversible self-association driven by a conformational change toward ordered secondary structure.

      To highlight the reversibility, we have clarified the description: “Interestingly, diluting the 1 mM sample (solid) to a concentration of 0.4 mM (dashed) reveals a large shift in the far-UV spectra … both indicative of a significant increase of disorder upon dilution. This is consistent with the stabilization of β-sheets in a reversible, strongly cooperative self-association process with an effective K<sub>D</sub> in the high µM to low mM range.”

      We have also inserted a concentration conversion to mg/ml units, which shows even 1 mM of peptides is only ~5 mg/ml, i.e. not excessively high. “While the ancestral N-arm at ≈1 mM (≈4.6 mg/ml) concentrations exhibits CD spectra with a minimum at ≈200 nm typical of disordered conformations (black)”

      With regard to the question of specificity, we have studied similar N-arm peptides without P13L mutations and with the 31-33 deletion under equivalent conditions. But we observe the reversible self-association, conformational change, and fibril formation only for those containing the P13L mutation, consistent with ColabFold predictions. Neither did we observe fibrils with disordered C-arm peptides.

      How these weak self-association motifs in the N-arm can be physiologically relevant in the context of full-length protein modulating the stability of multi-molecular complexes and enhancing LLPS was outlined above, and further clarified in the manuscript as detailed below.

      (b) How do the authors justify extrapolating from the mM-range peptide behaviors to the crowded but far lower effective concentrations in cells?

      As pointed out above, the key to this question is the local preconcentration as the N-arm peptides are tethered to the rest of protein in the context of flexible multi-molecular assemblies. Another mechanism to consider is the formation of condensates. The response to the next comment will expand on this.

      The authors should consider adding a dedicated section (either in Methods or Discussion) justifying the use of high concentrations, with estimation of local concentrations in RNPs and how they compare to the in vitro ranges used here. For concentration-dependent phenomena discussed here, it is vital to ensure that the findings are not artefacts of non-physiological peptide aggregation.

      The use of high concentration in biophysical experiments is quite common, for example, in NMR or crystallography, insofar as they elucidate molecular properties. We believe this is obvious; the Reviewer will certainly agree with us, and this does not require further elaboration. The property observed in this case is the existence of specific, weak protein self-association interfaces in the N-arm.

      Our response to the Reviewer’s point 7(a) addresses the distinction between artefactual aggregation and self-association of N-arm peptides. The relevance of these weak protein self-association interfaces in the context of the full-length protein is the second underlying question.

      As we have previously stated in a dedicated Results paragraph:

      “In contrast to the modulation of the coiled-coil LRS interfaces, the de novo creation of the N-arm self-association interface through beta-sheet interactions enabled by P13L cannot be readily observed in full-length N-protein at low µM concentrations. Similar to the ancestral LRS interface, it provides only ultra-weak binding energies that require mM concentrations to significantly populate oligomers. This is fully consistent with the previous observation by SV-AUC that neither N:P13L,Δ31-33 nor N<sub>o</sub> with the full set of Omicron mutations show any significant higher-order self-association at low µM concentrations, whereas at high local concentrations – as observed in phase-separated droplets – they can modulate and cooperatively enhance self-association processes (Nguyen et al., 2024). (In fact, P13L can substitute for the LRS promoting LLPS, as observed in the rescue of LLPS by N:P13L,Δ31-33/L222P mutants whereas N:L222P LRS-abrogating mutants are deficient in LLPS.) Another process that increases the local concentration of N-arm chains is the tetramerization of full-length N-protein. As described earlier, occupancy of the NA-binding site in the NTD allosterically promotes self-assembly of the LRS into higher oligomers (Zhao et al., 2021). We hypothesized that these oligomers may be cooperatively stabilized by additional N-arm interactions in P13L mutants.”

      To state unambiguously why weak interfaces are important, we have followed the Reviewer’s suggestion and added a further clarification earlier in the text, at the end of the P13L Results section:

      “While this self-association interface in the P13L N-arm is weak and its direct observation in biophysical experiments requires mM concentrations, which far exceed average intracellular concentration of N, such  weak interactions can become highly relevant physiologically when high local concentrations are prevailing, for example, when the disordered extension is preconcentrated while tethered within macromolecular assemblies as in the RNP, or in macromolecular condensates.”

      Furthermore, we have added early in the Discussion:

      “Even though the solution affinity of the N-arm P13L interface is ultra-weak, the average local concentration of N-arm chains across the RNP volume (in a back-of-the-envelope calculation assuming a ≈14 nm cube (Klein et al., 2020) with a dodecameric N cluster) is ≈7.4 mM, such that disordered N-arm peptides could well create populations of N-arm clusters stabilizing RNPs through this interface.  However, besides the RNP-stabilizing mutants we have also observed unexpected RNP destabilization by the ubiquitous R203K/G204R double mutation, which may be caused by the introduction of additional charges close to the self-association interface in the LRS. In our experiments, this destabilization is more than compensated for by the P13L mutation. (Another scenario where ultra-weak interactions can have a critical impact is in molecular condensates. We previously reported the suppression of LLPS by the R203K/G204R mutation, which is rescued by the additional P13L/Δ31-33 mutation (Nguyen et al., 2024). This is consistent with compensatory weak stabilizing and destabilizing impacts of weak interactions on the RNP observed here.)”
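The back-of-the-envelope estimate quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes exactly twelve N-arm chains confined to a cube with a 14 nm edge, and the function name is hypothetical. Under these assumptions the result is roughly 7 mM, the same order as the ≈7.4 mM quoted; the exact figure depends on the edge length assumed.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def local_concentration_mM(n_chains: int, edge_nm: float) -> float:
    """Concentration (mM) of n_chains confined to a cube of edge edge_nm."""
    volume_litres = (edge_nm * 1e-9) ** 3 * 1e3  # m^3 converted to litres
    moles = n_chains / AVOGADRO
    return moles / volume_litres * 1e3  # mol/L converted to mmol/L


# Assumed inputs: dodecameric cluster (12 chains) in a 14 nm cube
c_local = local_concentration_mM(n_chains=12, edge_nm=14.0)
print(f"local N-arm concentration: {c_local:.1f} mM")
```

Because the concentration scales with the inverse cube of the edge length, a change of only a few percent in the assumed RNP dimension shifts the estimate by several tenths of a mM.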

      Reviewer #1 (Recommendations for the Authors):

      In Figure 1B, it is unclear what the orange lines connecting polypeptides represent, as well as the zig-zag orange lines in the N-arm.

      We thank the Reviewer for this comment. We intended this to represent regions of self-association but recognize the patterned background is confusing. We have changed this now to solid-colored backgrounds, and indicated this in the figure legend:

      “Regions of self-association are indicated by shaded backgrounds.”

      Regarding presentation, in Figure 5 (MP), the relationship between mass and oligomer size should be shown more clearly.

      We agree. To this end we have labeled the peaks in the MP histograms in Figure 5 with the oligomeric state of the 2N/2SL7 subunits.

      Reviewer #2 (Recommendations for the Authors):

      I find the science of the paper to be convincing and compellingly supported.

      Thank you for this positive statement.

      My primary complaints are with presentation or minor technical questions that, honestly, primarily arise due to my own ignorance and unfamiliarity with some of the techniques employed.

      My primary issue is with the figures. I find, generally, the text in axes labels, ticks, and legends to be too small to comfortably read. This is particularly true in the CD spectra and other data presented in Figures 1D, 2B, 4, 5, 6, and 8.

      We agree and have increased the font size of all text and labels of the plots in Figures 1, 2, 4, 5, 6, and 8.

      I also found the use of initialisms to be a bit overbearing and inconsistent. For example, the authors repeatedly switch between spelling out "nucleic acid" and the initialism "NA" (which is also never explicitly spelled out in the text). With the already substantial length of the text, my own personal opinion would be to suggest spelling out all initialisms in the interest of making the reading easier.

      This is a valid criticism. To improve the readability, we have followed this advice and systematically spelled out “nucleic acid” instead of using “NA”.  Similarly, we have now written out full-length instead of the abbreviation FL, and omitted the abbreviation IDR for intrinsically disordered regions, as well as VOC for variant of concern, and AF3 for AlphaFold.

      Regarding the reference to mutants, we have now explained upfront the system of Latin and Greek nomenclature we consistently applied.

      “We will adopt a nomenclature where the complete set of defining mutations of a variant will be referred to by its Greek letter, i.e., N:P13L/R203K/G204R/G214C is N<sub>λ</sub>, and analogously the set of Omicron mutations N:P13L/Δ31-33/R203K/G204R are referred to as N<sub>ο</sub>; see Table 1”

      I found the text to be verbose, bordering on overly so; the Introduction is more than two pages long. The section "Enhanced oligomerization of the leucine-rich sequence through cysteine mutations" has two long paragraphs of introduction before the present results are discussed, et cetera. An (admittedly, very rough) estimation of the length of the paper places it at ~9,000-10,000 words long, and I think that the presentation might benefit from significant editing and shortening.

      We agree the manuscript is longer than would be desirable, and we generally prefer not to insert mini-introductions into Results sections. On the other hand, in order to make a solid contribution to understanding the big picture of fuzzy complexes in molecular evolution of RNA virus proteins it is indispensable to go into the details of RNP assembly and several of the interfaces. Therefore, we feel the length is in the range that it needs to be without losing clarity. In addition, other Reviewer suggestions to extend the discussion, for example, of limitations of VLP assays and the in vivo state of cysteines, conflict with significant shortening.

      In the particular case of the cysteine mutations, cited by the Reviewer, we believe it is important to add detailed background on G215C, because the Results proceed in a comparison of the self-association mode between G215C and G214C. This is of significant interest in the present context not only for the independent introduction of interface-enhancing mutations highlighting the evolution of fuzzy complexes, but also because it illustrates the pleomorphic ability of RNPs.

      Nonetheless, we have slightly shortened this text and merged the background into a single paragraph. More generally, we have critically reread the text to remove tangential sentences where possible and to make it more concise.

      I have a few more specific comments.

      In Figure 1A, I suggest explicitly labeling the location of the LRS, as it comes up repeatedly.

      Yes, we thank the Reviewer for this suggestion and have introduced this label in Figure 1A.

      In Figure 1B, the legend indicates that the red lines indicate "new inter-dimer interactions." However, these red lines are overlayed on a vertical stripe of red squiggles; it is unclear to me and not explicitly described in the legend what these squiggles are meant to illustrate.

      We agree this background was confusing. As mentioned in our Response to Reviewer #1 we have replaced the structured background with a solid background and explained in the figure legend that these areas depict regions of self-association.

      On lines 44-45, the authors state, "The IDRs amount to 45%, ..." 45% of what?

      Thank you, this was unclear.  We have now clarified “The IDRs amount to ≈45% of total residues”

      In lines 244 - 246, the authors compare the sizes of complexes in reducing versus non-reducing conditions as measured by dynamic light scattering, stating, "However, dynamic light scattering (DLS) revealed the presence of N210-246:G214C complexes with hydrodynamic radii ranging from 6 to 40 nm (in comparison to 1-2 nm for N210-246:G215C (Zhao et al., 2022)) in reducing conditions, and slightly larger in non-reducing conditions (Supplementary Figure S4)." Using this single statistic seems to me to be a less-than-ideal way of characterizing what seems to me to be happening here. In Supplementary Figure 4, it appears to me that what is happening is that in non-reduced conditions, the sample is monodisperse, whereas in reducing conditions, the distribution becomes polydisperse/bimodal, with two clearly separate populations. I feel that this could use a more thorough description rather than just stating the overall range of particle sizes.

      Yes, the Reviewer is correct – it is indeed a good idea to be more precise here. To this end we have carried out cumulant analyses on the autocorrelation functions, as a time-honored method to quantify the polydispersity.  Both samples are polydisperse, but more so in reducing conditions. We have now added “For N210-246:G214C a cumulant analysis results in radii of 8.8 nm and 10.6 nm and polydispersity indices of 0.40 and 0.35 for reducing and non-reducing conditions, respectively”
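For readers unfamiliar with cumulant analysis, the following minimal sketch shows the general idea; it is not the authors' analysis software. It assumes the standard second-order cumulant expansion, ln g1(τ) = const − Γτ + (μ2/2)τ², with the polydispersity index defined as μ2/Γ² and the radius obtained from the Stokes-Einstein relation. The instrument parameters (wavelength, scattering angle, viscosity) and all function names are hypothetical, and the data are synthetic.

```python
import math


def fit_quadratic(x, y):
    """Least-squares fit y = c0 + c1*x + c2*x**2 via the 3x3 normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(3):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [u - f * v for u, v in zip(a[row], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def cumulant_fit(tau, g2_minus_1):
    """Return (mean decay rate Gamma in 1/s, polydispersity index mu2/Gamma^2)."""
    scale = max(tau)  # rescale lag times to keep the fit well conditioned
    x = [t / scale for t in tau]
    y = [0.5 * math.log(g) for g in g2_minus_1]  # g2 - 1 = beta * g1^2
    _, c1, c2 = fit_quadratic(x, y)
    gamma = -c1 / scale
    mu2 = 2.0 * c2 / scale ** 2
    return gamma, mu2 / gamma ** 2


def hydrodynamic_radius(gamma, wavelength_m, angle_deg, n_medium, temp_k, viscosity):
    """Stokes-Einstein radius (m) from the mean decay rate Gamma = D * q^2."""
    q = 4 * math.pi * n_medium / wavelength_m * math.sin(math.radians(angle_deg) / 2)
    diffusion = gamma / q ** 2
    return 1.380649e-23 * temp_k / (6 * math.pi * viscosity * diffusion)


# Synthetic correlation data with a known decay rate and polydispersity index
true_gamma, true_pdi = 2.0e4, 0.40
mu2_true = true_pdi * true_gamma ** 2
tau = [i * 1e-6 for i in range(1, 51)]
g2m1 = [0.9 * math.exp(2 * (-true_gamma * t + 0.5 * mu2_true * t * t)) for t in tau]

gamma, pdi = cumulant_fit(tau, g2m1)
# Hypothetical instrument settings: 633 nm laser, 173 degree detection, water at 298 K
radius_m = hydrodynamic_radius(gamma, 633e-9, 173.0, 1.33, 298.15, 0.89e-3)
print(f"Gamma = {gamma:.3g} 1/s, PDI = {pdi:.2f}, R_h = {radius_m * 1e9:.1f} nm")
```

In practice the fit is usually restricted to the early part of the decay, where the truncated expansion is a reasonable approximation; a larger polydispersity index indicates a broader distribution of decay rates.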

      Finally, I have one remaining comment that is a result of my own inexperience with circular dichroism and interpreting the spectra. For me personally, I would appreciate a more thorough description/illustration of the statistics involved in the CD spectra, but perhaps this is not necessary for people who are more familiar with interpreting these kinds of data. For example, in Figure 1D, it is not clear to me what the error bars/confidence intervals for the CD data look like. I see many squiggles, some of which the authors claim are significant (e.g., the differences between ~215 - 230 nm), and others are not worthy of comment. Let's say, for example, that I fit a smoothed spline through these data and then measure the magnitude of the fluctuations from that spline to define/quantify confidence intervals. What does that distribution look like? Or maybe the confidence intervals are so small that all squiggles are significant?

      Thank you, this is a good question. As mentioned in the methods section, the CD spectra shown are averages of triplicate scans. Therefore, it is straightforward to extract the standard deviation at each wavelength from the three measurements (although a spline would probably work just as well). The values are what one would expect for the squiggles to be random noise. In the region 215 – 220 nm characteristic for helical secondary structure the standard deviations are small relative to the separation between curves, which indicates that the differences are highly significant. Naturally, the curves do overlap in other spectral regions, which would make a plot including the wavelength-dependent error bars or confidence bands too crowded. Therefore, we have kept the plot of the averaged triplicate scans, but have now provided the average standard deviations for all species in the figure legend and mentioned their significant separation:

      “Triplicate scans yield average standard deviations of 0.13 (N), 0.17 (N+SL7), 0.16 (N<sub>λ</sub>), and 0.21 (N<sub>λ</sub> +SL7) 10<sup>3</sup> deg cm<sup>2</sup>/dmol, respectively, with non-overlapping confidence bands for the different species, for example, between 215-220 nm.”
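The per-wavelength statistics described in this response can be sketched as follows. This is only an illustration with invented numbers (not data from the study): it computes the sample standard deviation across replicate scans at each wavelength and checks whether simple mean ± k·SD bands of two species overlap. The band definition, the factor k, and the function names are assumptions; the exact definition of the confidence bands is not given in the response.

```python
from statistics import mean, stdev


def per_wavelength_sd(scans):
    """Sample standard deviation at each wavelength across replicate scans."""
    return [stdev(values) for values in zip(*scans)]


def bands_separated(scans_a, scans_b, k=2.0):
    """For each wavelength, True if the mean +/- k*SD bands do not overlap."""
    result = []
    for a, b in zip(zip(*scans_a), zip(*scans_b)):
        gap = abs(mean(a) - mean(b))
        result.append(gap > k * (stdev(a) + stdev(b)))
    return result


# Invented triplicate scans for two species at 215-220 nm
# (values in units of 10^3 deg cm^2/dmol)
species_a = [
    [-10.0, -10.3, -10.6, -10.8, -10.9, -11.0],
    [-10.2, -10.4, -10.5, -10.9, -11.1, -11.1],
    [-9.9, -10.2, -10.7, -10.7, -11.0, -10.9],
]
species_b = [
    [-12.1, -12.4, -12.5, -12.9, -13.0, -13.1],
    [-12.0, -12.2, -12.7, -12.8, -12.9, -13.0],
    [-11.9, -12.3, -12.6, -12.7, -13.1, -12.9],
]

avg_sd_a = mean(per_wavelength_sd(species_a))
separated = bands_separated(species_a, species_b)
print(f"average SD (species A): {avg_sd_a:.2f}; bands separated: {all(separated)}")
```

Reporting a single average standard deviation per species, as done in the revised legend, keeps the figure uncluttered while still letting the reader judge whether the separation between curves exceeds the scan-to-scan noise.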

      Reviewer #3 (Recommendations for the Authors):

      (1) The Discussion reiterates much of the background (mutational tolerance, fuzziness, SLiMs) already covered in the Introduction, diluting focus on the key new findings. The authors should consider shortening and refocusing the discussion on the main contributions in light of existing knowledge of viral assembly.

      In the Introduction we have provided background on intrinsically disordered proteins in general and their mutational tolerance, as well as the concept of fuzzy complexes. The first several paragraphs of the Discussion have a different focus, which is protein binding interfaces between viral proteins (obviously key in fuzzy complexes), specifically their modulation and the remarkable de novo introduction of binding interfaces. We believe this deserves emphasis, since this highlights a novel aspect of fuzziness, for the mutant spectrum of RNA viruses to encode a range of assembly stabilities and architectures.

      To reduce redundancy between the end of the Introduction and the beginning of the Discussion, we have shortened the last paragraph of the Introduction and removed its preview of the conclusions, as described in the response to the next comment of the Reviewer (see below).

      Unfortunately, the length of the Discussion is dictated in part also by the need to discuss methodological aspects, among them the limitations of VLP assays, and the redox state of the cysteine in the LRS mutants, which were important points recommended by other suggestions of the Reviewers. Similarly, we believe the discussion of other potential functions of Omicron N-arm mutations is warranted, as well as the background of the R203K/G204R double mutation that has attracted significant attention in the field due to its effects on phosphorylation and expression of truncated N species that also form RNPs. Our goal was to integrate the results by us and other laboratories regarding specific mutation effects into a comprehensive picture of molecular evolution of N, which we believe the framework of fuzzy complexes can provide.

      (2) The Abstract and early Introduction set a broad stage (IDPs, fuzziness), but don't explicitly state the concrete hypotheses that the experiments test. Please add 2-3 sentences in the Introduction that enumerate testable hypotheses, e.g.:

      (a) P13L creates a new N-arm interface that increases RNP stability.

      (b) G214C/G215C strengthens LRS oligomerization to stabilize higher-order N assemblies.

      We agree the introduction can be improved.  However, it seems to us that it cannot be neatly framed in the hypothesis – answer dichotomy, without losing a lot of nuances and without requiring an even longer and more detailed introduction.

      One of the main questions is to test whether the framework of fuzzy complexes can be applied to understand molecular evolution of N, and we feel the introduction is already flowing well towards this:

      “ … In fuzzy complexes the total binding energy is distributed into multiple distinct ultra-weak interaction sites (Olsen et al., 2017). Similar to individual RNA virus proteins with loose or absent structure, maintaining disorder and a spatial distribution of low-energy interactions in the protein complexes may increase the tolerance for mutations and improve evolvability of protein complexes.

      The unprecedented worldwide sequencing effort of SARS-CoV-2 genomes during its rapid evolution in humans provides a unique opportunity to examine these concepts. ...”

      To bring this to a more concrete set of questions in the end, we have shortened and rewritten the last paragraph in the Introduction:

      “To examine how architecture and energetics of RNP assemblies can be impacted by N-protein mutations we study a panel of N-proteins derived from ancestral Wuhan-Hu-1 and different VOCs, including Alpha, Delta, Lambda, and Omicron (see Table 1), in biophysical experiments, VLP assays, and mutant virus. Specifically, we ask how the RNP size distribution and life-time is modulated by: (1) the novel binding interface created by the P13L mutation of Omicron; (2) enhancements of other weak self-association interfaces through G215C of Delta and G214C of Lambda; (3) the ubiquitous R203K/G204R double mutation of Alpha, Lambda, and Omicron.  We also test whether the P13L mutation improves viral fitness, similar to G215C and R203K/G204R. The results are discussed in the framework of fuzzy complexes and molecular evolution of N in the course of viral adaptation to the human host. Understanding the salient features of the binding interfaces in viral assembly and their evolution expands our foundation for the design of therapeutics such as assembly inhibitors.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):  

      From my reading, this study aimed to achieve two things:  

      (1) A neurally-informed account of how Pieron's and Fechner's laws can apply in concert at distinct processing levels.  

      (2) A comprehensive map in time and space of all neural events intervening between stimulus and response in an immediately-reported perceptual decision.  

      I believe that the authors achieved the first point, mainly owing to a clever contrast comparison paradigm, but with good help also from a new topographic parsing algorithm they created. With this, they found that the time intervening between an early initial sensory evoked potential and an "N2" type process associated with launching the decision process varies inversely with contrast according to Pieron's law. Meanwhile, the interval from that second event up to a neural event peaking just before response increases with contrast, fitting Fechner's law, and a very nice finding is that a diffusion model whose drift rates are scaled by Fechner's law, fit to RT, predicts the observed proportion of correct responses very well. These are all strengths of the study.   
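The interplay of the two laws described in this comment can be illustrated with a minimal sketch. It is not the authors' model: the functional forms are the textbook ones (Piéron's power law for encoding latency, Fechner's logarithmic law for the perceived contrast difference, and the closed-form accuracy and mean decision time of an unbiased two-boundary diffusion), and every parameter value and function name below is a hypothetical choice for illustration.

```python
import math


def pieron_time(intensity, t0, beta, alpha):
    """Pieron's law: latency decreases as a power function of intensity."""
    return t0 + beta * intensity ** (-alpha)


def fechner_drift(c_high, c_low, gain):
    """Fechner's law: perceived difference scales with the log contrast ratio."""
    return gain * math.log(c_high / c_low)


def ddm_accuracy(drift, boundary, noise=1.0):
    """P(correct) for a diffusion starting midway between 0 and `boundary`."""
    return 1.0 / (1.0 + math.exp(-drift * boundary / noise ** 2))


def ddm_mean_decision_time(drift, boundary, noise=1.0):
    """Mean decision time for the same unbiased diffusion."""
    if drift == 0:
        return boundary ** 2 / (4 * noise ** 2)
    return boundary / (2 * drift) * math.tanh(drift * boundary / (2 * noise ** 2))


# Assumed design: a fixed contrast difference of 0.05 at increasing mean contrast
pairs = [(0.10, 0.15), (0.40, 0.45), (0.80, 0.85)]
encoding_times, decision_times, accuracies = [], [], []
for c_low, c_high in pairs:
    mean_contrast = (c_low + c_high) / 2
    drift = fechner_drift(c_high, c_low, gain=5.0)
    encoding_times.append(pieron_time(mean_contrast, t0=0.05, beta=0.03, alpha=0.5))
    decision_times.append(ddm_mean_decision_time(drift, boundary=1.5))
    accuracies.append(ddm_accuracy(drift, boundary=1.5))

for row in zip(pairs, encoding_times, decision_times, accuracies):
    print(row)
```

With these assumptions, raising the overall contrast shortens the Piéron-governed interval while lengthening the decision interval and lowering accuracy, which is the opposing pattern the paradigm is designed to expose.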

      We thank the reviewer for their comments that added context to the events we detected in relation to previous findings. We also believe that the change in the HMP algorithm suggested by the reviewer improved the precision of our analyses and the manuscript. We respond to the reviewer’s specific comments below.

      (1) The second, generally stated aim above is, in the opinion of this reviewer, unconvincing and ill-defined. Presumably, the full sequence of neural events is massively task-dependent, and surely it is more in number than just three. Even the sensory evoked potential typically observed for average ERPs, even for passive viewing, would include a series of 3 or more components - C1, P1, N1, etc. So are some events being missed? Perhaps the authors are identifying key events that impressively demarcate Pieron- and Fechner-adherent sections of the RT, but they might want to temper the claim that they are finding ALL events. In addition, the propensity for topographic parsing algorithms to potentially lump together distinct processes that partially co-evolve should be acknowledged.  

      We agree with the reviewer that the topographical solutions found by HMP will be dependent on the task and the quality and type of data. We address this point in the last section of the discussion (see also response to R3.5). We would also like to add that the events detected by HMP are, by construction, those that contribute to the RT and not necessarily all ERPs elicited by a stimulus.

      In addition to the new last section of the discussion we also make these points clear in the revised manuscript at the discussion start: 

      “By modeling the recorded single-trial EEG signal between stimulus onset and response as a sequence of multivariate events with varying by-trial peak times, we  aimed to detect recurrent events that contribute to the duration of the reaction time in the present perceptual decision-making task”.

      Regarding the typical visual ERPs, in response to this comment but also comments R1.2, R1.3 and R2.1, we aimed for a more precise description of the topographies and thus reduced the width of the HMP expected events to 25ms. This ensures that we do not miss events shorter than the initial expectations of 50ms (see Appendix B of Weindel et al., 2024 and also response to R1.3). This new estimation provides evidence for at least two of the visual ERPs that, based on their timings and topographies (in relation with the spatial frequency of the stimulus), we interpret as the N40 and the P100 (see response to R1.5 for the justification of this categorization). We provide a description and justification of the interpretations in the Results section “Five trial-recurrent sequential events occur in the EEG during decisions” and the discussion section “Visual encoding time”.

      (2) To take a salient example, the last neural event seems to blend the centroparietal positivity with a more frontal midline negativity, some of which would capture the CNV and some motor-execution related components that are more tightly time-locked to, of course, the response. If the authors plotted the traditional single-electrode ERP at the frontal focus and centroparietal focus separately, they are likely to see very different dynamics and contrast- and SAT-dependency. What does this mean for the validity of the multivariate method? If two or more components are being lumped into one neural event, wouldn't it mean that properties of one (e.g., frontal burstiness at response) are being misattributed to the other (centroparietal signal that also peaks but less sharply at response)?

      Using the new HMP parameterization described above we show that the reviewer's intuition was correct. Using an expected pattern duration of 25ms, the last event in the original manuscript splits into two events. The before-last event, now referred to as the lateralized readiness potential (LRP), presents a strong lateralization (Figure 3) with an increased negativity over the motor cortex contralateral to the right hand. The effect of contrast is mostly on the last event that we interpret as the CPP (Figure 5). Despite the improved precision of the topographies of the identified events, it should be noted that some components will overlap. If the LRP is generated when a certain amount of evidence is accumulated (e.g. when the CPP crosses a certain value), then a time-based topography will necessarily include that CPP activity in addition to the lateralized potential. We discuss this in the section “Motor execution” of the discussion:

      “Adding the abrupt onset of this potential, we believe that this event is the start of motor execution, engaged after a certain amount of evidence. The evidence for this interpretation is manifest in the fact that the event's topography shares some activity with the CPP event that follows, an expected result if the LRP is triggered at a certain amount of evidence, indexed by the CPP”.

      (3) Also related to the method, why must the neural events all be 50 ms wide, and what happens if that is changed? Is it realistic that these neural events would be the same duration on every trial, even if their duration was a free parameter? This might be reasonable for sensory and motor components, but unlikely for cognitive.  

      The HMP method is sensitive to the event's duration, as shown in the manuscript about the method (Appendix B of Weindel et al., 2024). Nevertheless, as long as the topography in the real data lasts longer than the expected one, it should not be missed (the same holds for by-trial variations in the event width). For this reason we halved the expected event width of 50ms (introduced by the original HsMM-MVPA paper by Anderson and colleagues) in the revision. This new estimation with 25ms is thus much less likely to miss events, as evidenced by the new visual and motor events. In the revised manuscript this is addressed at the start of the Results section:

      “Contrary to previous applications (Anderson et al.,2016; Berberyan et al., 2021; Zhang et al., 2018; Krause et al., 2024) we assumed that the multivariate pattern was represented by a 25ms half-sine as our previous research showed that a shorter expected pattern width increases the likelihood of detecting cognitive events (see Appendix B of Weindel et al., 2024)”.
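One simple way to see the intuition behind a shorter expected pattern is sketched below. This is not the HMP estimation procedure: it uses a windowed cosine similarity between a half-sine template and a noiseless one-dimensional signal as a stand-in matching score, and the sampling rate of 1000 Hz (so that 25 samples correspond to 25 ms) and all function names are assumptions.

```python
import math


def half_sine(n_samples):
    """Half-sine template sampled at bin midpoints."""
    return [math.sin(math.pi * (i + 0.5) / n_samples) for i in range(n_samples)]


def embed(event, length, onset):
    """Place an event in an otherwise flat signal of the given length."""
    signal = [0.0] * length
    signal[onset:onset + len(event)] = event
    return signal


def best_cosine_match(template, signal):
    """Highest cosine similarity between the template and any signal window."""
    t_norm = math.sqrt(sum(v * v for v in template))
    best = 0.0
    for start in range(len(signal) - len(template) + 1):
        window = signal[start:start + len(template)]
        w_norm = math.sqrt(sum(v * v for v in window))
        if w_norm == 0.0:
            continue  # empty window, nothing to match
        dot = sum(a * b for a, b in zip(template, window))
        best = max(best, dot / (t_norm * w_norm))
    return best


# Assumed sampling rate of 1000 Hz: 25 samples = 25 ms, 50 samples = 50 ms
short_template, long_template = half_sine(25), half_sine(50)

# A short template applied to a longer event, and a long template on a shorter event
short_on_long = best_cosine_match(short_template, embed(long_template, 200, 75))
long_on_short = best_cosine_match(long_template, embed(short_template, 200, 88))
print(f"25 ms template on 50 ms event: {short_on_long:.2f}")
print(f"50 ms template on 25 ms event: {long_on_short:.2f}")
```

In this toy setting the short template matches a longer event better than the long template matches a shorter event, in line with the argument that an expected width shorter than the true event is the safer misspecification.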

      Regarding the event width as a free parameter this is both technically and statistically difficult to implement as the amount of computing capacity, flexibility and trade-offs among the HMP parameters would, given the current implementation, render the model unfit for most computers and statistically unidentifiable.

      (4) In general, I wonder about the analytic advantage of the parsing method - the paradigm itself is so well-designed that the story may be clear from standard average event-related potential analysis, and this might sidestep the doubts around whether the algorithm is correctly parsing all neural events.  

      Average ERP analysis cannot differentiate between an effect of an experimental factor on the amplitude vs. on the timing of the underlying components (Luck, 2005). Furthermore, the overlap of components across trials blurs the distinction between them. For both reasons we would not be able to reach the same level of certainty and precision using ERP analyses. In addition, the relatively low number of trials per experimental cell (contrast level X SAT X participant = 6 trials) makes the analyses hard to perform on ERPs, which typically require more trials per modality. From the reviewer’s comment we understand that this point was not clear. We therefore discuss this in the revision, Section “Functional interpretation of the events” of the results:

      “Nevertheless identifying neural dynamics on these ERPs centered on stimulus is complicated by the time variation of the underlying single-trial events (see probabilities displayed in Figure 3 for an illustration and Burle et al., 2008, for a discussion). The likely impact of contrast on both amplitude and time on the underlying single-trial event does not allow one to interpret the average ERP traces as showing an effect in one or the other dimension without strong assumptions (Luck, 2005)”.
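The amplitude versus timing ambiguity described in this response can be illustrated with a small simulation. The sketch is a toy example under stated assumptions: each single-trial event is a noiseless half-sine of identical amplitude, and the only thing that varies across trials is its onset. The jitter values and function names are hypothetical.

```python
import math


def single_trial(n_samples, onset, width, amplitude=1.0):
    """One trial containing a single half-sine event starting at `onset`."""
    trial = [0.0] * n_samples
    for i in range(width):
        if 0 <= onset + i < n_samples:
            trial[onset + i] = amplitude * math.sin(math.pi * (i + 0.5) / width)
    return trial


def average(trials):
    """Sample-by-sample average across trials (a stimulus-locked ERP)."""
    return [sum(col) / len(trials) for col in zip(*trials)]


n_samples, width = 300, 50

# Same single-trial amplitude in both conditions; only the timing differs
aligned = average([single_trial(n_samples, 125, width) for _ in range(9)])
jittered = average(
    [single_trial(n_samples, 125 + shift, width) for shift in range(-40, 41, 10)]
)

peak_aligned, peak_jittered = max(aligned), max(jittered)
print(f"peak of average, no jitter:   {peak_aligned:.2f}")
print(f"peak of average, with jitter: {peak_jittered:.2f}")
```

Although every simulated trial has the same event amplitude, the jittered average has a much lower and broader peak, so a change in timing variability alone could be misread as an amplitude effect in the averaged trace.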

      (5) In particular, would the authors consider plotting CPP waveforms in the traditional way, across contrast levels? The elegant design is such that the C1 component (which has similar topography) will show up negative and early, giving way to the CPP, and these two components will show opposite amplitude variations (not just temporal intervals as is this paper's main focus), because the brighter the two gratings, the stronger the aggregate early sensory response but the weaker the decision evidence due to Fechner. I believe this would provide a simple, helpful corroborating analysis to back up the main functional interpretation in the paper.  

      We agree with the suggestion and have introduced the representation on top of Figure 5 for sets of three electrodes in the occipital, posterior and frontal regions. The new panels clearly show an inversion of the contrast effect dependent on the time and locus of the electrodes. We discuss this in Section “Functional interpretation of the events” of the results:

      “This representation shows that there is an inversion of the contrast effect with higher contrasts having a higher amplitude on the electrodes associated with visual potentials in the first couple of deciseconds (left panel of Figure 5A) while parietal and frontal electrodes show a higher amplitude for lower contrasts in later portions of the ERPs (middle and right panel of Figure 5A)”.

      To us, this crucially shows that we cannot achieve the same decomposition using traditional ERP analyses. In these plots it appears that while, as described by the reviewer, there is an inversion, the timing and amplitude of the changes due to contrast can hardly be interpreted.

      (6) The first component is picking up on the C1 component (which is negative for these stimulus locations), not a "P100". Please consult any visual evoked potential study (e.g., Luck, Hillyard, etc). It is unexpected that this does not vary in latency with contrast - see, for example, Gebodh et al (2017, Brain Topography) - and there is little discussion of this. Could it be that nonlinear trends were not correctly tested for?  

      We disagree with the reviewer on the interpretation of the ERP. The timing of the detected component is later than the one usually associated with a C1. Furthermore, the central display does not create optimal conditions to detect a C1.

      We do agree that the topography invites this confusion, but we believe that this is due to the spatial frequency of the stimulus, which generates a high posterior positivity (see references in the following extract). The new HMP solution also now shows an effect of contrast on the P100 latencies; we believe this is due to the increased precision in the time location of the component. We discuss this in the “Visual encoding time” section of the discussion:

      “The following event, the P100, is expressed around 70ms after the N40, its topography is congruent with reports for stimuli with low spatial frequencies as used in the current study (Kenemans et al., 2002, 2000; Proverbio et al., 1996). The timing of this P100 component is changed by the contrast of the stimulus in the direction expected by the Piéron law (Figure 4A)”. 

      (7) There is very little analysis or discussion of the second stage linked to attention orientation - what would the role of attention orientation be in this task? Is it spatial attention directed to the higher contrast grating (and if so, should it lateralise accordingly?), or is it more of an alerting function the authors have in mind here?  

      We agree that we were not specific enough on the interpretation of this attention stage. We now discuss our hypothesis in the section “Attention orientation” of the discussion:  

      “We do however observe an asymmetry in the topographical map Figure 3. This asymmetry might point to an attentional bias with participants (or at least some participants) allocating attention to one side over the other in the same way as the N2pc component (Luck and Hillyard, 1994, Luck et al., 1997). Based on this collection of observations, we conclude that this third event represents an attention orientation process. In line with the finding of Philiastides et al. (2006), this attention orientation event might also relate to the allocation of resources. Other designs varying the expected cognitive load or spatial attention could help in further interpreting the functional role of this third event”.

      We would like to add that it is unlikely that the asymmetry we mention in the discussion stems from a redirection towards the higher contrast, as the experimental design balanced the side of presentation. We therefore believe that this is a behavioral bias rather than a bias toward the highest contrast stimulus as suggested by the reviewer. We hope that, while more could be tested and discussed, this discussion is sufficient given the current manuscript's goal.

      Reviewer #2 (Public review):  

      Summary:  

      The authors decomposed response times into component processes and manipulated the duration of these processes in opposing directions by varying contrast, and overall by manipulating speed-accuracy tradeoffs. They identify different processes and their durations by identifying neural states in time and validate their functional significance by showing that their properties vary selectively as expected with the predicted effects of the contrast manipulation. They identify 3 processes: stimulus encoding, attention orienting, and decision. These map onto classical event-related potentials. The decision-making component matched the CPP, and its properties varied with contrast and predicted decision-accuracy, while also exhibiting a burst not characteristic of evidence accumulation.  

      Strengths:  

      The design of the experiment is remarkable and offers crucial insights. The analysis techniques are beyond state-of-the-art, and the analyses are well motivated and offer clear insights.  

      Weaknesses:  

      It is not clear to me that the results confirm that there are only 3 processes, since e.g., motor preparation and execution were not captured. While the authors discuss this, this is a clear weakness of the approach, as other components may also have been missed. It is also unclear to what extent topographies map onto processes, since, e.g., different combinations of sources can lead to the same scalp topography.  

      We thank the reviewer for their kind words and for drawing attention to the question of the missing motor preparation event. In light of this comment (and also R1.1, R3.3), the revised manuscript uses a finer-grained approach for the multivariate event detection. This more precise estimation comes from the use of a shorter expected pattern: the initial expectation of a 50 ms half-sine was halved, ensuring that we do not miss events shorter than the initial expectation (see Appendix B of Weindel et al., 2024 and also the response to R1.3). In the new solution, the motor component that the reviewer expected is found, as evidenced by the topography of the event, its lateralization, and a time-to-response congruent with a response execution event. This is now described in the section “Motor execution” of the revised manuscript:

      “The penultimate event, identified as the LRP, shows a strong hemispheric asymmetry congruent with a right hand response. The peak of this event is approximately 100 ms before the response which is congruent with reports that the LRP peaks at the onset of electromyographical activity in the effector muscle (Burle et al., 2004), typically happening 100 ms before the response in such decision-making tasks (Weindel et al., 2021). Furthermore, while its peak time is dependent on contrast, its expression in the EEG is less clearly related to the contrast manipulation than the following CPP event”.

      Reviewer #3 (Public review):  

      Summary:  

      In this manuscript, the authors examine the processing stages involved in perceptual decision-making using a new approach to analysing EEG data, combined with a critical stimulus manipulation. This new EEG analysis method enables single-trial estimates of the timing and amplitude of transient changes in EEG time-series, recurrent across trials in a behavioural task. The authors find evidence for three events between stimulus onset and the response in a two-spatial-interval visual discrimination task. By analysing the timing and amplitude of these events in relation to behaviour and the stimulus manipulation, the authors interpret these events as related to separable processing stages for stimulus encoding, attention orientation, and decision (deliberation). This is largely consistent with previous findings from both event-related potentials (across trials) and single-trial estimates using decoding techniques and neural network approaches.  

      Strengths:  

      This work is not only important for the conceptual advance, but also in promoting this new analysis technique, which will likely prove useful in future research. For the broader picture, this work is an excellent example of the utility of neural measures for mental chronometry.  

      We appreciate the very positive review and thank the reviewer for pointing out important weaknesses in our original manuscript and also providing resources to address them in the recommendations to authors. Below we comment on each identified weakness and how we addressed them.   

      Weaknesses:  

      (1) The manuscript would benefit from some conceptual clarifications, which are important for readers to understand this manuscript as a stand-alone work. This includes clearer definitions of Piéron's and Fechner's laws, and a fuller description of the EEG analysis technique.

      We agree that the descriptions of both laws were insufficient; we therefore added the following text to the last paragraph of the introduction:

      “Piéron’s law predicts that the time to perceive the two stimuli (and thus the choice situation) should follow a negative power law with the stimulus intensity (Figure 1, green curve). In contradistinction, Fechner’s law states that the perceived difference between the two patches follows the logarithm of the absolute contrast of the two patches (Figure 1, yellow curve). As the task of our participants is to judge the contrast difference, Piéron’s law should predict the time at which the comparison starts (i.e. the stimuli become perceptible), while Fechner’s law should implement the comparison, and thus decision, difficulty”.
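
      To make the opposing predictions of the two laws concrete, the following is a minimal sketch and not the authors' code. It assumes the common textbook forms of both laws (a decreasing power function for Piéron, a logarithm for Fechner); the function names and all parameter values (`alpha`, `beta`, `gamma`, `k`, `delta`) are illustrative assumptions, not values from the manuscript.

```python
import numpy as np

def pieron_time(contrast, alpha=0.1, beta=0.5, gamma=0.05):
    """Pieron's law (assumed form): time falls as a negative power of intensity."""
    return alpha * contrast ** (-beta) + gamma

def fechner_perceived_difference(contrast, delta=0.05, k=1.0):
    """Fechner's law (assumed form): perceived intensity is k * log(intensity).

    For two patches with contrasts `contrast` and `contrast + delta`, the
    perceived difference is the difference of the two logarithms.
    """
    return k * (np.log(contrast + delta) - np.log(contrast))

# Hypothetical contrast levels, chosen only for illustration
contrasts = np.linspace(0.1, 0.9, 5)
encoding_times = pieron_time(contrasts)
perceived_diffs = fechner_perceived_difference(contrasts)
```

      Under these assumptions, the time predicted by Piéron's law shrinks as contrast grows, while the perceived difference for a fixed physical difference also shrinks, which makes the comparison harder at high contrast.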

      Regarding the EEG analysis technique, we added a few elements at the start of the Results section:

      “The hidden multivariate pattern model (HMP) implemented assumed that a task-related multivariate pattern event is represented by a half-sine whose timing varies from trial to trial based on a gamma distribution with a shape parameter of 2 and a scale, controlling the average latency of the event, free-to-vary per event (Weindel et al., 2024)”.
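
      The two ingredients named in this description can be illustrated with a short sketch. It is not the HMP package and does not use its API; it only draws event latencies from a gamma distribution with a shape of 2 and builds a half-sine template. The scale value, sampling rate, and template width below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

shape = 2.0        # fixed shape parameter, as stated in the text
scale_ms = 75.0    # hypothetical scale; the mean latency is shape * scale

# Trial-by-trial latencies of one event (in ms)
latencies = rng.gamma(shape, scale_ms, size=10_000)

def half_sine(width_ms=50, sfreq=1000):
    """Half-sine template of the given width (illustrative helper)."""
    n_samples = int(round(width_ms * sfreq / 1000))
    return np.sin(np.pi * np.linspace(0, 1, n_samples))

pattern = half_sine()
expected_mean = shape * scale_ms
```

      Because the shape is fixed, the scale alone sets the average latency of the event (here 2 × 75 = 150 ms), which is why it is described as a timing parameter rather than an amplitude.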

      We also made the technique clearer at the start of the discussion:

      “By modeling the recorded single-trial EEG signal between stimulus onset and response as a sequence of multivariate events with varying by-trial peak times, we aimed to detect recurrent events that contribute to the duration of the reaction time in the present perceptual decision-making task. In addition to the number of events, using this hidden multivariate pattern approach (Weindel et al., 2024) we estimated the trial-by-trial probability of each event’s peak, therefore accessing at which time sample each event was the most likely to occur”.

      Additionally, we added a proper description in the method section (see the new first paragraph of the “Hidden multivariate pattern” subsection). 

      (2) The manuscript, broadly, but the introduction especially, may be improved by clearly delineating the multiple aims of this project: examining the processes for decision-making, obtaining single-trial estimates of meaningful EEG-events, and whether central parietal positivity reflects ramping activity or steps averaged across trials.

      For the sake of clarity we removed the question of the ramping activity vs steps in the introduction and focused on the processes in decision-making and their single-trial measurement as this is the main topic of the paper. Furthermore the references provided by the reviewer allowed us to write a more comprehensive review of previous studies and how the current study is in line with those. These changes are mainly manifested in these new sentences:

      “As an example Philiastides et al. (2006) used a classifier on the EEG activity of several conditions to show that the strength of an early EEG component was proportional to the strength of the stimulus while a later component was related to decision difficulty and behavioral performance (see also Salvador et al., 2022; Philiastides and Sajda, 2006). Furthermore the authors interpreted that a third EEG component was indicative of the resource allocated to the upcoming decision given the perceived decision difficulty. In their study, they showed that it is possible to use single-trial information to separate cognitive processes within decision-making. Nevertheless, their method requires a decoding approach, which requires separate classifiers for each component of interest and restrains the detection of the components to those with decodable discriminating features (e.g. stimuli with strong neural generators such as face stimuli, see Philiastides et al., 2006)”.

      (3) A fuller discussion of the limitations of the work, in particular, the absence of motor contributions to reaction time, would also be appreciated. 

      As laid out in the responses to comments R1.1 and R2, the new estimates now include evidence for a motor preparation component. We discuss this in the new “motor execution” paragraph in the discussion section. Additionally, we discuss the limitations of the study and the method in the last two paragraphs of the discussion (in the new Section “Generalization and limitation”).

      (4) At times, the novelty of the work is perhaps overstated. Rather, readers may appreciate a more comprehensive discussion of the distinctions between the current work and previous techniques to gauge single-trial estimates of decision-related activity, as well as previous findings concerning distinct processing stages in decision-making. Moreover, a discussion of how the events described in this study might generalise to different decision-making tasks in different contexts (for example, in auditory perception, or even value-based decision-making) would also be appreciated.  

      We agree that the original text could be read as overstating the novelty of the work. In addition to the changes linked to R3.2, we now also discuss the link with previous studies in the penultimate paragraph of the discussion, before the conclusion, in the new “Generalization and limitations” section:

      “The present study showed what cognitive processes are contributing to the reaction time and estimated single-trial times of these processes for this specific perceptual decision-making task. The identified processes and topographies ought to be dependent on the task and even the stimuli (e.g. sensory events will change with the sensory modality). More complex designs might generate a higher number of cognitive processes (e.g. memory retrieval from a cue, Anderson et al., 2016) and so could more natural stimuli which might trigger other processes in the EEG (e.g. appraisal vs. choice as shown by Frömer et al., 2024). Nevertheless, the observation of early sensory vs. late decision EEG components is likely to generalize across many stimuli and tasks as it has been observed in other designs and methods (Philiastides et al., 2006; Salvador et al., 2022). To these studies we add that we can evaluate the trial-level contribution, as already done for specific processes (e.g. Si et al., 2020; Sturm et al., 2016), for the collection of events detected in the current study”.

      Reviewing Editor Comments:  

      As you will see, all three reviewers agree that the paper makes a valuable contribution and has many strengths. You will also see that they have provided a range of constructive comments highlighting potential issues with the interpretation of the outcomes of your signal decomposition method. In particular, all three reviewers point out that your results do not identify separate motor preparation signals, which we know must be operating on this type of task. The reviewers suggest further discussion of this issue and the potential limitations of your analysis approach, as well as suggesting some additional analyses that could be run to explore this further. While making these changes would undoubtedly enhance the paper and the final public reviews, I should note that my sense is that they are unlikely to change the reviewers' ratings of the significance of the findings and the strength of evidence in the final eLife assessment  

      Reviewer #1 (Recommendations for the authors):  

      (1) Abstract: "choice onset" is ill-defined and not the label most would give the start of the RT interval. Do you mean stimulus onset?  

      We replaced "choice onset" with "stimulus onset" in the abstract.

      (2) Similarly "choice elements" in the introduction seem to refer to sensory attributes/objects being decided about?  

      We replaced "choice-elements" with "choice-relevant features of the stimuli"

      (3) "how the RT emerges from these putative components" - it would be helpful to specify more what level of answer you're looking for, as one could simply answer "when they're done."  

      We replaced it with "how the variability in RTs emerges from these putative components".

      (4) Line 61-62: I'm not sure this is a fully correct characterisation of Frömer et al. It was not similar in invoking a step function - it did not invoke any particular mechanism or function, and in that respect does not compare well to Latimer et al. Also, I believe it was the overlap of stimulus-locked components, not response-locked, that they argued could falsely generate accumulator-like buildup in the response-locked ERP.  

      We indeed wrongly described Frömer et al. The sentence is now "In human EEG data, the classical observation of a slowly evolving centro-parietal positivity, scaling with evidence accumulation, was suggested to result from the overlap of time-varying stimulus-related activity in the response-locked event related potential"

      (5) Line 78: Should this be single-trial *latency*?  

      This referred to location in time but we agree that the term is confusing and thus replaced it with latencies.

      (6) The caption of Figure 1 should state what is meant by the y-axis "time"  

      We added the sentence "The y-axis refers to the time predicted by each law given a contrast value (x-axis) and the chosen set of parameters." to the caption of Figure 1.

      (7) Line 107: Is this the correct description of Fechner's law? If the perceived difference follows the log of the physical difference, then a constant physical difference should mean a constant perceived difference. Perhaps a typo here.  

      This was indeed a typo; we replaced the corresponding part of the sentence with "the perceived difference between the two patches follows the logarithm of the absolute contrast of the two patches".

      (8) Line 128: By scale, do you mean magnitude/amplitude?  

      No, this refers to the parameter of a gamma distribution. To clarify we edited the sentence:  "based on a gamma distribution with a shape parameter of 2 and a scale parameter, controlling the average latency of the event, free-to-vary per event"

      (9) The caption of Figure 3 is insufficient to make sense of the top panel. What does the inter-event interval mean, and why is it important to show? What is the "response" event?  

      We agree that the top panel was insufficiently described. To keep the paper short, and because these panels provided relatively little information, we replaced them with a figure showing only the average topographies and the asymmetry tests for each event.

      (10) Figure 4: caption should say what the top vs bottom row represents (presumably, accuracy vs speed emphasis?), and what the individual dots represent, given the caption says these are "trial and participant averaged". A legend should be provided for the rightmost panels.  

      We agree and therefore edited Figure 4. The beginning of the caption mentioned by the reviewer now reads: “A) The panels represent the average duration between events for each contrast level, averaged across participants and trials (stimulus and response respectively as first and last events) for accuracy (top) and speed instructions (bottom).”. Additionally we added legends for the SAT instructions and the model fits.

      (11) Line 189: argued for a decision-making role of what?  

      Stafford and Gurney (2004) proposed that Piéron’s law could reflect a non-linear transformation from sensory input to action outcomes, which they argued reflected a response mechanism. We (Van Maanen et al., 2012) specified this result by showing that a Bayesian Observer Model, in which evidence for two alternative options is accumulated following Bayes' rule, indeed predicts a power relation between the difference in sensory input of the two alternatives and mean RT. However, the current data suggest that such an explanation cannot be the full story, as also noted by R3. To clarify this point we replaced the comment with the following sentence:

      “Note that this observation is not necessarily incongruent with theoretical work that argued that Piéron’s law could also be a result of a response selection mechanism (Stafford and Gurney, 2004; Van Maanen et al., 2012; Palmer et al., 2005). It could be that differences in stimulus intensity between the two options also contribute to a Piéron-like relationship in the later intervals, that is convoluted with Fechner’s law (see Donkin and Van Maanen, 2014 for a similar argument). Unfortunately, our data do not allow us to discriminate between a pure logarithmic growth function and one that is mediated by a decreasing power function”.

      (12) Table 2: There is an SAT effect even on the first interval, which is quite remarkable and could be discussed more - does this mean that the C1 component occurs earlier under speed pressure? This would be the first such finding.  

      The event we originally qualified as a P100 was sensitive to SAT, but the earliest event is now the N40, which is not statistically sensitive to speed pressure in these data. We believe that the fact that the P100 is still sensitive to SAT is not a surprise and therefore do not highlight it.

      (13) Line 221: "decrease of activation when contrast (and thus difficulty) increases" - is this shown somewhere in the paper?  

      The whole section for this analysis was rewritten (see comment below)

      (14) I find the analysis of Figure 5 interesting, but the interpretation odd. What is found is that the peak of the decision signal aligns with the response, consistent with previous work, but the authors choose to interpret this as the decision signal "occurring as a short-lived burst." Where is the quantitative analysis of its duration across trials? It can at least be visually appraised in the surface plot, and this shows that the signal has a stimulus-locked onset and, apart from the slowest RTs, remains present and for the most part building, until response. What about this is burst-like? A peak is not a burst.  

      This was the residue of a previous version of the paper where an analysis reported that no evidence accumulation trace was found. But after proper simulations this analysis turned out to be false because of a poor statistical test. Thus we removed this paragraph in the revised manuscript and Figure 5 has now been extended to include surface plots for all the events.

      Reviewer #2 (Recommendations for the authors):  

      Overall, I really enjoyed reading this paper. However, in some places the approach is a bit opaque or the results are difficult to follow. As I read the paper, I noted:  

      Did you do a simple DDM, or did you do a collapsing bound for speed?  

      The fitted DDM was an adaptation of the proportional rate diffusion model. We make this clearer at the end of the introduction: “Given that Fechner’s law is expected to capture decision difficulty we connected this law to the classical diffusion decision models by replacing the rate of accumulation with Fechner’s law in the proportional rate diffusion model of Palmer et al. (2005).”
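
      As an illustration of what such a substitution can look like, the sketch below uses the closed-form mean RT and accuracy of the proportional-rate diffusion model from Palmer et al. (2005) and feeds them a drift rate derived from Fechner's law. This is an assumed parameterization and not necessarily the one fitted in the manuscript; `fechner_drift`, `bound`, `t_nd`, `k`, and `delta` are hypothetical names and values.

```python
import numpy as np

def fechner_drift(contrast, delta=0.05, k=10.0):
    """Assumed drift rate: log-scaled difference between the two patches."""
    return k * np.log((contrast + delta) / contrast)

def prd_mean_rt(drift, bound=1.0, t_nd=0.3):
    """Mean RT of the proportional-rate diffusion model (Palmer et al., 2005)."""
    return (bound / drift) * np.tanh(bound * drift) + t_nd

def prd_accuracy(drift, bound=1.0):
    """Probability of a correct response in the same model."""
    return 1.0 / (1.0 + np.exp(-2.0 * bound * drift))

# Hypothetical contrast levels, chosen only for illustration
contrasts = np.linspace(0.1, 0.9, 5)
drifts = fechner_drift(contrasts)
mean_rts = prd_mean_rt(drifts)
accuracies = prd_accuracy(drifts)
```

      With a fixed physical difference between the patches, the drift rate falls as contrast increases, so this sketch predicts slower and less accurate decisions at higher contrast, which is the direction expected from Fechner's law.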

      It is confusing that the order of intervals in the text doesn't match the order in the table. It might be better to say what events the interval is between rather than assuming that the reader reconstructs.  

      We agree and adapted the order in both the text and the table. The table is now also more explicit (e.g. RT instead of S-R)

      Otherwise, I do wonder to what extent the method is able to differentiate processes that yield similar scalp topographies and find it a bit concerning that no motor component was identified.  

      We believe that the new version with the LRP/CPP demonstrates that the method can handle similar topographies. The method can handle events with close topographies as long as they are separate in time; however, if they are not sequential to one another, the method cannot capture both events. We now discuss this, in relation to the C1/P100 overlap, in the discussion section “Visual encoding time”:

      “Nevertheless this event, seemingly overlapping with the P100 even at the trial level (Figure 5C), cannot be recovered by the method we applied. The fact that the P100 was recovered instead of the C1 could indicate that only the timing of the P100 contributes to the RT (see Section 3 of Weindel et al., 2024)”.

      And we more generally address the question of overlap in the new section “Generalization and limitation”.

      Reviewer #3 (Recommendations for the authors):  

      Major Comments:  

      (1) If we agree on one thing, it is that motor processes contribute to response time. Line 364: "In the case of decision-making, these discrete neural events are visual encoding, attention-orientation, and decision commitment, and their latency make up the reaction time." Does the third event, "decision commitment", capture both central parietal positivity (decision deliberation) and motor components? If so, how can the authors attribute the effects to decision deliberation as opposed to motor preparation?  

      Thanks to the suggestions, including those in the public review, this main problem is now addressed: we capture both a motor component and a decision commitment.

      Line 351 suggests that the third event may contain two components.  

      This was indeed our initial, badly written, hypothesis. Nevertheless the new solution again addresses this problem.

      The time series in Figure 6 shows an additional peak that is not evident in the simulated ramp of Appendix 1.  

      This was probably due to the overlap of both the CPP and the LRP. It is now much clearer that the CPP looks mostly like a ramp while the LRP looks much more like a burst-like/peaked activity. We make this clear in the “Decision event” paragraph of the discussion section:

      “Regarding the build-up of this component, the CPP is seen as originating from single-trial ramping EEG activities but other work (Latimer et al., 2015; Zoltowski et al., 2019) have found support for a discrete event at the trial-level. The ERPs on the trial-by-trial centered event in Figure 5 show support for both accounts. As outlined above, the LRP is indeed a short burst-like activity but the build-up of the CPP between high vs low contrast diverges much earlier than its peak”.

      Previous analyses (Weindel et al., 2024) found motor-related activity from central parietal topographies close to the response by comparing the difference in single-trial events on left- vs right-hand response trials. The authors suggest at line 315 that the use of only the right hand for responding prevented them from identifying a motor event.  

      The use of only the right hand should have made the event more identifiable because the topography would be consistent across trials (rather than inverting on left vs right hand response trials).  

      The reviewer is correct: in the original manuscript we did not test for lateralization, but the reviewer's comment gave us the idea to explicitly test for the asymmetry (Figure 3). This test now clearly shows what would be expected for a motor event, with a strong negativity over the left motor cortex.

      The authors state on line 422 that the EEG data were truncated at the time of the response.  

      Could this have prevented the authors from identifying a motor event that might overlap with the timing of the response?  

      We thank the reviewer for this suggestion. This would have been a possibility but the problem is that adding samples after the response also adds the post-response processes (error monitoring, button release, stimulus disappearance, etc.). While increasing the samples after the response is definitely something that we need to inspect, we think that the separation we achieved in this revision doesn’t call for this supplementary analysis.

      The largest effects of contrast on the third event amplitude appear around the peak as opposed to the ramp. If the peak is caused by the motor component, how does this affect the conclusions that this third event shows a decision-deliberation parietal processes as opposed to a motor process (a number of studies suggest a causal role for motor processes in decision-making e.g. Purcell et al., 2010 Psych Rev; Jun et al., 2021 Nat Neuro; Donner et al., 2009 Curr Bio).  

      This result has now changed: the peak no longer appears to capture most of the effect. We do however think that there might be some link to theories of motor-related accumulation. We therefore added this to the discussion in the Motor execution section:

      “Based on all these observations, it is therefore very likely that this LRP event signs the first passage of a two-step decision process as suggested by recent decision-making models (Servant et al., 2021; Verdonck et al., 2021; Balsdon et al., 2023)”.

      I would suggest further investigation into the motor component (perhaps by extending the time window of analysed EEG to a few hundred ms after the response) and at least some discussion of the potential contribution of motor processes, in relation to the previous literature.  

      We believe that the absence of a motor component is sufficiently addressed in the revised manuscript and in the responses to the other comments.    

      (2) What do we learn from this work? Readers would appreciate more attention to previous findings and a clearer outline of how this work differs. Two points stand out, outlined below. I believe the authors can address these potential complaints in the introduction and discussion, and perhaps provide some clarification in the presentation of the results.  

      In the introduction, the authors state that "... to date, no study has been able to provide single-trial evidence of multiple EEG components involved in decision-making..." (line 64). Many readers would disagree with this. For example, Philiastides, Ratcliff, & Sajda (2006) use a single-trial analysis to unravel early and late EEG components relating to decision difficulty and accuracy (across different perceptual decisions), which could be related to the components in the current work. Other, network-based single-trial EEG analyses (e.g., Si et al., 2020, NeuroImage, Sturm et al., 2016 J Neurosci Methods) could also be related to the current component approach. Yet other approaches have used inverse encoding models to examine EEG components related to separable decision processes within trials (e.g., Salvador et al., 2022, Nat Comms). The results of the current work are consistent with this previous work - the two components from Philiastides et al., 2006 can be mapped onto the components in the current work, and Salvador et al., 2022 also uncover stimulus- and decision-deliberation related components.  

      We completely agree with the reviewer that the link to previous work was insufficient. We now include all references that the reviewer points out both in the introduction (see response R3.2) and in the discussion (see response R3.4). We wish to thank the reviewer for bringing these papers to our attention as they are important for the manuscript.

      The authors relate their components to ERPs. This prompts the question of whether we would get the same results with ERP analyses (and, on the whole, the results of the current work are consistent with conclusions based on ERP analyses, with the exception of the missing motor component). It's nice that this analysis is single-trial, but many of the follow-up analyses are based on grouping by condition anyway. Even the single-trial analysis presented in Figure 4 could be obtained by median splits (given the hypotheses propose opposite directions of effects, except for the linear model). 

      We do not fully agree with the reviewer, in the sense that classical ERP analyses would require many more data points. The strength of the method here is that it uses the information shared across all contrast levels to model the processing time at a single contrast level (6 trials per participant). Furthermore, as stated in the response to R1.4 and R1.5, the aim of the paper is to obtain the timing of information processing components, which cannot be achieved with classical ERPs without strong, and likely false, assumptions.

      Medium Comments:  

      (1) The presentation of Piéron's law for the behavioural analysis is confusing. First, both laws should be clearly defined for readers who may be unfamiliar with this work. I found the proposal that Piéron's law predicts decreasing RT for increasing pedestal contrast in a contrast discrimination paradigm task surprising, especially given the last author's previous work. For example, Donkin and van Maanen (2014) write "However, the commonality of Piéron's Law across so many paradigms has lead researchers (e.g., Stafford & Gurney, 2004; Van Maanen et al., 2012) to propose that Piéron's Law is unrelated to stimulus scaling, but is a result of the architecture of the response selection (or decision making) process." The pedestal contrast is unrelated to the difficulty of the contrast discrimination task (except for the consideration of Fechner's law). Instead, Piéron's law would apply to the subjective difference in contrast in this task, as opposed to the pedestal contrast. The EEG results are consistent with these intuitions about Piéron's law (or more generally, that contrast is accumulated over time, so a later EEG component for lower pedestal contrast makes sense): pedestal contrast should lead to faster detection, but not necessarily faster discrimination. Perhaps, given the complexity of the manuscript as a whole, the predictions for the behavioural results could be simplified?  

      We agree that the initial version was confusing. We now clarified the presentation of Piéron's law at the end of the introduction (see also response to R2).

      Once Fechner's law is applied, decision difficulty increases with increasing contrast, so Piéron's law on the decision-relevant intensity (perceived difference in contrast) would also predict increasing RT with increasing pedestal contrast. It is unlikely that the data are of sufficient resolution to distinguish a log function from a power of a log function, but perhaps the claim on line 189 could be weakened (the EEG results demonstrate Piéron's law for detection, but do not provide evidence against Piéron's law in discrimination decisions).  

      This is an excellent observation, thank you for bringing it to our attention. Indeed, the data support the notion that Piéron’s law is related to detection, but do not rule out that it is also related to decision or discrimination. In earlier work, we (Donkin & Van Maanen, 2014) addressed this question as well, and reached a similar conclusion. After fitting evidence accumulation models to data, we found no linear relationship between drift rates and stimulus difficulty, as would have been the case if Piéron's law could be fully explained by the decision process (as -indirectly- argued by Stafford & Gurney, 2004; Van Maanen et al., 2012). The fact that we observed evidence for a non-linear relationship between drift rates and stimulus difficulty led us to the same conclusion, that Piéron’s law could be reflected in both discrimination and decision processes. We added the following comment to the discussion about the functional locus of Piéron's law to clarify this point:

      “Note that this observation is not necessarily incongruent with theoretical work that argued that Piéron’s law could also be a result of a response selection mechanism (Stafford and Gurney, 2004; Van Maanen et al., 2012; Palmer et al., 2005). It could be that differences in stimulus intensity between the two options also contribute to a Piéron-like relationship in the later intervals, that is convoluted with Fechner’s law (see Donkin and Van Maanen, 2014, for a similar argument). Unfortunately, our data do not allow us to discriminate between a pure logarithmic growth function and one that is mediated by a decreasing power function”.

      (2) Appendix 1 shows that the event detection of the HMP method will also pick up on ramping activity. The description of the problem in the introduction is that event-like activity could look like ramping when averaged across trials. To address this problem, the authors should simulate events (with some reasonable dispersion in timing such that they look like ramping when averaged) and show that the HMP method would not pull out something that looked like ramping. In other words, the evidence for ramping in this work is not affected by the previously identified confounds.  

      We agree that this demonstration was necessary and have added the suggested simulation to Appendix 1. As can be seen in Figure 1 of the appendix, when we simulate a half-sine, the average ERP aligned to the timing of the event looks like a half-sine.
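
      The general point behind this simulation can be illustrated with a minimal sketch. It is not the simulation reported in Appendix 1 and it does not use the HMP method; the trial length, event duration, jitter range, and all function names are assumptions chosen for illustration. A half-sine event is placed at a random onset in each trial. Averaging trials aligned to stimulus onset smears the event into a low, broad deflection, whereas averaging aligned to the event onset recovers the half-sine.

```python
import math
import random


def half_sine(n_samples):
    """Half period of a sine wave sampled at n_samples points."""
    return [math.sin(math.pi * (i + 0.5) / n_samples) for i in range(n_samples)]


def simulate_trials(n_trials=500, trial_len=200, event_len=20, jitter=60, seed=0):
    """Place one half-sine event per trial at a uniformly jittered onset.

    All parameter values are hypothetical and only serve the illustration.
    """
    rng = random.Random(seed)
    template = half_sine(event_len)
    trials, onsets = [], []
    for _ in range(n_trials):
        onset = rng.randint(50, 50 + jitter)  # onset varies from trial to trial
        signal = [0.0] * trial_len
        for i, value in enumerate(template):
            signal[onset + i] = value
        trials.append(signal)
        onsets.append(onset)
    return trials, onsets


def average(rows):
    """Sample-wise mean across equally long rows."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


EVENT_LEN = 20
trials, onsets = simulate_trials(event_len=EVENT_LEN)

# Stimulus-locked average: the jitter spreads the event over time.
stimulus_locked = average(trials)

# Event-locked average: each trial is cut at its own event onset first.
event_locked = average([t[o:o + EVENT_LEN] for t, o in zip(trials, onsets)])

print("peak of stimulus-locked average:", round(max(stimulus_locked), 3))
print("peak of event-locked average:  ", round(max(event_locked), 3))
```

      In this sketch the true onsets are known. In real data they would have to be estimated, which is the step the event-detection method is meant to perform.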

      (3) Some readers may be interested in a fuller discussion of the failure of the Fechner diffusion model in the speed condition.  

      We are unsure which failure the reviewer refers to but assumed it was in relation to the behavioral results and thus added: 

      It is unlikely that neither Piéron nor Fechner law impact the RT in the speed condition. Instead this result is likely due to the composite nature of the RT where both laws co-exist in the RT but cancel each other out due to their opposite prediction.

      Minor Comments:  

      (1) "By-trial" is used throughout. Normally, it is "trial-by-trial" or "single-trial" or "trial-wise".

      We replaced all occurrences of “by-trial” with the three suggested terms where appropriate.

      (2) Line 22: "The sum of the times required for the completion of each of these precessing steps is the reaction time (RT)." The total time required. Processing.  

      Corrected for both.

      (3) Line 26/27: "Despite being an almost two century old problem (von Helmholtz, 2021)." Perhaps the citation with the original year would make this point clearer.  

      We agree and replaced the citation.

      (4) Line 73: "accounted by estimating". Accounted for by estimating.  

      Corrected.

      (5) Line 77 "provides an estimation on the." Of the.  

      Corrected.

      (6) Line 86: "The task of the participants was to answer which of two sinusoidal gratings." The picture looks like Gabor's? Is there a 2d Gaussian filter on top of the grating? Clarify in the methods, too.  

      We described the stimuli incorrectly; they were indeed Gabor patches. This is now corrected both in the main text and in the method section.

      (7) Figure 1 legend: "The Fechner diffusion law" Fechner's law or your Fechner diffusion model?  

      “Law” was incorrect, so we changed it to “model” as suggested.

      (8) Line 115: "further allows to connects the..." Allows connecting the.  

      Corrected.

      (9) Line 123: "lower than 100 ms or higher than..." Faster/slower.  

      Corrected.

      (10) Line 131: "To test what law." Which law.?  

      Corrected to model.

      (11) Figure 2 legend: "Left: Mean RT (dot) and average fit (line) over trials and participants for each contrast level used." The fit is over trials and participants? Each dot is? Average trials for each contrast level in each participant?  

      This sentence was corrected to “Mean RT (dot) for each contrast level and averaged predictions of the individual fits (line) with Accuracy (Top) and Speed (Bottom) instructions.”.

      (12) Line 231: "A comprehensive analysis of contrast effect on". The effect of contrast on.  

      This title was changed to “functional interpretation of the events”.

      (13) Line 23: "the three HMP event with". Three HMP events.

      The sentence no longer exists in the revised manuscript.

      (14) Line 270: "Secondly, we computed the Pearson correlation coefficient between the contrast averaged proportion of correct." Pearson is for continuous variables. Proportion correct is not continuous. Use Spearman, Kendall, or compute d'.  

      The reviewer rightly pointed out our error. We corrected it by computing the Spearman correlation instead.
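
      For readers unfamiliar with the distinction, the Spearman coefficient is the Pearson coefficient computed on ranks, so it only assumes a monotonic relationship. The sketch below is a standard-library illustration, not the analysis code used for the manuscript; the two data series and all function names are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-contrast values, for illustration only.
contrast = [0.1, 0.2, 0.4, 0.6, 0.8]
proportion_correct = [0.55, 0.70, 0.88, 0.95, 0.97]

print("Pearson: ", round(pearson(contrast, proportion_correct), 3))
print("Spearman:", round(spearman(contrast, proportion_correct), 3))
```

      Because the hypothetical series is monotonic but saturating, the rank-based coefficient reaches its maximum while the Pearson coefficient does not.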

      (15)  Line 377: "trial 𝑛 + 1 was randomly sampled from a uniform distribution between 0.5 and 1.25 seconds." It's just confusing why post-response activity in Figure 5 does look so consistent. Throughout methods: "model was fitted" should be "was fit", and line 448, "were split".  

      We do not have a specific hypothesis of why the post-response activity in the previous Figure 5 was so consistent. Maybe the Gaussian window (same as in other manuscripts with a similar figure, e.g. O’Connell et al. 2012) generated this consistency. We also corrected the errors mentioned in the methods.

      (16) The linear mixed models paragraph is a bit confusing. Can it clearly state which data/ table is being referred to and then explain the model? "The general linear mixed model on proportion of correct responses was performed using a logit link. The linear mixed models were performed on the raw milliseconds scale for the interval durations and on the standardized values for the electrode match." We go directly from proportion correct to raw milliseconds...  

      The confusion was indeed due to the initial inclusion of a general linear mixed model on proportion correct which was removed as it was not very informative. The new revision should be clearer on the linear mixed models (see first sentence of subsection ‘linear mixed models' in the method section).

      (17) A fuller description of the HMP model would be appreciated.  

      We agree that this was necessary and added a description of the HMP model in the corresponding method section “Hidden multivariate pattern”, in addition to a more comprehensive presentation of HMP in the first paragraph of the Results and Discussion sections.

      (18) Line 458: "Fechner's law (Fechner, 1860) states that the perceived difference (𝑝) between the two patches follows the logarithm of the difference in physical intensity between..." ratio of physical intensity.  

      Corrected.

      (19) P is defined in equations 2 and 4. I would include the beta in equation 4, like in equation 2, then remove the beta from equations 3 and 5 (makes it more readable). I would also just include the delta in equation 2, state that in this case, c1 = c+delta/2 or whatever.  

      This indeed makes the equation more readable so we applied the suggestions for equations 2, 3, 4 and 5. The delta was not added in equation 2 but instead in the text that follows:

      “Where 𝐶1 = 𝐶0 + 𝛿, again with a modality and individual specific adjustment slope (𝛽).” 

      (20) The appendix suggests comparing the amplitudes with those in Figure 3, but the colour bar legend is missing, so the reader can only assume the same scale is used?  

      We added the color bar, which was indeed missing. Note, though, that the previous version displayed the estimation for the simulated data, while this plot in the revised manuscript shows the solution on real data obtained after downsampling (which therefore looks for a larger pattern than in the main text). We believe that this representation is more useful given that the solution for the downsampled data is no longer the same as the one in the main text (due to the difference in pattern width).

    1. The pervasiveness of these formats means that our culture uses the style and content of these shows as ways to interpret reality. For example, think about a TV news program that frequently shows heated debates between opposing sides on public policy issues. This style of debate has become a template for handling disagreement to those who consistently watch this type of program.

      This passage explains that when we watch certain media styles over and over, we start using them to understand real life. If a news show always shows loud, heated arguments, viewers may think that’s the “normal” way to handle disagreements. Media formats can quietly shape how people act and communicate.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      We are grateful to the reviewers for their thoughtful and constructive evaluations of our manuscript. Their comments helped us clarify key aspects of the study and strengthen both the presentation and interpretation of our findings. The central goal of this work is to dissect how the opposing activities of GATA4 and CTCF coordinate chromatin topology and transcriptional timing during human cardiomyogenesis. The reviewers’ feedback has allowed us to refine this message and better contextualize our results within the broader framework of chromatin regulation and cardiac development.

      In response to the reviews, in our preliminary revision we have already implemented substantial improvements to the manuscript, including additional analyses, clearer data visualization, and revisions to the text to avoid overinterpretation. These refinements enhance the robustness of our conclusions without altering the overall scope of the study. A small number of additional analyses and experiments are ongoing and will be added to the full revision, as detailed below.

      We believe that the revised manuscript, together with the planned updates, fully addresses the reviewers’ concerns and substantially strengthens the contribution of this work to the field.

      Reviewer 1 – Point 1:

      In the datasets you are examining, what are the relative percentages in each of the four groups relating compartmentalization change to expression change (A→B, expression up; A→B, down; B→A, up; B→A, down)?

      We quantified compartment–expression relationships using Hi-C and bulk RNA-seq from H9 ESCs and CMs. The percentages for each category are shown below and incorporated into updated Figure S2H.

      | Group  | Downregulated in CM | Upregulated in CM |
      |--------|---------------------|-------------------|
      | A-to-A | 11.92%              | 8.44%             |
      | A-to-B | 18.20%              | 2.79%             |
      | B-to-A | 7.96%               | 18.07%            |
      | B-to-B | 14.36%              | 6.44%             |

      A chi-squared test comparing observed vs. expected distributions (based on gene density across bins) confirmed a strong association between compartment dynamics and transcriptional behavior. B-to-A genes are significantly enriched among genes upregulated in CMs, while A-to-B genes are enriched among those downregulated (updated Figure S2H).
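
      As an illustration of the kind of test involved, the sketch below computes a chi-squared statistic for a 4×2 table of compartment class by expression change. Two caveats apply. First, it is a test of independence on a contingency table, which is a related but different setup from the observed-versus-expected comparison based on gene density described above. Second, the gene counts are hypothetical placeholders, because only percentages are reported here; the function name is likewise an assumption.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column categories were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical gene counts (downregulated in CM, upregulated in CM).
# These are NOT the counts behind the reported percentages.
counts = {
    "A-to-A": (120, 85),
    "A-to-B": (90, 14),
    "B-to-A": (40, 90),
    "B-to-B": (70, 32),
}

statistic, dof = chi_squared_independence([list(v) for v in counts.values()])

# Critical value of the chi-squared distribution for 3 degrees of freedom
# at alpha = 0.05.
CRITICAL_DF3_ALPHA_05 = 7.815

print("chi-squared:", round(statistic, 2), "dof:", dof)
print("exceeds 5% critical value:", statistic > CRITICAL_DF3_ALPHA_05)
```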

      We next assessed with GSEA how these gene classes respond to GATA4 and CTCF knockdown. In 2D CMs, GATA4 knockdown reduces expression of CM-upregulated B-to-A genes and increases expression of CM-downregulated A-to-B genes, whereas CTCF knockdown produces the opposite pattern (updated Figure 2F).

      Applying the same analysis to cardioid bulk RNA-seq (updated Figure 4E) revealed the strongest effects in SHF-RV organoids, consistent with monolayer data. In SHF-A organoids, only GATA4 knockdown had a measurable impact on CM-upregulated B-to-A and CM-downregulated A-to-B genes. Because the subsets of CM-downregulated B-to-A and CM-upregulated A-to-B genes were very small and showed no consistent trends, Figure 4 focuses on the two informative categories only. The full classification is provided in Reviewer Figure 1 below.

      (The figure cannot be rendered in this text-only format)

      Reviewer Figure 1. GSEA for CM-upregulated B-to-A and CM-downregulated A-to-B genes. p-values by Adaptive Monte-Carlo Permutation test.

      Reviewer 1 – Point 2

      This phrase in the abstract is imprecise: ‘whereas premature CTCF depletion accelerates yet confounds cardiomyocyte maturation.’


      The abstract has been revised to: “whereas premature CTCF depletion accelerates yet alters cardiomyocyte maturation.” (lines 29-30).

      Reviewer 1 – Point 3

      Regarding this statement: "Disruption of [3D chromatin architecture] has been linked to genetic dilated cardiomyopathy (DCM) caused by lamin A/C mutations8,9, and mutations in chromatin regulators are strongly enriched in de novo congenital heart defects (CHD)10, underscoring their pathogenic relevance11." The first studies to implicate chromatin structural changes in heart disease, including the role of CTCF in that process, were PMID: 28802249, a model of acquired, rather than genetic, disease.

      We added the following sentence to the paragraph introducing CTCF: “Moreover, depletion of CTCF in the adult cardiomyocytes leads to heart failure28,29.” (line 72)

      Reviewer 1 – Point 4

      Can you quantify this statement: ‘the compartment switch coincided with progressive reduction of promoter–gene body interactions’?

      We quantified promoter–gene body contacts by calculating the area under the curve (AUC) of the virtual 4C signal derived from H9 Hi-C data across differentiation. As a result of this analysis we added the following sentence: “Quantitatively, interactions between the TTN promoter and its gene body decreased by ~55% from the pluripotent stage to day 80 cardiomyocytes.” (lines 89-91).
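
      A minimal sketch of this kind of quantification follows. It integrates a contact profile with the trapezoidal rule and reports the relative change between two stages. The bin positions, signal values, and function names are hypothetical and are not derived from the Hi-C data; the actual analysis may have used a different integration scheme or normalization.

```python
def trapezoid_auc(positions, signal):
    """Area under a sampled curve using the trapezoidal rule."""
    return sum(
        (positions[i + 1] - positions[i]) * (signal[i] + signal[i + 1]) / 2
        for i in range(len(positions) - 1)
    )


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Hypothetical virtual 4C profiles over the gene body (positions in kb
# from the promoter viewpoint, signal in arbitrary units).
positions_kb = [0, 50, 100, 150, 200, 250]
signal_pluripotent = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
signal_cardiomyocyte = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2]

auc_before = trapezoid_auc(positions_kb, signal_pluripotent)
auc_after = trapezoid_auc(positions_kb, signal_cardiomyocyte)

print("AUC before:", auc_before)
print("AUC after: ", auc_after)
print("change (%):", round(percent_change(auc_before, auc_after), 1))
```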


      Reviewer 1 – Point 5

      Regarding this statement: "six regions became less accessible in CMs, correlating with ChIP-seq signal for the ubiquitous architectural protein CTCF." I don't see 6 ATAC peaks in either TTN trace in Figure 1A.

      We corrected the text as follows: “TTN experienced clear changes in chromatin accessibility during CM differentiation: ATAC-seq identified two CM-specific peaks that correlated with ChIP-seq signal for the cardiac pioneer TF GATA4 at the two promoters, one driving full length titin and the other the shorter cronos isoform. In contrast, two regions became less accessible in CMs, correlating with two of the six ChIP-seq peaks for the ubiquitous architectural protein CTCF” (lines 93-97). We attribute the differences between ChIP-seq and ATAC-seq profiles to methodological sensitivity and/or biological variability between datasets generated in different laboratories and cell batches.

      Reviewer 1 – Point 6

      Western blots need molecular weight markers.

      We edited the relevant panels accordingly (updated Figures 1E and 2B).

      Reviewer 1 – Point 7

      Regarding this statement: "The decrease in CTCF protein levels may explain its selective detachment from TTN during cardiomyogenesis." At face value, these findings suggest the opposite: i.e. that a massive downregulation of CTCF at protein level should affect its binding across the genome, which is not tested and is hard to evaluate between ChIP-seq studies from different groups and from different developmental timeframes.

      We revised the text to avoid implying selective detachment and performed a genome-wide analysis of CTCF occupancy using ENCODE ChIP-seq datasets generated by the same laboratory with matched protocols in hESCs and hESC-derived CMs. This analysis shows that 43.2% of CTCF sites present in ESCs are lost in CMs, whereas only 5.7% are gained, confirming a broad reduction in CTCF binding during differentiation. These results are now included in updated Figure 1B.

      Reviewer 1 – Point 8a

      A couple thoughts on the FISH experiments in Figure 2. A claim of 'impaired B-A transition' would be more convincing if you show, by FISH, that the relative distance of TTN from lamin B increases with differentiation.

      Although prior work from us and others has established that TTN transitions from the nuclear periphery in hESCs to a more internal position during cardiomyogenesis (Poleshko et al. 2017; Bertero et al. 2019a), we are reproducing this trajectory in WTC11 hiPSCs as part of the FISH experiments for the full revision.

      Reviewer 1 – Point 8b

      In the [FISH] images: are you showing a total projection of all z planes? One assumes the quantitation is relative to a 3D reconstruction in which the lamin B signal is restricted to the periphery. Have you shown this?

      Quantification was performed on full 3D reconstructions from Z-stacks, as detailed in the Methods (lines 721-727). While the original submission displayed maximum-intensity projections, updated Figure 2D and Figure S2E now show representative single optical sections, which more clearly highlight the spatial relationship between the TTN locus and the nuclear lamina.

      Reviewer 1 – Point 8c

      Lastly, these data are very interesting and important, provoking reexamination of your interpretation of the results in Figure 1. Figure 1 was interpreted to show that less CTCF binding led to decreased lamina (and thus B compartment) association during development. Figure 2 shows that depleting CTCF does not change association of TTN with lamina.

      Our interpretation is that by day 25 of hiPSC-CM differentiation the TTN locus may have reached its maximal radial repositioning even in control cells, limiting the ability to detect earlier effects of CTCF depletion. To test whether CTCF knockdown accelerates lamina detachment at earlier stages, we are repeating the FISH analysis for the inducible CTCF knockdown line at multiple time points during differentiation.

      Reviewer 1 – Point 9

      A thought about this statement: "Altogether, these results suggest that GATA4 and CTCF function as positive and negative regulators of B-to-A compartment switching, likely acting through global and local chromatin remodeling, respectively." GATA4 induces TTN expression and its knockdown prevents TTN expression-the evidence that GATA4 affects compartmentalization is unclear. By activating the gene, GATA4 may shift TTN to B classification.

      Our current data do not allow us to disentangle whether GATA4-driven transcriptional activation precedes or follows the B-to-A compartment shift. We have therefore removed the mechanistic speculation from this sentence to avoid overinterpretation. Nevertheless, the analyses in updated Figure 2F, discussed in the response to Reviewer 1 - Point 1, show that GATA4 knockdown preferentially reduces expression of CM-upregulated B-to-A genes, while CTCF knockdown has the opposite effect, supporting the conclusion that both factors influence the transcriptional programs associated with B-to-A transitions.

      Reviewer 1 – Point 10

      I'm not sure what I am looking at in Figure 3C. Are those traces integration of interactions over a defined window? "Each [mutant is] clearly different from WT" is not obvious from the presentation. The histograms are plotting AUC of what? Interactions of those peaks with the mutated region? I genuinely appreciate how laborious this experiment must have been and encourage you to explain better what you are showing.

      We revised the main text to avoid overstating the differences (“clearly” was replaced with “in a similar manner”, line 192) and expanded the legends of updated Figures 3C–D to clarify what is being shown: “(C) 4C-seq in hiPSCs using the promoter-proximal region of TTN as viewpoint. The top panel shows raw interaction profiles. The lower panels plot pairwise differences between conditions to reveal subtle changes. A schematic indicating the 4C viewpoint is included for clarity. Right inset: zoom of the CBS4–5 region. Mean of n = 3 cultures. (D) AUC of the differential 4C-seq signal for defined intervals (panel C). p-values by one-sample t-test against μ = 0.”. We also added a visual cue in updated Figure 3C indicating the 4C viewpoint to facilitate interpretation.
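
      To make the statistic in panel D concrete, the sketch below computes a one-sample t statistic against μ = 0 for three differential AUC values, matching the n = 3 cultures stated in the legend. The values themselves and the function name are hypothetical. With only two degrees of freedom, the two-sided 5% critical value is about 4.303, so even a consistent directional trend can fail to reach significance.

```python
import math
import statistics


def one_sample_t(values, mu=0.0):
    """t statistic and degrees of freedom for H0: population mean == mu."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    t_stat = (mean - mu) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical differential AUC values (mutant minus WT), one per culture.
differential_auc = [0.8, 1.1, 0.2]

t_stat, dof = one_sample_t(differential_auc)

# Two-sided critical value of Student's t for 2 degrees of freedom, alpha = 0.05.
CRITICAL_T_DF2 = 4.303

print("t =", round(t_stat, 3), "dof =", dof)
print("significant at 5%:", abs(t_stat) > CRITICAL_T_DF2)
```

      In this hypothetical example all three values point in the same direction, yet the statistic stays below the critical value, which shows how little power three replicates provide.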

      Reviewer 1 – Point 11

      Again acknowledging how challenging these experiments are: when you mutant a locus, you change CTCF binding but you also change the DNA. Thus, attributing the changes in interactions to presence/absence of CTCF binding is difficult, because the DNA substrate itself has changed. Perhaps you are presenting all of this as a negative result, given the modest effect on transcription, which is as important as a positive result, given the assumptions usually made about such things. But the results are not clearly described and your interpretation seems to go between implying the structural change causative and being agnostic.

      We recognize that deleting a genomic region can affect both CTCF binding and the DNA substrate itself. For this reason, we implemented two parallel genome-editing strategies:

      (1) a straightforward Cas9-mediated deletion of ~100 bp centered on each CBS, and

      (2) a more precise HDR approach replacing only the 20 bp core CTCF motif.

      Because the HDR strategy succeeded, all downstream analyses were carried out on these minimal edits, which substantially limit disruption of other transcription factor motifs and reduce the likelihood of sequence-dependent polymer effects unrelated to CTCF.

      Nevertheless, to avoid implying unwarranted causality in the absence of more conclusive evidence, we added a paragraph to the Discussion outlining these limitations, including the sentence: “Our study also reflects general challenges in separating chromatin-architectural and transcriptional mechanisms. Although the CBS edits were restricted to the core CTCF motifs, additional sequence-dependent effects cannot be fully excluded, and we therefore interpret the resulting changes as consistent with—but not exclusively due to—loss of CTCF binding.” (lines 365-368)

      Reviewer 1 – Point 12

      Figure 4C: since you have RNA-seq data, a much more objective way to present these data would be to show all data (again, A-B, up; A-B, down; B-A, up; B-A, down) and the effects of CTCF or GATA4. Regardless, you can still focus on the cardiac specific genes. But my guess is if you examine all genes, the pattern you show in panel C will not be present in the majority of cases. Furthermore, if this hypothesis is wrong, such an analysis will allow you to identify other genes affected by the mechanisms you describe and your analysis will test whether these mechanisms are in fact conserved at different loci.

      As outlined in our response to Point 1, we extended the analysis to all genes undergoing compartment changes and incorporated this into the cardioid RNA-seq dataset. This revealed a clear and consistent relationship between GATA4 or CTCF knockdown and the expression of B-to-A and A-to-B gene classes (updated Figure 4E).

      Reviewer 2 - Point 1.1

      1. CTCF regulation at TTN locus:

      (1) Figure 1A: The claim of the authors about convergent CTCF sites and transcriptional activation of TTN is quite simplistic. This claim is only valid when we know where cohesin is loaded. If cohesin is loaded at then intragenic GATA4 binding site, then the only important CTCF sites is at the promoter of TTN. I suggest that the authors read few more publications which may help the authors to better understand how cohesin and CTCF team up to regulate transcription, such as Hsieh et al., Nature Genetics, 2022; Liu et al., Nature Genetics, 2021; Rinzema et al., Nature Structural and Molecular Biology, 2022.

      Suggestion: The authors should add cohesin (RAD21/SMC1A) and NIPBL ChIP-seq for better interpretation.

      In line with the reviewer’s insightful suggestion, we integrated cohesin ChIP-seq data into updated Figure 1A. Specifically, we added a RAD21 ChIP-seq track from hESCs, which provides direct evidence of cohesin occupancy across the TTN locus. RAD21 binding closely parallels CTCF binding at five sites within the gene body, supporting a model in which promoter-proximal CTCF anchors cohesin to stabilize repressive loops at this locus. This analysis substantially strengthens the mechanistic framework and is consistent with the studies recommended by the reviewer, which we have now cited (lines 68 and 104).

      Reviewer 2 - Point 1.2

      (2) Figure 3B: If delta2CBS only has heterozygenous deletion of CBS6, why we would expect the binding will be weaken to 50%. However, the CTCF binding is reduced to around 1/10 in the ChIP-qPCR. How do the authors explain this?

      Sequencing of the Δ2CBS line shows that one CBS6 allele carries the intended EcoRI replacement, while the second allele contains a 2-bp deletion within the core CTCF motif (Figure S3C). Remarkably, this small deletion is sufficient to abolish CTCF binding, resulting in complete loss of occupancy at CBS6 despite heterozygosity. We clarified this in the text as follows: “CTCF ChIP-qPCR in hiPSCs confirmed complete loss of CTCF binding at the targeted sites, including CBS6 in the Δ2CBS line, indicating that the 2-bp deletion sufficed to disrupt CTCF binding while occupancy at other CBSs remained unaffected.” (lines 187–189).

      Reviewer 2 - Point 1.3a

      (3) Figure 3C: There are two problems with the 4C experiments: (a) The changes are really mild. In fact, none of the p-values in Figure 3D are significant.

      The effect of deleting CBS1 is indeed modest, consistent with reports that individual CTCF binding sites often show functional redundancy (e.g., Rodríguez-Carballo et al. 2017; Barutcu et al. 2018; Kang et al. 2021). Nevertheless, our 4C-seq experiments have reproducibly shown the same directional trend across biological replicates. To increase statistical power and more rigorously assess the robustness of this effect, we are generating additional 4C replicates as part of the full revision.

      Reviewer 2 - Point 1.3b

      [In the 4C experiments] (b) The authors should also consider a model that CTCF directly serves as a repressor. In this way, 3D genome may not be involved. B-A switch is simply caused by the activation of the locus.

      We now explicitly acknowledge this possibility in the Discussion. The revised text states: “Moreover, our data cannot unambiguously separate CTCF’s architectural role from potential direct repressive activity. Both mechanisms could contribute to the observed effects, and our findings likely reflect the combined influence of CTCF on chromatin topology and gene regulation.” (lines 368–371).

      Reviewer 2 - Point 2.1a

      2. (CTCF) detachment: The authors mentioned few times "detachment". In the context of this manuscript, the authors indicate detachment from nuclear lamina. However, the authors haven't provide convincing evidence about this.

      In the two instances where we used the term “detachment,” we intended it to refer exclusively to reduced CTCF binding to DNA, not to lamina repositioning. To avoid ambiguity, we have replaced “detachment” with “reduced binding” in both locations (lines 123 and 329). We do not use this term to describe TTN–lamina positioning.

      Reviewer 2 - Point 2.1b

      (1) Figure 1D: I doubt whether such changes of CTCF protein abundance will lead to LAD detachment. Suggest the authors read van Schaik et al., Genome Biology, 2022. With the full depletion of CTCF, the effects on LADs are still very restricted.

      We agree that the observed correlation between reduced CTCF levels and the relocation of TTN away from a LAD does not establish causality. As outlined in our response to Reviewer 1 – Point 8c, we are performing additional FISH experiments at earlier differentiation stages in the CTCF inducible knockdown line to directly assess whether partial CTCF depletion is sufficient to alter the timing of TTN–lamina separation.

      Reviewer 2 - Point 2.2

      (2) Figure 2D: Lamin B1 should be mostly at nuclear periphery. I have few questions: (1) is the antibody specific? (2) do these cells carry mutation in LMNB1 gene? (3) is the staining actually LMNA?

      As also clarified in response to Reviewer 1 – Point 8b, the original images displayed maximum-intensity projections of Z-stacks, which obscured the peripheral distribution of LMNB1. We have updated Figure 2D and Figure S2E to show representative individual optical sections, which more clearly display the expected peripheral LMNB1 signal. We also confirm that the antibody used is specific for LMNB1 and previously validated (Bertero et al. 2019b), and that the WTC11-derived lines used in this study carry no mutation in LMNB1.

      Reviewer 2 - Point 3

      3. Opposite functions of GATA4 and CTCF: These data in Figure 5E-H argues the opposite role of GATA4 and CTCF in transcriptional regulation. Would it be that CTCF KD just affected cell proliferation, which is actually known for many cell types, rather than affect CM differentiation process? If this is the reason, inversed correlation between CTCF KD and GATA4 KD in Figure 4D could also be explained by opposite effects on cell cycle.

      We directly evaluated this possibility. In FHF–LV cardioids, cell cycle profiling in Figure 6C and Figure S6C (now S7C) showed that CTCF knockdown does not alter the distribution of CMs across G1/S/G2–M phases, in contrast to the marked increase in proliferation observed with GATA4 knockdown.

      Because this comment referred specifically to the SHF data, we also analyzed mitotic gene expression in the SHF–RV bulk RNA-seq dataset using GSEA. CTCF knockdown did not significantly enrich any cell cycle–related gene sets, whereas GATA4 knockdown produced a strong enrichment for mitotic cell cycle terms, in line with FHF-LV data (Reviewer Figure 2).

      These results are summarized in updated Figure S5C, reporting also the results of the broader GSEA analysis, and together indicate that the transcriptional divergence between CTCF and GATA4 knockdown is not simply explained by opposing effects on proliferation.

      (The figure cannot be rendered in this text-only format)

      Reviewer Figure 2. GSEA for mitotic cell cycle in SHF-RV after inducible knockdown of CTCF (left) or GATA4 (right). p-values by Adaptive Monte-Carlo Permutation test.

      Reviewer 2 - Point 4

      4. In discussion, the authors suggested that CTCF is a local chromatin remodeller. In my view, association with local chromatin compaction doesn't qualify CTCF as a chromatin remodeler. To my knowledge, CTCF does not have an enzymatic domain, then how does it remodel chromatin?

      Our intended meaning was that CTCF shapes 3D chromatin architecture through its role in organizing intergenic looping, not that it remodels chromatin enzymatically. To avoid confusion, we have removed the original sentence from the Discussion.

      Reviewer 2 - Point 5

      5. Some conclusions are drawn based on insignificant p-values, e.g. Figure 2F, Figure 3D, etc. The authors should be careful about their conclusion, and tone down their statement for the observations have borderline significance.

      The conclusions based on bulk RNA-seq have been revised in response to Reviewer 1 – Point 1 (updated Figure 2F). By subsetting B-to-A and A-to-B genes according to their expression dynamics, this analysis now yields clearer and statistically significant differences between conditions.

      Regarding the 4C-seq data, as acknowledged in our response to Reviewer 2 - Point 1.3a, the observed effects are modest. We are generating additional biological replicates to increase statistical power. In the meantime, we have adjusted the text to avoid overstating these findings. The revised manuscript now states: “While the difference did not reach significance, these trends suggest …” (lines 199–200).

      Reviewer 2 - Minor comment 1

      Minor comments: 1. Figure 1A: (1) I suggest to label two promoters in the gene model. It's unclear in the figure in the current version; (2) I was a bit confused with the way how the authors labeled CTCF directionality. I thought there are a lot of promoters. Why didn't they use triangles?

      We updated Figure 1A to label both TTN promoters and indicate their orientation. For CTCF sites, we now clearly display the motif direction and core binding region as determined by FIMO analysis of the CTCF ChIP-seq peaks, improving consistency and interpretability.

      Reviewer 2 - Minor comment 2

      2. Figure 2C: I think the drastical reduction of titin-mEGFP levels is only due to the way how the authors analyze their FACS data. Can the author quantify on median fluorescence intensity?

      The gating strategy for titin-mEGFP⁺ cells was defined using a reporter-negative control, and cells lacking TNNT2 expression showed no detectable titin-mEGFP signal, confirming the specificity of the gate. To complement this analysis, we also quantified the median fluorescence intensity (MFI) of titin-mEGFP⁺ cells. The MFI analysis corroborates the original findings, showing a significant decrease in GATA4 knockdown and an increase in CTCF knockdown (updated Figure S2D).

      Reviewer 2 - Minor comment 3

      3. Figure S2G: P value should be -log10, I assume. Please label it accurately.

      We appreciate the reviewer pointing out this labeling error. In the revised manuscript, this panel has been removed to accommodate the updated compartment–expression analysis now presented in updated Figure 2H (see response to Reviewer 1 – Point 1), and the issue is no longer applicable.

      References

      Barutcu AR, Maass PG, Lewandowski JP, Weiner CL, Rinn JL. 2018. A TAD boundary is preserved upon deletion of the CTCF-rich Firre locus. Nat Commun 9: 1444.

      Bertero A, Fields PA, Ramani V, Bonora G, Yardımcı GG, Reinecke H, Pabon L, Noble WS, Shendure J, Murry CE. 2019a. Dynamics of genome reorganization during human cardiogenesis reveal an RBM20-dependent splicing factory. Nat Commun 10: 1538.

      Bertero A, Fields PA, Smith AS, Leonard A, Beussman K, Sniadecki NJ, Kim D-H, Tse H-F, Pabon L, Shendure J, et al. 2019b. Chromatin compartment dynamics in a haploinsufficient model of cardiac laminopathy. J Cell Biol 218: 2919–44.

      Kang J, Kim YW, Park S, Kang Y, Kim A. 2021. Multiple CTCF sites cooperate with each other to maintain a TAD for enhancer–promoter interaction in the β-globin locus. FASEB J 35: e21768.

      Poleshko A, Shah PP, Gupta M, Babu A, Morley MP, Manderfield LJ, Ifkovits JL, Calderon D, Aghajanian H, Sierra-Pagán JE, et al. 2017. Genome-Nuclear Lamina Interactions Regulate Cardiac Stem Cell Lineage Restriction. Cell 171: 573–587.

      Rodríguez-Carballo E, Lopez-Delisle L, Zhan Y, Fabre PJ, Beccari L, El-Idrissi I, Huynh THN, Ozadam H, Dekker J, Duboule D. 2017. The HoxD cluster is a dynamic and resilient TAD boundary controlling the segregation of antagonistic regulatory landscapes. Genes Dev 31: 2264–2281.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #2

      Evidence, reproducibility and clarity

      Becca et al. characterized the functions of GATA4 and CTCF in the context of cardiomyogenesis. The authors aim to establish a link between 3D genome changes (A/B compartment and long-range chromatin interactions) and activation of cardiac specific genes such as TTN. They showed opposite effects of GATA4 and CTCF in regulating these genes as well as phenotypical traits. I have the following suggestions and questions:

      Major comments:

      1. CTCF regulation at TTN locus:

      (1) Figure 1A: The claim of the authors about convergent CTCF sites and transcriptional activation of TTN is quite simplistic. This claim is only valid when we know where cohesin is loaded. If cohesin is loaded at the intragenic GATA4 binding site, then the only important CTCF site is at the promoter of TTN. I suggest that the authors read a few more publications which may help the authors to better understand how cohesin and CTCF team up to regulate transcription, such as Hsieh et al., Nature Genetics, 2022; Liu et al., Nature Genetics, 2021; Rinzema et al., Nature Structural and Molecular Biology, 2022.

      Suggestion: The authors should add cohesin (RAD21/SMC1A) and NIPBL ChIP-seq for better interpretation.

      (2) Figure 3B: If delta2CBS only has a heterozygous deletion of CBS6, we would expect the binding to be weakened to 50%. However, the CTCF binding is reduced to around 1/10 in the ChIP-qPCR. How do the authors explain this?

      (3) Figure 3C: There are two problems with the 4C experiments: (a) The changes are really mild. In fact, none of the p-values in Figure 3D are significant; (b) The authors should also consider a model in which CTCF directly serves as a repressor. In this way, the 3D genome may not be involved. The B-to-A switch would simply be caused by the activation of the locus.

      2. (CTCF) detachment: The authors mentioned "detachment" a few times. In the context of this manuscript, the authors indicate detachment from the nuclear lamina. However, the authors haven't provided convincing evidence for this.

      (1) Figure 1D: I doubt whether such changes of CTCF protein abundance will lead to LAD detachment. Suggest the authors read van Schaik et al., Genome Biology, 2022. With the full depletion of CTCF, the effects on LADs are still very restricted.

      (2) Figure 2D: Lamin B1 should be mostly at the nuclear periphery. I have a few questions: (1) is the antibody specific? (2) do these cells carry a mutation in the LMNB1 gene? (3) is the staining actually LMNA?

      3. Opposite functions of GATA4 and CTCF: The data in Figure 5E-H argue for opposite roles of GATA4 and CTCF in transcriptional regulation. Could it be that CTCF KD just affected cell proliferation, which is actually known for many cell types, rather than affecting the CM differentiation process? If this is the reason, the inverse correlation between CTCF KD and GATA4 KD in Figure 4D could also be explained by opposite effects on the cell cycle.

      4. In the discussion, the authors suggested that CTCF is a local chromatin remodeler. In my view, association with local chromatin compaction doesn't qualify CTCF as a chromatin remodeler. To my knowledge, CTCF does not have an enzymatic domain, so how does it remodel chromatin?

      5. Some conclusions are drawn based on insignificant p-values, e.g. Figure 2F, Figure 3D, etc. The authors should be careful about their conclusions, and tone down their statements for observations that have borderline significance.

      Minor comments:

      1. Figure 1A: (1) I suggest to label two promoters in the gene model. It's unclear in the figure in the current version; (2) I was a bit confused with the way how the authors labeled CTCF directionality. I thought there are a lot of promoters. Why didn't they use triangles?
      2. Figure 2C: I think the drastical reduction of titin-mEGFP levels is only due to the way how the authors analyze their FACS data. Can the author quantify on median fluorescence intensity?
      3. Figure S2G: P value should be -log10, I assume. Please label it accurately.

      Significance

      Strengths and limitations:

      I feel that single-cell analysis and functional analysis of GATA4 and CTCF using cardiac organoid model are elegant. However, the weak part of the manuscript is the link between 3D genome and activation of TTN. I also think the authors should include more possible explanations for the interpretation of some genome organization data (CTCF site deletion, 4C, etc).

      Advance: The study does provide useful information to understand transcriptional regulation during cardiac lineage specification. The link between 3D genome and cardiac lineage specification is conceptually nice but needs more data to support.

      Audience: developmental biologists who are interested in heart development and molecular biologists with specific interests in gene regulation.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This report demonstrates that the gene expression output of the Wnt pathway, when controlled precisely by a synthetic light-based input, depends substantially on the frequency of stimulation. The particular frequency-dependent trend that is observed - anti-resonance, a suppression of target gene expression at intermediate frequencies given a constant duty cycle - is a novel aspect that has not been clearly shown before for this or other signaling pathways. The paper provides both clear experimental evidence of the phenomenon with engineered cellular systems and a model-based analysis of how the pairing of rate constants in pathway activation/deactivation could result in such a trend.

      Strengths:

      This report couples in vitro experimental data with an abstracted mathematical model. Both of these approaches appear to be technically sound and to provide consistent and strong support for the main conclusion. The experimental data are particularly clear, and the demonstration that Brachyury expression is subject to anti-resonance in ESCs is particularly compelling. The modeling approach is reasonably scaled for the system at the level of detail that is needed in this case, and the hidden variable analysis provides some insight into how the anti-resonance works.

      Weaknesses:

      (1) The anti-resonance phenomenon has not been demonstrated using physiological Wnt ligands; however, I view this as only a minor weakness for an initial report of the phenomenon. The potential significance of the phenomenon for Wnt outweighs the amount of effort it would take to carry the demonstration further - testing different frequencies/duty cycles at the level of ligand stimulus using microfluidics could get quite involved, and would likely take quite some time. Adding some more discussion about how the time scales of ligand-receptor binding could play into the reduced model would further ameliorate this issue.

      We thank the reviewer for this comment and the interesting suggestion to test the anti-resonance phenomenon with microfluidics. We agree that combining physiological Wnt ligands with microfluidic stimulation would go beyond the scope of this current study, though it is an interesting extension. One advantage of the optogenetic setup, as mentioned in the discussion, is that the Wnt stimulus can be turned off sharply. This allows us to test the output from perfectly square wave input profiles; in microfluidics, washing the sticky ligand off the cells might “smear” the effective input profile cells respond to.

      We show in Supplement Fig. 6, that our reduced model matches the experimental data and that we would expect the antiresonance phenomenon as long as (see Fig. 4). Practically, a smeared input profile implies an effective reduction of 𝑘<sub>off</sub>, which means that the phenomenon would be visible with microfluidics (provided the minimum is deep enough, see Fig. 4). However, this should still be considered with caution, as the antiresonance would then appear because the cells essentially receive a smeared out or continuous pulse in the high frequency limit, rather than cells responding to a square wave in a specific way.
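
      The argument above, that a slowly washed-out ligand turns a square-wave stimulus into a smeared, nearly continuous input at high frequency, can be illustrated with a toy calculation. The sketch below is not the reduced model from the manuscript: it only passes an ideal square wave through a first-order low-pass filter, and the names `square_wave` and `smear`, the washout time constant `tau_wash`, and all numerical values are assumptions introduced for illustration.

```python
def square_wave(t, period, duty=0.5):
    """Ideal optogenetic-style input: 1.0 during the ON part of each cycle, else 0.0."""
    return 1.0 if (t % period) < duty * period else 0.0


def smear(signal, dt, tau_wash):
    """First-order low-pass filter (explicit Euler); requires dt < tau_wash.

    The effective input relaxes toward the applied signal with time
    constant tau_wash, mimicking slow ligand washout (an assumption).
    """
    smeared, y = [], 0.0
    for u in signal:
        y += dt * (u - y) / tau_wash
        smeared.append(y)
    return smeared


# Hypothetical values in arbitrary time units: washout slower than the period.
dt, period, tau_wash = 0.01, 1.0, 2.0
steps = 2000

ideal = [square_wave(i * dt, period) for i in range(steps)]
effective = smear(ideal, dt, tau_wash)

# Compare the OFF-phase floor of the two inputs over the final cycle.
last_cycle = effective[-int(round(period / dt)):]
print(min(ideal), min(last_cycle), max(last_cycle))
```

      With a washout time constant longer than the stimulation period, the smeared input stays well above zero during the OFF phases, whereas the ideal square wave returns fully to zero. This is the sense in which cells would effectively receive a quasi-continuous pulse.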

      (2) While the model is fully consistent with the data, it has not been validated using experimental manipulations to establish that the mechanisms of the cell system and the model are the same. There may be some ways to make such modifications, for example, using a proteasome inhibitor. An alternative would be to more explicitly mention the need to validate the model's mechanism with experiments.

      We thank the reviewer for this valuable and constructive comment. We agree that future experimental perturbations that directly modulate pathway activation and reset kinetics—such as proteasome inhibition, targeted degradation of pathway components, or engineered changes in receptor turnover—would provide an important validation of the model’s mechanistic interpretation. In the present study, our primary goal was to establish the existence and quantitative features of anti-resonance in the Wnt pathway and to identify the minimal set of timescale relationships that can explain it. We view the proposed experimental validations as exciting next steps that extend beyond the scope of the current work, and we are grateful to the reviewer for emphasizing their importance. We now mention this explicitly in the discussion of our manuscript.

      (3) I think the manuscript misses an opportunity to discuss the potential of the phenomenon in other pathways. The hedgehog pathway, for example, involves GSK3-mediated partial proteolysis of a transcription factor, which could conceivably be subject to similar behaviors, and there are certainly other examples as well.

      We thank the reviewer for pointing out an opportunity to emphasize the possibility of this phenomenon in other pathways. The minimal model indicates that anti-resonance emerges whenever a rapid activating process is paired with a slower deactivating/reset process. Beyond Hedgehog/Gli processing, candidate circuits include: NF-κB (rapid IκBα phosphorylation/degradation vs slower IκBα resynthesis), ERK (fast phosphorylation bursts vs slower transcriptional negative feedback such as DUSPs), Notch (fast γ-secretase NICD release vs slower NICD turnover and feedback), BMP/TGF-β–SMAD (fast R-SMAD phosphorylation vs slower receptor trafficking/SMAD7 feedback), and Hippo/YAP (rapid cytoplasmic sequestration vs slower transcriptional feedback). Each contains the same timescale separation that should create a frequency ‘stop-band,’ predicting suppressed gene expression or fate transitions at intermediate stimulation frequencies. We have updated the manuscript’s discussion to mention the Hedgehog connection with the following added sentence in the discussion: Analogous band-stop filtering should arise in other developmental circuits that couple a fast ‘ON’ step to slower deactivation or negative feedback. In Hedgehog, for example, PKA/CK1/GSK3-mediated partial proteolysis of Gli with slower recovery of full-length Gli creates the same fast-activation/slow-reset motif our hidden-variable model predicts will yield anti-resonance, and Wnt–Hedgehog crosstalk through the shared kinase GSK3 suggests such frequency selectivity could occur in other developmental signaling pathways.

      We also added an additional sentence regarding different activation and deactivation timescales in other pathways.

      (4) Some aspects of the modeling and hidden variable analysis are not optimally presented in the main text, although when considered together with the Supplemental Data, there are no significant deficiencies.

      We have addressed the model choices and analysis now more clearly in the main manuscript and also referred to the Supplemental Data more directly.

      Reviewer #2 (Public review):

      Summary:

      By combining optogenetics with theoretical modelling, the authors identify an anti-resonance behavior in the Wnt signaling pathway. This behavior is manifested as a minimal response at a certain stimulation frequency. Using an abstracted hidden variable model, the authors explain their findings by a competition of timescales. Furthermore, they experimentally show that this anti-resonance influences the cell fate decision involved in human gastrulation.

      Strengths:

      (1) This interdisciplinary study combines precise optogenetic manipulation with advanced modelling.

      (2) The results are directly tested in two different systems: HEK293T cells and H9 human embryonic stem cells.

      (3) The model is implemented based on previous literature and has two levels of detail: i) a detailed biochemical model and ii) an abstract model with a hidden parameter.

      Weaknesses:

      (1) While the experiments provide both single-cell data and population data, the model only considers population data.

      We thank the reviewer for correctly pointing out that the single-cell measurements would in principle allow us to incorporate the cell-to-cell heterogeneity into the model. In this study, we sought to identify a minimal quantitative model of the Wnt pathway that could explain anti-resonance through competing time scales. We believe that, for our purposes, focusing on population data allowed us to keep the complexity of the model to a minimum to increase its explanatory value. We agree with the reviewer that considering single-cell trajectories is an interesting direction for further work.

      (2) Although the model captures the experimental data for TopFlash very well, the beta-Cat curves (Figure 2B) are only described qualitatively. This discrepancy is not discussed.

      Indeed, our model fits to mean β-catenin expressions are more qualitative than for TopFlash. The fit for β-catenin was tricky, as expression of β-catenin is typically low and closer to the detectable limits than TopFlash. These experimental constraints mean that the variation between individual signal trajectories is higher for β-catenin compared to the light-off condition than for TopFlash. Therefore, we strove to obtain a qualitative rather than a quantitative fit to the mean expression profile in β-catenin.  The current model fit is well within the standard deviation of variation. Given the observed heterogeneity and the fact that we take the parameters from literature (which ensures that the order of magnitude of parameters is in a sensible range), we believe that the model fits are reasonable. We now mention this explicitly in the text.

      Overall Assessment:

      The authors convincingly identified an anti-resonance behavior in a signaling pathway that is involved in cell fate decisions. The focus on a dynamic signal and the identification of such a behavior is important. I believe that the model approach of abstracting a complicated pathway with a hidden variable is an important tool to obtain an intuitive understanding of complicated dependencies in biology. Such a combination of precise optogenetic manipulation with effective models will provide a new perspective on causal dependencies in signaling pathways and should not be limited only to the system that the authors study.

      We thank both reviewers for the positive assessment of our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      There are several points that deserve more discussion, as noted above in the review.

      (1) It would be worthwhile to consider whether a relatively simple experiment with a proteasome inhibitor or similar pharmacological manipulation could provide useful validation data for the model.

      We address this point above in the weaknesses section from reviewer 1.

      (2) The figure legend for S5C should clarify whether the values plotted are at a particular fixed time point, or (more likely) at a certain time following the second pulse, which would be variable.

      We have modified the figure caption to clarify that the values plotted are at a fixed time point in the simulation (t = 48 hrs). We chose this timepoint sufficiently long after the second pulse to ensure that there are no residual dynamical effects. We thank the reviewer for noting this.

      (3) As noted in the Sci Score document, various aspects of the resource reporter should be improved, such as including RRIDs, etc.

      We are sending out our plasmids to Addgene; versions for Python and Matlab are listed in our methods section.

      Reviewer #2 (Recommendations for the authors):

      I mostly have suggestions to improve the clarity of the presentation.

      (1) Not all symbols in the equations given in the main text are explained. This is rather annoying, because either you present them and explain what they are or you don't show them and refer to the supplements. For example, d_0 or c_o or \bar{b} or n or K are not explained.

      We have now more clearly presented the parameters in the main text and added signposts to the Methods section.

      (2) Overall, it is often not clear what data in the figures are redundant, although the authors referred to them in the text. For example, in Figure 2c, a curve for 24 hours is shown and referred back to Figure 1D. However, in Figure 1D there is no curve for 24 hours. Is the data from Supplementary Figure 1 H and K also in the main text?

      We thank the referee for pointing out these redundancies. We have now included the 24hr line in Figure 1D and are now only showing the unsmoothed data, also in the main text of the manuscript. To clarify supplemental figures, we have now removed S1H and S1K since all they showed was the unsmoothed version of the data. The remaining plots in Supplementary Figure 1 are normalized differently from what we show in Figure 1 to demonstrate our choice of normalization is not the reason for the observed optogenetic response.

    1. Author response:

      Reviewer #1 (Public Review):

      Lai and Doe address the integration of spatial information with temporal patterning and genes that specify cell fate. They identify the Forkhead transcription factor Fd4 as a lineage-restricted cell fate regulator that bridges transient spatial transcription factors to terminal selector genes in the developing Drosophila ventral nerve cord. The experimental evidence convincingly demonstrates that Fd4 is both necessary for late-born NB7-1 neurons, but also sufficient to transform other neural stem cell lineages toward the NB7-1 identity. This work addresses an important question that will be of interest to developmental neurobiologists: How can cell identities defined by initial transient developmental cues be maintained in the progeny cells, even if the molecular mechanism remains to be investigated? In addition, the study proposes a broader concept of lineage identity genes that could be utilized in other lineages and regions in the Drosophila nervous system and in other species.

      Thanks for the accurate summary and positive comments!

      While the spatial factors patterning the neuroepithelium to define the neuroblast lineages in the Drosophila ventral nerve cord are known, these factors are sometimes absent or not required during neurogenesis. In the current work, Lai and Doe identified Fd4 in the NB7-1 lineage that bridges this gap and explains how NB7-1 neurons are specified after Engrailed (En) and Vnd cease their expression. They show that Fd4 is transiently co-expressed with En and Vnd and is present in all nascent NB7-1 progenies. They further demonstrate that Fd4 is required for later-born NB7-1 progenies and sufficient for the induction of NB7-1 markers (Eve and Dbx) while repressing markers of other lineages when force-expressed in neural progenitors, e.g., in the NB5-6 lineage and in the NB7-3 lineage. They also demonstrate that, when Fd4 is ectopically expressed in NB7-3 and NB5-6 lineages, this leads to the ectopic generation of dorsal muscle-innervating neurons. The inclusion of functional validation using axon projections demonstrates that the transformed neurons acquire appropriate NB7-1 characteristics beyond just molecular markers. Quantitative analyses are thorough and well-presented for all experiments.

      Thanks for the positive comments!

      (1) While Fd4 is required and sufficient for several later-born NB7-1 progeny features, a comparison between early-born (Hb/Eve) and later-born (Run/Eve) appears missing for pan-progenitor gain of Fd4 (with sca-Gal4; Figure 4) and for the NB7-3 lineage (Figure 6). Having a quantification for both could make it clearer whether Fd4 preferentially induces later-born neurons or is sufficient for NB7-1 features without temporal restriction.

      We quantified the percentage of Hb+ and Runt+ cells among Eve+ cells with sca-gal4, and the results are shown in Figure 4-figure supplement 1. We found that the proportion of early-born cells is slightly reduced but the proportion of later-born cells remains similar. Interestingly, we also found a subset of Eve+ cells with a mixed fate (Hb+Runt+) but the reason remains unclear.

      (2) Fd4 and Fd5 are shown to be partially redundant, as Fd4 loss of function alone does not alter the number of Eve+ and Dbx+ neurons. This information is critical and should be included in Figure 3.

      Because every hemisegment in an fd4 single mutant is normal, we just added it as the following text: “In fd4 mutants, we observe no change in the number of Eve+ neurons or Dbx+ neurons (n=40 hemisegments).”

      (3) Several observations suggest that lineage identity maintenance involves both Fd4-dependent and Fd4-independent mechanisms. In particular, the fact that fd4-Gal4 reporter remains active in fd4/fd5 mutants even after Vnd and En disappear indicates that Fd4's own expression, a key feature of NB7-1 identity, is maintained independently of Fd4 protein. This raises questions about what proportion of lineage identity features require Fd4 versus other maintenance mechanisms, which deserves discussion.

      We agree, thanks for raising this point. We add the following text to the Discussion. “Interestingly, the fd4 fd5 mutant maintains expression of fd4:gal4, suggesting that the fd4/fd5 locus may have established a chromatin state that allows “permanent” expression in the absence of Vnd, En, and Fd4/Fd5 proteins.”

      (4) Similarly, while gain of Fd4 induces NB7-1 lineage markers and dorsal muscle innervation in NB5-6 and NB7-3 lineages, drivers for the two lineages remain active despite the loss of molecular markers, indicating some regulatory elements retain activity consistent with their original lineage identity. It is therefore important to understand the degree of functional conversion in the gain-of-function experiments. Sparse labeling of Fd4 overexpressing NB5-6 and NB7-3 progenies, as was done in Seroka and Doe (2019), would be an option.

      We agree it is interesting that the NB7-3 and NB5-6 drivers remain on following Fd4 misexpression. To explore this, we used sca-gal4 to overexpress Fd4 and observed that Lbe expression persisted while Eg was largely repressed (see Author response image 1 below). The results show that Lbe and Eg respond differently to Fd4. A non-mutually exclusive possibility is that the continued expression of lbe-Gal4 UAS-GFP or eg-Gal4 UAS-GFP may be due to the lengthy perdurance of both Gal4 and GFP.

      Author response image 1.

      (5) The less-penetrant induction of Dbx+ neurons in NB5-6 with Fd4-overexpression is interesting. It might be worth the authors discussing whether it is an Fd4 feature or an NB5-6 feature by examining Dbx+ neuron number in NB7-3 with Fd4-overexpression.

      In the NB7-3 lineages misexpressing Fd4, only 5 lineages generated Dbx+ cells (0.1±0.4, n=64 hemisegments), suggesting that the low penetrance of Dbx+ induction is an intrinsic feature of Fd4 rather than lineage context. We have added this information in the results section. 

      (6) It is logical to hypothesize that spatial factors specify early-born neurons directly, so only late-born neurons require Fd4, but it was not tested. The model would be strengthened by examining whether Fd4-Gal4-driven Vnd rescues the generation of later-born neurons in fd4/fd5 mutants.

      When we used the en-gal4 driver to express UAS-vnd in the fd4/fd5 mutant background, we found an average of 7.4±2.2 Eve+ cells per hemisegment (n=36), significantly higher than in the fd4/fd5 mutant alone (3.9±0.8 cells, n=52, p=2.6x10<sup>-11</sup>) (Figure 3J). In addition, 0.2±0.5 Eve+ cells were ectopic Hb+ (excluding U1/U2), indicating that Vnd-En integration is sufficient to generate both early-born and late-born Eve+ cells in the fd4/fd5 mutants. We have added the results to the text.

      (7) It is mentioned that Fd5 is not sufficient for the NB7-1 lineage identity. The observation is intriguing in how similar regulators serve distinct roles, but the data are not shown. The analysis in Figure 4 should be performed for Fd5 as supplemental information.

      Thanks for the suggestion. Because the results are exactly the same as the wild type, we don’t think it is necessary to provide additional images or analysis as supplemental information.

      Reviewer #2 (Public review):

      Via a detailed expression analysis, they find that Fd4 is selectively expressed in embryonic NB7-1 and newly born neurons within this lineage. They also undertake a comprehensive genetic analysis to provide evidence that fd4 is necessary and sufficient for the identity of NB7-1 progeny. 

      Thanks for the accurate summary!

      The analysis is both careful and rigorous, and the findings are of interest to developmental neurobiologists interested in molecular mechanisms underlying the generation of neuronal diversity. Great care was taken to make the figures clear and accessible. This work takes great advantage of years of painstaking descriptive work that has mapped embryonic neuroblast lineages in Drosophila. 

      Thanks for the positive comments!

      The argument that Fd4 is necessary for NB7-1 lineage identity is based on a Fd4/Fd5 double mutant. Loss of fd4 alone did not alter the number of NB7-1-derived Eve+ or Dbx+ neurons. The authors clearly demonstrate redundancy between fd4 and fd5, and the fact that the LOF analysis is based on a double mutant should be better woven through the text.

      The authors generated an Fd5 mutant. I assume that Fd5 single mutants do not display NB7-1 lineage defects, but this is not stated. The focus on Fd4 over Fd5 is based on its highly specific expression profile and the dramatic misexpression phenotypes. But the LOF analysis demonstrates redundancy, and the conclusions in the abstract and through the results should reflect the existence of Fd5 in the conclusions of this manuscript.

      We agree, and have added new text to clarify the single mutant phenotypes (there are none) and the double mutant phenotype (loss of NB7-1 molecular and morphological features). The following text is added to the manuscript: “Not surprisingly, we found that fd4 single mutants or fd5 single mutants had no phenotype (Eve+ neurons were all normal). Thus, to assess their roles, we generated a fd4 and fd5 double mutant. Because many Eve+ and Dbx+ cells are generated outside of NB7-1 lineage, it was also essential to identify the Eve+ or Dbx+ cells within NB7-1 lineage in wild type and fd4 mutant embryos. To achieve this, we replaced the open reading frame of fd4 with gal4 (called fd4-gal4) (see Methods); this stock simultaneously knocked out both fd4 and fd5 (called fd4/fd5 mutant hereafter) while specifically labeling the NB7-1 lineage. For the remainder of this paper we use the fd4/fd5 double mutant to assay for loss of function phenotypes.”

      It is notable that Fd4 overexpression can rewire motor circuits. This analysis adds another dimension to the changes in transcription factor expression and, importantly, demonstrates functional consequences. Could the authors test whether U4 and U5 motor axon targeting changes in the fd4/fd5 double mutant? To strengthen claims regarding the importance of fd4/fd5 for lineage identity, it would help to address terminal features of U motorneuron identity in the LOF condition.

      Thanks for raising this important point. We examined the axon targeting on body wall muscles in both wild type and in fd4/fd5 mutant background and added the results in Figure 3-figure supplement 2. We found that the axon targeting in the late-born neuron region (LL1) is significantly reduced, suggesting that the loss of late-born neurons in fd4/fd5 mutant leads to the absence of innervation of corresponding muscle targets.

      Reviewer #3 (Public review):

      The goal of the work is to establish the linkage between the spatial transcription factors (STFs) that function transiently to establish the identities of the individual NBs and the terminal selector genes (typically homeodomain genes) that appear in the newborn postmitotic neurons. How is the identity of the NB maintained and carried forward after the spatial genes have faded away? Focusing on a single neuroblast (NB 7-1), the authors present evidence that the fork-head transcription factor, fd4, provides a bridge linking the transient spatial cues that initially specified neuroblast identity with the terminal selector genes that establish and maintain the identity of the stem cell's progeny. 

      Thanks for the positive comments!

      The study is systematic, concise, and takes full advantage of 40+ years of work on the molecular players that establish neuronal identities in the Drosophila CNS. In the embryonic VNC, fd4 is expressed only in the NB 7-1 and its lineage. They show that Fd4 appears in the NB while the latter is still expressing the Spatial Transcription Factors and continues after the expression of the latter fades out. Fd4 is maintained through the early life of the neuronal progeny but then declines as the neurons turn on their terminal selector genes. Hence, fd4 expression is compatible with it being a bridging factor between the two sets of genes. 

      Thanks for the accurate summary!

      Experimental support for the "bridging" role of Fd4 comes from a set of loss-of-function and gain-of-function manipulations. The loss of function of Fd4, and the partially redundant gene Fd5, from lineage 7-1 does not affect the size of the lineage, but terminal markers of late-born neuronal phenotypes, like Eve and Dbx, are reduced or missing. By contrast, ectopic expression of fd4, but not fd5, results in ectopic expression of the terminal markers eve and Dbx throughout diverse VNC lineages.

      Thanks for the accurate summary!

      A detailed test of fd4's expression was then carried out using lineages 7-3 and 5-6, two well-characterized lineages in Drosophila. Lineage 7-3 is much smaller than 7-1 and continues to be so when subjected to fd4 misexpression. However, under the influence of ectopic Fd4 expression, the lineage 7-3 neurons lost their expected serotonin and corazonin expression and showed Eve expression as well as motoneuron phenotypes that partially mimic the U motoneurons of lineage 7-1.

      Thanks for the positive comments!

      Ectopic expression of Fd4 also produced changes in the 5-6 lineage. Expression of apterous, a feature of lineage 5-6, was suppressed, and expression of the 7-1 marker, Eve, was evident. Dbx expression was also evident in the transformed 5-6 lineages, but extremely restricted as compared to a normal 7-1 lineage. Considering the partial redundancy of fd4 and fd5, it would have been interesting to express both genes in the 5-6 lineage. The anatomical changes that are exhibited by motoneurons in response to Fd4 expression confirm that these cells do, indeed, show a shift in their cellular identity.

      We appreciate the positive comments. We agree double misexpression of Fd4 and Fd5 might give a stronger phenotype (as the reviewer says) but the lack of this experiment does not change the conclusions that Fd4 can promote NB7-1 molecular and morphological aspects at the expense of NB5-6 molecular markers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The study introduces an open-source, cost-effective method for automating the quantification of male social behaviors in Drosophila melanogaster. It combines machine-learning-based behavioral classifiers developed using JAABA (Janelia Automatic Animal Behavior Annotator) with inexpensive hardware constructed from off-the-shelf components. This approach addresses the limitations of existing methods, which often require expensive hardware and specialized setups. The authors demonstrate that their new "DANCE" classifiers accurately identify aggression (lunges) and courtship behaviors (wing extension, following, circling, attempted copulation, and copulation), closely matching manually annotated ground-truth data. Furthermore, DANCE classifiers outperform existing rule-based methods in accuracy. Finally, the study shows that DANCE classifiers perform as well when used with low-cost experimental hardware as with standard experimental setups across multiple paradigms, including RNAi knockdown of the neuropeptide Dsk and optogenetic silencing of dopaminergic neurons.

      The authors make creative use of existing resources and technology to develop an inexpensive, flexible, and robust experimental tool for the quantitative analysis of Drosophila behavior. A key strength of this work is the thorough benchmarking of both the behavioral classifiers and the experimental hardware against existing methods. In particular, the direct comparison of their low-cost experimental system with established systems across different experimental paradigms is compelling.

      While JAABA-based classifiers have been previously used to analyze aggression and courtship (Tao et al., J. Neurosci., 2024; Sten et al., Cell, 2023; Chiu et al., Cell, 2021; Ishii et al., eLife, 2020; Duistermars et al., Neuron, 2018), the demonstration that they work as well without expensive experimental hardware opens the door to more low-cost systems for quantitative behavior analysis.

      We thank the reviewer for their positive assessment and constructive suggestions. We have cited these additional JAABA studies in the Introduction. We clarified that several prior JAABA-based classifiers were developed using specialized machine-vision cameras or custom setups, and that in some cases the original code and classifiers were not made publicly available, which limits reproducibility and wider adoption. To address this, we explicitly note in the revised manuscript that DANCE was developed with accessibility in mind.

      Although the study provides a detailed evaluation of DANCE classifier performance, its conclusions would be strengthened by a more comprehensive analysis. The authors assess classifier accuracy using a bout-level comparison rather than a frame-level analysis, as employed in previous studies (Kabra et al., Nat Methods, 2013). They define a true positive as any instance where a DANCE-detected bout overlaps with a manually annotated ground-truth bout by at least one frame. This criterion may inflate true positive rates and underestimate false positives, particularly for longer-duration courtship behaviors. For example, a 15-second DANCE-classified wing extension bout that overlaps with ground truth for only one frame would still be considered a true positive. A frame-level analysis performance would help address this possibility.

      We thank the reviewer for raising this important point. Our original use of bout-level analysis followed existing literature (Duistermars et al., 2018; Ishii et al., 2020; Chiu et al., 2021; Tao et al., 2024; Hindmarsh Sten et al., 2025). While our lunge classifier already operates at the frame level, we have now performed additional frame-level evaluations for the duration-based courtship classifiers. These analyses revealed only minor differences in precision, recall, and F1 scores compared with the original bout-level approach (see new Figure 5—Figure Supplement 3). Details of this analysis are now included in the Materials and Methods.
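
      The difference between the two scoring schemes discussed above can be illustrated with a minimal sketch. It is not the authors' analysis code: the bout representation (inclusive start and end frame), the function names, and the example bouts are all assumptions made for illustration. It reproduces the reviewer's scenario of a 15-second predicted bout (450 frames at 30 fps) that touches a short ground-truth bout for a single frame.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a bout is (start_frame, end_frame), inclusive.
Bout = Tuple[int, int]


def bouts_to_frames(bouts: List[Bout]) -> Set[int]:
    """Expand inclusive (start, end) bouts into a set of frame indices."""
    frames: Set[int] = set()
    for start, end in bouts:
        frames.update(range(start, end + 1))
    return frames


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def overlaps(a: Bout, b: Bout) -> bool:
    """True if two inclusive bouts share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]


def bout_level_scores(pred: List[Bout], truth: List[Bout]):
    """A predicted bout is a true positive if it overlaps any truth bout."""
    tp = sum(any(overlaps(p, t) for t in truth) for p in pred)
    fp = len(pred) - tp
    fn = sum(not any(overlaps(t, p) for p in pred) for t in truth)
    return prf(tp, fp, fn)


def frame_level_scores(pred: List[Bout], truth: List[Bout]):
    """Every frame is scored individually."""
    p, t = bouts_to_frames(pred), bouts_to_frames(truth)
    return prf(len(p & t), len(p - t), len(t - p))


# Illustrative bouts only: 450 predicted frames, 10 true frames, 1 shared.
predicted = [(0, 449)]
ground_truth = [(449, 458)]

print("bout level :", bout_level_scores(predicted, ground_truth))
print("frame level:", frame_level_scores(predicted, ground_truth))
```

      In this constructed case the bout-level scheme reports perfect precision and recall, while the frame-level scheme reports a precision of 1/450 and a recall of 1/10. It is a worst-case illustration of why the two schemes can diverge and says nothing about how often such cases occur in the actual datasets.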

      In summary, this work provides a practical and accessible approach to quantifying Drosophila behavior, reducing the economic barriers to the study of the neural and molecular mechanisms underlying social behavior.

      We thank the reviewer for their encouraging comments and for recognizing the accessibility and practical value of our approach. We appreciate the constructive suggestions, which have helped strengthen the manuscript.

      Reviewer #2 (Public review):

      Summary:

      This manuscript addresses the development of a low-cost behavioural setup and standardised open-source high-performing classifiers for aggression and courtship behaviour. It does so by using readily available laboratory equipment and previously developed software packages. By comparing the performance of the setup and the classifiers to previously developed ones, this study shows the classifier's overperformance and the reliability of the low-cost setup in recapitulating previously described effects of different manipulations on aggression and courtship.

      Strengths:

      The newly developed classifiers for lunges, wing extension, attempted copulation, copulation, following, and circling, perform better than available previously developed ones. The behavioural setup developed is low cost and reliably allows analysis of both aggression and courtship behaviour, validated through social experience manipulation (social isolation), gene knockdown (Dsk in Dilp2 neurons) and neuronal inactivation (dopaminergic neurons) known to affect courtship and aggression.

      We thank the reviewer for the clear summary of our work and for highlighting its strengths. We appreciate these positive comments and suggestions, which have helped improve the clarity of the manuscript.

      Weaknesses:

      Aggression encompasses multiple defined behaviours, yet only lunges were analysed. Moreover, the CADABRA software to which DANCE was compared analyses further aggression behaviours, making their comparisons incomplete. In addition, though DANCE performs better than CADABRA and Divider in classifying lunges in the behavioural setup tested, it did not yield very high recall and F1 scores.

      We thank the reviewer for raising this important point. We focused on lunges because they are widely used as a standard proxy for male aggression across multiple laboratories (Agrawal et al., 2020; Asahina et al., 2014; Chiu et al., 2021; Chowdhury et al., 2021; Dierick et al., 2007; Hoyer et al., 2008; Jung et al., 2020; Nilsen et al., 2004; Watanabe et al., 2017). As noted in the Discussion, our study also provides a template for the future development of additional aggression classifiers (fencing, wing flick, tussle, chase, female headbutt) and courtship classifiers (tapping, licking, rejection), which can be trained and shared through the same DANCE framework. Developing and validating these was beyond the scope of the present work.

      To address the concern regarding precision, recall, and F1 scores, we performed additional analyses across all training videos and compiled these results in the new Figure 2—Figure Supplement 2. Our earlier lunge classifier had performance metrics obtained after training on a total of 11 videos. Our analysis shows performance metrics for classifiers trained on four independent datasets (Videos 8–11). We found that the classifier trained on nine videos provided the best balance of precision, recall, and F1 (78.73%, 73.07%, and 75.79%, respectively), which was slightly better than the earlier classifier. We therefore updated the main figure, text, and Materials and Methods to use this version and uploaded the corresponding classifier and training details to the GitHub repository.
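
      As a quick consistency check on the reported figures, F1 is the harmonic mean of precision and recall. The short sketch below uses only the standard definition and the two percentages quoted above; the function name is an illustrative choice, not part of the DANCE code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Percentages quoted in the response for the nine-video lunge classifier.
reported_precision = 78.73
reported_recall = 73.07

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # prints 75.79
```

      The result agrees with the reported F1 of 75.79%.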

      DANCE is of limited use for neuronal circuit-level enquiries, since mechanisms for intensity and temporally controlled optogenetic manipulations, which are nowadays possible with open-source software and low-cost hardware, were not embedded in its development.

      We thank the reviewer for this valuable point. The primary aim of DANCE is to provide an accessible, modular, and low-cost behavioural recording and analysis platform. It was designed so that users can readily integrate additional components such as optogenetic control when needed. As a proof of concept, we implemented optogenetic silencing of dopaminergic neurons using the DANCE hardware and confirmed that this manipulation increased aggression (Figure 7R). 

      To facilitate adoption, we now provide schematic diagrams, LED control code, and instructions on our GitHub page and setup photographs in the manuscript (see new Figure 7—Figure Supplement 1). The released code allows programmable timing and intensity control, enabling users to reproduce temporally precise optogenetic protocols or extend the system for other stimulation paradigms.

      Reviewer #3 (Public review):

      The preprint by Yadav et al. describes a new setup to quantify a number of aggression and mating behaviors in Drosophila melanogaster. The investigation of these behaviors requires the analysis of a large number of videos to identify each kind of behavior displayed by a fly. Several approaches to automatize this process have been published before, but each of them has its limitations. The authors set out to develop a new setup that includes very low-cost, easy-to-acquire hardware and open-source machine-learning classifiers to identify and quantify the behavior.

      Strengths:

      (1) The study demonstrates that their cheap, simple, and easy-to-obtain hardware works just as well as custom-made, specialized hardware for analyzing aggression and mating behavior. This enables the setup to be used in a wide range of settings, from research with limited resources to classroom teaching.

      (2) The authors used previously published software to train new classifiers for detecting a range of behaviors related to aggression and mating and to make them freely available. The classifiers are very positively benchmarked against a manually acquired ground truth as well as existing algorithms.

      (3) The study demonstrates the applicability of the setup (hardware and classifiers) to common methods in the field by confirming a number of expected phenotypes with their setup.

      We thank the reviewer for the positive assessment of our work and for highlighting its strengths. We appreciate these encouraging comments and suggestions, which have helped improve the clarity and presentation of the manuscript.

      Weaknesses:

      (1) When measuring the performance of the duration-based classifiers, the authors count any bout of behavior as a true positive if it overlaps with a ground-truth positive for only 1 frame, even though the minimal duration of a bout is 10 frames and most bouts are much longer. That way, true positives could contain cases that are almost totally wrong as long as there was an overlap of a single frame. For the mating behaviors that are classified in ongoing bouts, I think performance should be evaluated based on the % of correctly classified frames, not bouts.

      We thank the reviewer for raising this concern. In response to this point, and to Reviewer #1’s similar comment, we performed a frame-level evaluation of all duration-based courtship classifiers. The analysis revealed only minor differences compared with the original bout-level metrics (see new Figure 5—Figure Supplement 3), confirming the robustness of our classifiers. We have also added a description of this analysis in the Materials and Methods section.

      (2) In the methods part, only one of the pre-existing algorithms (MateBook) is described. Given that the comparison with those algorithms is so central a part of the manuscript, each of them should be briefly explained and the settings used in this study should be described.

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we expanded the Materials and Methods to include concise descriptions and parameter settings for all pre-existing algorithms used for comparison. This includes dedicated subsections for CADABRA and the Divider assay, with explicit reference to their rule-based or geometric features. For MateBook, we specified the persistence filters used and the adjustments made for fair benchmarking. These changes ensure transparency and reproducibility.

      Taken together, this work can greatly facilitate research on aggression and mating in Drosophila. The combination of low-cost, off-the-shelf hardware and open-source, robust software enables researchers with very little funding or technical expertise to contribute to the scientific process and also allows large-scale experiments, for example in classroom teaching with many students, or for systematic screenings.

      We thank the reviewer for the encouraging comments and for recognizing the accessibility and broad applicability of DANCE. We believe these revisions have further strengthened the manuscript.

      Reviewer #1 (Recommendations for the authors):

      The following comments highlight areas where additional context, clarification, or further analysis could strengthen the manuscript. I hope these suggestions will be useful in refining your work.

      (1) Lines 71-73: The authors state that Ctrax "leads to frequent identity switches among tracked flies, which is not the case while using FlyTracker." However, Ctrax was specifically designed to minimize identity errors, and Kabra et al. (2013) reported a low frequency of such errors, approximately one per five fly-hours in 10-fly videos. In contrast, Caltech FlyTracker does not correct identity errors automatically, requiring manual corrections, as noted in the Methods section of this study. If this is not an oversight, please provide further context to clarify this distinction.

      We thank the reviewer for raising this clarification. As reported by Bentzur et al. (2021), when groups of flies were tracked simultaneously, Ctrax often generated multiple identities for the same individual, sometimes producing more trajectories than the actual number of flies. To prevent ambiguity, we revised the text to read: “While both Ctrax and FlyTracker (Eyjolfsdottir et al., 2014) may produce identity switches, when groups of flies were tracked simultaneously, Ctrax led to inaccuracies that required manual correction using specialized algorithms such as FixTrax (Bentzur et al., 2021).” We also quantified FlyTracker identity-switch rates in our datasets and report them in new Supplementary File 5, confirming that such events were rare (< 2% of tracked intervals). We believe this updated version provides the necessary context and ensures accuracy in describing each tracker’s limitations.

      (2) Line 85: Providing additional context on how this study builds on previous work using JAABA-based classifiers for fly social behavior and comparing these classifiers to rule-based methods would more accurately situate it within the field. The authors state that "recently, a few JAABA-based classifiers have been developed for measuring aggression and courtship" and cite four related studies. However, this statement seems to underrepresent the use of JAABA-based classifiers for quantifying fly social behavior, which has become common in the field. Several additional studies (as noted in the public review) have developed JAABA-based classifiers for scoring aggression or courtship. Furthermore, other studies have compared the performance of JAABA-based classifiers with rule-based classifiers like CADABRA (e.g., Chowdhury et al., Comm Biology 2021; Leng et al., PlosOne 2020; Kabra et al., Nat Methods 2013). Mentioning the similar findings in those studies and your own helps strengthen the conclusion that machine-learning-based classifiers outperform rule-based classifiers in several experimental contexts.

      We thank the reviewer for this helpful suggestion. We have revised the Introduction to include additional references to studies that applied JAABA-based classifiers for aggression and courtship and made textual edits to reflect this. We further noted that, unlike several previous studies, all DANCE classifiers and analysis code are publicly available.

      Reviewer #2 (Recommendations for the authors):

      (1) Suggestions for improved or additional experiments, data or analyses: As mentioned in the description of the effect of optogenetic inactivation of dopaminergic neurons, in the conclusion and also reported in the literature, there are other important identified aggression behaviours, such as fencing, wing flick, tussle, and chase. Similarly, for courtship, tapping and licking have also been defined. This study, as opposed to proposed future studies, would benefit from creating open-source classifiers for these established behaviours, which are important for the analysis of aggression and courtship.

      We thank the reviewer for this valuable suggestion. As clarified in the Discussion, this manuscript intentionally focuses on six core, well-validated aggression and courtship behaviors to demonstrate DANCE’s modularity and reproducibility. Developing additional classifiers such as fencing, wing flick, tussle, chase, tapping, and licking would require extensive annotation and validation beyond the present scope. To address this point, we explicitly note in the revised text that the DANCE pipeline is readily extendable, allowing the community to build new classifiers within the same framework.

      In terms of observer bias assessment for ground-truthing in courtship, this was only presented for circling and it would be beneficial to have encompassed all behaviours analysed.

      We thank the reviewer for this suggestion. Observer-bias comparisons for all six classifiers are presented in Figure 2—Figure Supplement 1 (panels A–F). We clarified in the Results that annotations from two independent evaluators were compared for all classifiers, with no significant differences observed, confirming their robustness.

      Finally, intensity and temporal optogenetic control are important for neuronal circuit analysis of underlying behaviour. The authors could embed this aspect in DANCE by integrating control of the green light LED strip used in this study using, for example, the open-source visual reactive programming software Bonsai (Lopes et al., 2015) and open-source electronics platform Arduino. This is an important and valuable addition in line with maintaining low cost.

      We thank the reviewer for this valuable suggestion. DANCE was designed to be modular, allowing integration of temporal optogenetic control. To support immediate adoption, we now provide Arduino LED control code, setup schematics, and photographs (new Figure 7—Figure Supplement 1) along with step-by-step instructions on our GitHub page. We also note that Bonsai and Arduino frameworks are compatible with DANCE, enabling future extensions for closed-loop or behavior-triggered stimulation.

      (2) Minor corrections to the text and figures:

      Figure Supplement 1 refers only to Figure 2, yet panels D-F refer to the behaviour circling in courtship and therefore should be assigned to the respective figure.

      Thanks, we have corrected this.

      In lines 315-316, the cumbersome task of fluon coating for aggression assays seems to be ubiquitous across assays which is not the case, and therefore the sentence should include the word 'some'.

      Thanks, we have edited this.

      The cost of the phone and/or tablet should be included in the DANCE setup costs, as presumably these devices will be dedicated to the behavioural studies, for consistency purposes.

      We thank the reviewer for this comment. We intentionally did not include smartphones or tablets in the setup cost because, in our experiments, these devices were not dedicated exclusively to DANCE but were repurposed from routine personal use. Our aim was to leverage readily available consumer electronics so that their cost does not become a barrier to adoption. We confirmed that commonly available Android phones capable of 30 fps at 1080p in H.264 format, as well as tablets or phones running a simple white-screen light app, are sufficient for reliable behavior classification and illumination. Since these devices can be returned to regular use after recordings, including their cost in the setup would not accurately reflect the intended accessibility of DANCE. For consistency, we now clarify in the Materials and Methods that such devices should be placed in airplane mode during recordings.

      Reviewer #3 (Recommendations for the authors):

      (1) For my taste, the authors put too much emphasis on the point that their method outperforms existing methods. I understand the value in comparing to published methods and it is of course fully justified to state the advantages of the new method. But the whole preprint is set up as a competition with the old algorithms, and the conclusion that the new classifier is better is repeated in each figure caption and after each paragraph of the results. This competitive mindset also extends to the selection of which results are presented as main figures and which as supplements - all cases in which the previous methods actually perform well are only presented in the supplement. I think this is simply unnecessary as the authors' results speak for themselves, and do not need the continuous competitive comparison.

      We thank the reviewer for this thoughtful suggestion. Our intention was to benchmark DANCE rigorously against existing methods, not to frame the study competitively. We agree that repeated emphasis on relative performance was unnecessary. In the revised version, we streamlined figure captions and text throughout the manuscript to balance comparisons and removed redundant phrasing. Instances where other methods performed well are now presented with equal clarity to maintain a neutral and informative tone.

      (2) When describing the DANCE hardware, as a reader I would find it interesting to also read about potential issues that the authors encountered. For example, how difficult is it to handle the materials without breaking or deforming them, which could affect the behavioral assays? How critical is it to use specific blister packs - the availability of which will likely vary strongly between countries? Did the authors try different sizes, and products? Such information, even as a supplement, could be very helpful for the widespread use of the hardware.

      We thank the reviewer for this important point. To address this, we conducted additional tests comparing DANCE arenas of different diameters (new Figure 7—Figure Supplement 3A–C and new Figure 7—Figure Supplement 4A–L). We also consulted colleagues in multiple countries and verified that the blister packs used in our assays are readily available. The Materials and Methods now include practical handling notes: blister foils can be reused ~30–40 times for aggression assays and ~10–15 times for courtship assays before deformation. We also describe how to prevent agar surface damage during assembly and how to wash and dry the arenas for optimal reusability.

      (3) I find the arrows pointing to several videos in a number of figures rather distracting and redundant, and suggest omitting them.

      Thanks, we have omitted these arrows from all relevant figures and clarified the figure legends to enhance readability.

      (4) P8, line 169 ff: this is a very long sentence that should be separated into several sentences.

      We have rewritten this as follows: “DANCE scores remained comparable to ground-truth scores across all categories, whereas CADABRA and Divider underestimated the lunge counts (Figure 2B–E). Correlation analysis revealed a strong relationship between DANCE and ground-truth scores (Figure 2F, Supplementary File 2). In comparison, CADABRA and the Divider assay classifier showed a weaker correlation (Figure 2G–H, Supplementary File 2).”

      (5) P10, line 216: please explain, here and in the methods, how these behavioral indices are calculated. I did not find this information anywhere in the paper.

      We thank the reviewer for pointing this out. We now define the behavioral index explicitly in Materials and Methods: “For each assay, a behavioral index was calculated as the proportion of frames in which the male engaged in the specified behavior. This was obtained by dividing the total number of frames annotated for that behavior by the total number of frames in the recording.”
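
      The definition quoted above translates directly into a few lines of code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes annotations are available as one boolean per frame, and the function name and example numbers are hypothetical.

```python
from typing import Sequence


def behavioral_index(frame_flags: Sequence[bool]) -> float:
    """Fraction of frames annotated as positive for one behavior.

    frame_flags holds one entry per recorded frame (an assumed input
    format); True means the behavior was annotated in that frame.
    """
    total_frames = len(frame_flags)
    if total_frames == 0:
        raise ValueError("recording contains no frames")
    behavior_frames = sum(1 for flag in frame_flags if flag)
    return behavior_frames / total_frames


# Hypothetical recording: 9000 frames, 2250 of them annotated.
flags = [True] * 2250 + [False] * 6750
print(behavioral_index(flags))  # prints 0.25
```

      Because the index is a proportion of frames, it is bounded between 0 and 1 and does not depend on frame rate as long as the numerator and denominator come from the same recording.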

      (6) P11, line 253: I don't understand the modifications to MateBook regarding attempted copulations, neither in the results nor the methods section. I would ask the authors to explain more explicitly what was done.

      We thank the reviewer for this helpful suggestion. We have rewritten several parts of the Materials and Methods to clarify these details and streamline the text. To train the attempted copulation classifier, we combined datasets from assays with mated and decapitated virgin females, using manual annotations as ground truth. We also adapted MateBook’s persistence filters (Ribeiro et al., 2018) and defined thresholds explicitly: mounting lasting >45 s (>1350 frames at 30 fps) was defined as copulation, whereas abdominal curling without mounting, or mounting lasting 0.33–45 s, was defined as attempted copulation.
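
      The duration thresholds stated above can be expressed as a small rule. This is only a sketch of the stated thresholds, not MateBook's persistence filter or the DANCE classifier. The function name, the label strings, and the handling of the boundaries (exactly 45 s counted as attempted copulation, anything under 0.33 s left unlabeled) are assumptions. The sketch covers mounting events only, so abdominal curling without mounting, which the response also counts as attempted copulation, is outside its scope.

```python
from typing import Optional

FPS = 30                   # frame rate stated in the response
COPULATION_MIN_S = 45.0    # mounting longer than this counts as copulation
ATTEMPT_MIN_S = 0.33       # shortest mounting counted as an attempt


def classify_mounting(duration_frames: int, fps: int = FPS) -> Optional[str]:
    """Label a mounting event from its duration in frames.

    Returns "copulation", "attempted copulation", or None when the event
    is shorter than the minimum duration (boundary handling is assumed).
    """
    duration_s = duration_frames / fps
    if duration_s > COPULATION_MIN_S:
        return "copulation"
    if duration_s >= ATTEMPT_MIN_S:
        return "attempted copulation"
    return None


for frames in (9, 10, 1350, 1351):
    print(frames, classify_mounting(frames))
```

      At 30 fps the 45 s threshold corresponds to 1350 frames, matching the figure in the response, and the 0.33 s lower bound corresponds to roughly 10 frames.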

      (7) Figure 7F: this is the only case with a significant difference between the two setups. What explanations do the authors have for the discrepancy?

      We thank the reviewer for raising this point. After repeating the experiments, we no longer found a significant difference between the setups. Figure 7 and its legend have been updated to reflect these results.

      (8) Figure 2 - Supplement 1: I do not understand why the boxes for Observer 1 have different colors in different figures. Does this have a meaning?

      Thanks for pointing this out. The color differences had no intended meaning, and we have corrected the figure for consistency across panels.

      (9) P22, line 517ff: It would be interesting to know how frequently identity switches occurred. For large-scale, automatic behavioral screenings that step could be a crucial bottleneck.

      We thank the reviewer for this valuable suggestion. We analyzed identity switches using the FlyTracker “Visualizer” package, which flags frames with possible overlaps or jumps. Flagged intervals were manually verified, and we report these data in new Supplementary File 5. Identity switch rates were very low: 0.66% for high-resolution recordings and 1.9% for smartphone DANCE videos in the most challenging decapitated-virgin dataset. These findings demonstrate robust tracking performance under both setups.

    1. Practicing decolonial allyship within a White settler queer family, also means deepening an understanding of the way colonial narratives may be embedded within “social justice,” “intersectional,” or “critical literacy” discourses and practices despite their claim to do the opposite. For example, it has been important to Cindy that the story her daughter hears (and tells) about Indigenous people in Canada, is not only a story of oppression but also of resistance and resilience.

      This passage really made me think. A lot of the time, when we learn about Indigenous people, we often hear about how they were oppressed, and it focuses on their suffering, but it hardly ever mentions their strength to stand tall despite the oppression they face every single day. I think it's important for both facts to coexist.

      As the text mentions, things like social justice aren't always upheld. This is because, inherently, the structures in Western society benefit White colonizers the most. This raises the question of, "What can we do to change the structure that oppresses Indigenous people and people of colour?"

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1

      1. First, the authors have not convincingly shown that skin cells, or more specifically skin ECs, are a major source of circulating G-CSF in the psoriasis model as stated in the title and abstract. The data in Figure 4 show selective upregulation of Csf3 gene in skin ECs and their ability to secrete G-CSF upon IMQ treatment in vitro. However, the provided data do not address to what degree the skin EC-derived G-CSF contributes to the elevated level of circulating G-CSF. Additional experiments to selectively deplete G-CSF in skin ECs, or at least in skin cells of the affected site, are warranted to support the authors' claim. Does intradermal injection of G-CSF neutralizing antibody into the psoriatic skin reduce circulating levels of G-CSF?

      Author's response:

      Thank you for the comment. We agree with Reviewer #1 that it is important to directly block G-CSF in the skin via intradermal injection and then measure the G-CSF level in the serum. Therefore, we will perform intradermal injection of IgG isotype control or anti-G-CSF antibody in IMQ-induced psoriatic mice.

      Another concern is insufficient demonstration of G-CSF-mediated emergency granulopoiesis in the psoriasis model. All data in Figure 5 were obtained from experiments with only n=3, and adding more replicates, in particular to those in Figure 5B, which show quite some variation in MPP numbers, is recommended. The relatively small reduction of BM granulocyte numbers (Figure 5C) compared to greater depletion of circulating granulocytes (Figure S5A) raises the possibility that it is the mobilization effect rather than granulopoiesis-stimulating effect that skin-derived G-CSF exerts to promote supply of circulating neutrophils that eventually infiltrate into the affected skin. This could also explain the negligible effect of IL-1 blockade (Figure S4), which selectively shut off myelopoiesis-stimulating effect of IL-1 (Pietras et al. Nat Cell Biol 2016, PMID: 27111842). Are the HSPCs in the psoriasis model more cycling? Do they show myeloid-skewed differentiation when cultured ex vivo or upon transplantation?

      Author's response: Thank you for these critical comments. We agree to do the following experiments to address them:

      1) We will add more replicates to the HSPC quantification in Figure 5, especially for the MPPs.

      2) We will assess the cycling status of HSPCs by flow cytometric analysis of Ki67 and propidium iodide to distinguish the G0, G1 and G2/M cell cycle phases.

      3) To test myeloid-skewed differentiation, Lin- c-Kit+ Sca-1+ cells containing HSPCs will be isolated from bone marrow of Vas/IMQ-treated mice and transplanted into lethally irradiated syngeneic mice.
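
      The Ki67/propidium iodide readout in point 2 is commonly gated on two axes. The following is a minimal sketch, assuming the conventional scheme (Ki67-negative cells with 2N DNA content as G0, Ki67-positive cells with 2N DNA content as G1, and cells with more than 2N DNA content as S/G2/M). The function name and the numeric cutoffs are hypothetical placeholders and are not values from the study; real gates would be set per experiment from controls.

```python
def cell_cycle_phase(ki67: float, dna: float,
                     ki67_cutoff: float = 1000.0,
                     dna_2n_upper: float = 60000.0) -> str:
    """Assign a cell-cycle phase from Ki67 and propidium iodide (DNA) intensities.

    Both cutoffs are hypothetical placeholders for illustration only.
    """
    # DNA content above the 2N gate means the cell has started replicating.
    if dna > dna_2n_upper:
        return "S/G2/M"
    # Within the 2N gate, Ki67 separates cycling (G1) from quiescent (G0) cells.
    return "G1" if ki67 >= ki67_cutoff else "G0"


# Toy events as (Ki67 intensity, DNA intensity); made-up numbers.
events = [(200.0, 50000.0), (5000.0, 50000.0), (5000.0, 90000.0)]
phases = [cell_cycle_phase(k, d) for k, d in events]
print(phases)
```

      Counting the resulting labels per sample would give the G0, G1 and S/G2/M fractions to compare between Vas- and IMQ-treated mice.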

      The authors' claim that skin-derived G-CSF "induces" neutrophil infiltration warrants further clarification. An alternative explanation is that the upregulated neutrophil-attracting chemokines (Figure S1D) could induce infiltration, whereas G-CSF increases the number of neutrophils circulating in the vessels near the psoriatic skin. This notion seems supported elsewhere (Moos et al. J Invest Dermatol. 2019, PMID: 30684554). Can the infiltration be inhibited by systemically injecting a neutralizing antibody against their receptor, CXCR2?

      Author's response: The manuscript focuses on the function of skin-derived G-CSF as a long-distance signal for emergency granulopoiesis in the bone marrow in psoriasis, not on its chemoattractant properties. The sentence in question (lines 28-30) is "We found that upon psoriasis induction, skin-resident endothelial cells are activated to produce G-CSF which activates emergency granulopoiesis in bone marrow and induces cutaneous infiltration and accumulation of neutrophil that are functionally inflammatory." In agreement with point #2 from Reviewer#2, the neutrophil recruitment factors CXCL1, CXCL2, and CXCL5 were upregulated in psoriatic skin (Figure S1D), suggesting CXCL-mediated neutrophil recruitment. We will therefore change the sentence to "We found that upon psoriasis induction, skin-resident endothelial cells are activated to produce G-CSF which activates emergency granulopoiesis in bone marrow, leading to cutaneous accumulation of neutrophil that are functionally inflammatory." The revised sentence omits the proposal that G-CSF directly dictates neutrophil mobilization to the skin, which is not the key message of the study. We therefore consider the CXCR2 (CXCL receptor) blockade experiment better suited to future studies.

      It remains unclear how skin-derived G-CSF accumulates pathogenic neutrophils. The authors state "pathogenic granulopoiesis," but are the circulating neutrophils in the psoriatic mice already "pathogenic" or do they acquire pathogenic phenotype after cutaneous infiltration? Additional RNA-seq to compare circulating and infiltrated neutrophils would answer this question.

      Author's response: We appreciate this valuable comment. We will perform RNA-seq on peripheral blood-circulating neutrophils (CD45+ CD11b+ Ly6G+ Ly6Cmid) versus skin-infiltrating neutrophils from both Vas- and IMQ-treated mice.

      In addition, how the accumulated pathogenic neutrophils exacerbate the psoriatic changes remains obscure. Although the authors have attempted to correlate Il17a gene expression in infiltrated neutrophils with psoriatic skin changes, the data do not address to what degree it contributes to cutaneous IL-17A protein levels. The data that cutaneous neutrophil depletion leads to subtle decrease in skin IL-17A expression (Figure 2H) rather supports alternative possibilities. For instance, as indicated elsewhere, IL-17A cutaneous tone could be enhanced by neutrophil-mediated augmentation of Th17 or gamma/delta T cell function (Lambert et al. J Invest Dermatol. 2019, PMID: 30528823). Does neutrophil depletion or G-CSF neutralization alter cell numbers or function of cutaneous Th17 and gamma/delta T cells?

      Author's response: Thank you for this insightful comment. To better understand the relative contribution of neutrophils to the cutaneous IL-17A tone in psoriatic skin, we will perform flow cytometric analysis of Th17 and gamma/delta T cells, which are widely known as the major sources of IL-17 in the psoriatic skin of IMQ-induced mice, following injection of isotype-matched or anti-Ly6G antibody.

      Finally, as the above conclusions rely solely on the IMQ-induced acute psoriasis model, it would be informative if they could be derived from another psoriasis model. IMQ is known to induce unintended systemic inflammation due to grooming-associated ingestion (Gangwar et al. J Invest Dermatol. 2022, PMID: 34953514), and "pathological crosstalk between skin and BM in psoriatic inflammation" could be strengthened by an intradermal injection model.

      Author's response: We thank the reviewer for raising this important point. Regarding systemic inflammation in psoriasis, the above-cited study reported increased IFN-β expression in the intestines of IMQ-ingesting animals (Grine L et al. Sci Rep. 2016, PMID: 26818707 in Gangwar et al. J Invest Dermatol. 2022, PMID: 34953514). We examined several pro-inflammatory cytokines, including IFN-β, IFN-γ, and IL-6, and in contrast found no systemic increase in any of them; the only change was a downregulation of IFN-γ (Explanation Figure 1). We therefore see no evidence of grooming-associated ingestion.

      We also examined Csf3 expression across several anatomically distinct tissues, which showed selective upregulation in the skin (Figure 4C), suggesting a skin-restricted perturbation. In addition, one study showed that IMQ ingestion neither altered the number of gut injury-associated CXCR3+ macrophages nor aggravated skin inflammation (Pinget et al. Cell Reports. 2022, PMID: 35977500). Together, these findings support that the topical cutaneous IMQ application used in our study elicits local rather than systemic inflammation.

      We nevertheless recognize that testing an alternative psoriasis model, such as intradermal injection of IL-23 (Chan et al. J Exp Med. 2006, PMID: 17074928), would strengthen the case for a skin-local insult, and this should be tested in the future.

      Minor comments

      Figure 1E shows multiple elongated Ly6G+ structures in d0-2 control and d0 IMQ skins that do not appear to be neutrophils.

      Author's response: We thank Reviewer#1 for pointing out this issue. As the reviewer notes, the elongated structures detected by intravital microscopy are not neutrophils but autofluorescence from the skin bulge regions (Wun et al. J Invest Dermatol. 2005, PMID: 15816847). We have excluded these nonspecific signals from the transformation and quantification (Figure 1F, S1G, and S1H). To be more precise about our methods, we will also add an explanatory sentence to the Materials and Methods section: "Of note, the fluorescent signal with elongated structures resembling hair bulge were autofluorescence and thus removed from further analysis."

      In Figure 2C, the bottom GSEA seems to be showing type II IFN response, not type I IFN, according to the text.

      Author's response: Thank you for the comment. We will correct this error.

      Author's response: We appreciate that Reviewer#1 brought up this point. We examined the kinetics of bone marrow cellularity and GMPs across 4 days of psoriasis induction in mice. The bone marrow cell number decreased over that span, reaching its lowest count at 2 days. Consistent with the bone marrow cellularity, the GMP number also fell by about one-third in the first 2 days of psoriasis. These kinetics are consistent with a previous report showing a rapid reduction of GMPs in the bone marrow within 2 days of emergency granulopoiesis driven by systemic G-CSF administration (Hirai et al. Nat. Immunol. 2006, PMID: 16751774). From 2 days to 4 days, the GMP number rapidly increased to slightly above the basal number (Explanation Figure 2). This temporally coordinated expansion suggests a significant supply of GMPs from differentiating upstream myeloid progenitors (Figure 3B).

      When psoriatic mice with elevated G-CSF were injected with anti-G-CSF or IgG-isotype antibody, the bone marrow cellularity and GMP numbers at 4 days were comparable between the two groups (Explanation Figure 3). First, because psoriasis reduced bone marrow cellularity (Explanation Figure 2), the unchanged number after anti-G-CSF injection indicates that administration of 10 µg/day for 4 days does not significantly affect mobilization of psoriatic bone marrow cells. Second, the similar GMP numbers at 4 days of psoriasis plausibly reflect a snapshot taken when GMPs are already in the numerical recovery period (Explanation Figure 2). Importantly, anti-G-CSF injection into psoriatic mice reduced granulocytes in the bone marrow, peripheral blood, and skin, which points to G-CSF as a key mediator of psoriasis-driven emergency granulopoiesis and makes ineffective anti-G-CSF treatment an unlikely explanation.

      Taken together, these data suggest that G-CSF-mediated emergency granulopoiesis occurs in IMQ-induced psoriasis. We will put these data into a revised Figure.

      In Figure 6B, in which cluster of human skin cells would IL-17A expression be enriched?

      Author's response: Thank you for this important point. IL-17A expression is found in the T-cell cluster (Explanation Figure 4). We also expected to see an IL-17A contribution from other cell subsets, in particular neutrophils. However, because neutrophils are fragile and their sequencing reads are therefore technically difficult to capture, this dataset (GSE173706) does not contain neutrophils; its myeloid subset comprises monocytes, macrophages, and dendritic cells (Explanation Figure 5). The potential contribution of neutrophil-derived IL-17A in human psoriasis therefore remains an open question (Reich et al. Exp. Dermatol. 2015, PMID: 25828362).

      Reviewer#2

      1. Interpretation of neutrophil transcriptomic changes (Figure 2)

      The RNA-seq analysis reveals substantial downregulation of several canonical pro-inflammatory pathways in neutrophils from psoriatic skin, including IL-6, IL-1, and type II interferon signaling. The authors should discuss the functional relevance of this unexpected transcriptional repression. For example, does this indicate a shift toward specialized effector functions rather than classical cytokine responsiveness? More importantly, the most striking transcriptional change is the upregulation of NADPH oxidase-related genes (e.g., Nox1, Nox3, Nox4, Enox2). This suggests an oxidative stress-driven pathogenic mechanism, potentially more relevant than IL-17A production. Yet this aspect is not explored in the manuscript. Assessing ROS levels or oxidative neutrophil effector functions in this model would considerably strengthen the mechanistic link. Conversely, although IL-17A is upregulated in neutrophils, neutrophil depletion reduces total Il17a expression in skin only partially. This indicates that neutrophils are unlikely to be the dominant IL-17A source in the lesion. The authors' focus on neutrophil-derived IL-17A therefore seems overstated. A more rigorous assessment (e.g., conditional deletion of Il17a specifically in neutrophils) would be required to establish its true contribution. Taken together, the data suggest that oxidative programs, rather than IL-17A production, may represent the principal pathogenic axis downstream of neutrophils, and this deserves deeper discussion.

      Author's response: Thank you for raising these valuable points. We will address them as follows:

      1) To address the changes in the NADPH oxidase-related gene signature, we will measure ROS production in neutrophils from skin and peripheral blood with DHR123.

      2) Regarding the contribution of neutrophils to IL-17A, we will use flow cytometry to assess the Th17 and gamma/delta T cell populations in the skin of psoriatic mice treated with anti-Ly6G or isotype-matched antibody, as suggested by Reviewer#1.

      3) We will address the downregulation of the canonical pro-inflammatory and IL-17 pathways in psoriatic neutrophils in the Discussion.

      Human data reanalysis (Figure 6):

      The re-analysis of bulk and single-cell RNA-seq datasets is valuable but incomplete. Several mechanistically relevant questions could be addressed with the available data:

      2.1. GM-CSF (CSF2) is also strongly upregulated in psoriatic lesions (bulk RNA-seq). It would be informative to determine whether endothelial cells also express CSF2 in the scRNA-seq dataset, as this would suggest coordinated regulation of myeloid-supporting cytokines.

      2.2. Myeloid cell subsets should be examined more closely. A comparison of human myeloid transcriptomes with the mouse neutrophil RNA-seq would clarify whether similar IL-17A-related or NADPH oxidase-related signatures occur in human disease. In particular, which cell types express IL17A in human lesions?

      2.3. Chemokine production should be attributed to specific cell types. Bulk RNA-seq confirms strong induction of CXCL1, CXCL2, CXCL5, but the scRNA-seq dataset allows determining whether these chemokines originate from endothelial cells or other stromal/immune populations. This information is important for defining whether endothelial cells coordinate both neutrophil recruitment and granulopoiesis.

      Addressing these points would make the human-mouse comparison substantially stronger.

      Author's response: Thank you for pointing out these important issues. Reanalysis of the dataset yielded the following:

      2.1) CSF2 is expressed by the T-cell cluster in the human skin dataset (Explanation Figure 4), in agreement with a previous murine study (Hartwig et al. Cell Reports. 2018, PMID: 30590032). We will add these data to the revised manuscript.

      2.2) In line with point#10 from Reviewer#1, the dataset clearly shows the T-cell cluster as the main IL17A source (Explanation Figure 4 above). The dataset, however, does not contain phenotypic neutrophils (CEACAM8 (CD66b) and PGLYRP1) but does contain monocytes, macrophages, and dendritic cells (Explanation Figure 5 above). This loss was probably due to a technical limitation, given the difficulty of capturing sequencing reads from fragile neutrophils. Neutrophil IL-17 expression therefore cannot be reanalyzed in this dataset.

      2.3) Reanalysis of CXCLs in the human scRNA-seq dataset (GSE173706) clarified their expression patterns and cellular sources under normal and psoriatic conditions. In normal skin, all examined cell subsets show only low CXCL expression. In contrast, psoriatic skin exhibits significant CXCL upregulation, with distinct cell subsets showing dramatic upregulation and potentially being the major CXCL sources. CXCL1 is markedly upregulated in fibroblasts, myeloid cells, and melanocytes and nerve cells. CXCL2 is strikingly upregulated in myeloid cells, while CXCL5 is strongly increased in fibroblasts, myeloid cells, and mast cells (Explanation Figure 7). Taken together, these results suggest that CXCL upregulation in psoriatic skin is coordinated by both stromal and immune compartments. Of note, endothelial cells show minimal changes in CXCL expression and even downregulate CXCL2 in psoriasis, indicating that they are unlikely to be the major contributor to CXCL-mediated neutrophil recruitment.
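
      Attributing a chemokine to its cellular source, as in point 2.3, amounts to averaging expression per cell type and condition. The following is a minimal sketch of that idea in plain Python; it is not the analysis pipeline used for GSE173706. The function `mean_expression_by_group`, the record layout, and all expression values are hypothetical and made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_expression_by_group(cells: List[dict], gene: str) -> Dict[Tuple[str, str], float]:
    """Average one gene's expression for every (cell_type, condition) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for cell in cells:
        key = (cell["cell_type"], cell["condition"])
        # Genes missing from a cell's record are treated as not detected (0.0).
        totals[key] += cell["expr"].get(gene, 0.0)
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up toy cells; the values do not come from any dataset.
toy_cells = [
    {"cell_type": "fibroblast", "condition": "normal", "expr": {"CXCL1": 0.0}},
    {"cell_type": "fibroblast", "condition": "psoriasis", "expr": {"CXCL1": 2.0}},
    {"cell_type": "fibroblast", "condition": "psoriasis", "expr": {"CXCL1": 4.0}},
    {"cell_type": "endothelial", "condition": "normal", "expr": {"CXCL1": 1.0}},
    {"cell_type": "endothelial", "condition": "psoriasis", "expr": {"CXCL1": 1.0}},
]

means = mean_expression_by_group(toy_cells, "CXCL1")
for (cell_type, condition), value in sorted(means.items()):
    print(cell_type, condition, value)
```

      Comparing the normal and psoriasis means within each cell type then indicates which populations drive the change.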

      **Referees cross-commenting**

      I agree with Reviewer 1 that the contribution of EC-derived G-CSF to circulating G-CSF levels and to emergency myelopoiesis requires additional genetic or neutralization experiments to be fully established.

      Author's response: We appreciate that Reviewer#2 raised this key point. In addition to measuring serum G-CSF after intradermal anti-G-CSF administration (point#1 from Reviewer#1 above), we will also examine signs of emergency myelopoiesis in vivo.

      Minor points

      1. Line 319: the text likely refers to Figure S4, not S3.

      Author's response: Thank you, we will correct the figure reference.

      Line 338: "psoriatic" is misspelled.

      Author's response: Thank you, we will change this to "psoriatic".

      Reviewer #3

      • Place the work in the context of the existing literature (provide references, where appropriate).

      Psoriasis is extensively studied, a good recent reference- https://doi.org/10.1016/j.mam.2024.101306

      Author's response: We thank Reviewer#3 for the suggestion. The referenced study highlights the current paradigm, which largely focuses on skin-restricted mechanisms and overlooks potential cross-organ interactions in psoriatic inflammation. Our findings provide new insight into skin-bone marrow crosstalk in this disease context. In addition, the suggested reference underscores the key roles of diverse innate immune cells, including neutrophils, eosinophils, and dendritic cells, which is fundamental to our study and might also guide future exploration of innate cell subsets beyond neutrophils. We will therefore include the reference in our revised manuscript.

      • Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

      It is all good. May add graphical-abstract.

      Author's response: We thank the reviewer for this input. We agree that a graphical abstract will help readers grasp the key messages of our manuscript more clearly, and we will include one in the revised manuscript.

      Major comments:

      • Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

      No. It is very solid.

      Author's response: We appreciate the reviewer's view that the claims in our paper are solid.

      • Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      Such a discovery clearly opens many options, and it is fascinating to suggest additional experiments for future studies. It is a complete study, best to publish as-is and let many to read and proceed with this new concept.

      Author's response: We thank the reviewer for noting that the current experimental evidence is complete and that no additional experiments are necessary at this stage. We agree that the discovery opens prospective directions for future studies.

      • Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.

      N/A - I suggest no additional experiments at this point. Get it published and see how many will follow this new direction!

      Author's response: We thank the reviewer for recognizing that the experimental data are sufficient to serve as a foundation for future research.

      • Are the data and the methods presented in such a way that they can be reproduced?

      Yes.

      Author's response: We thank the reviewer for recognizing that our methods are reproducible.

      • Are the experiments adequately replicated, and is the statistical analysis adequate?

      Yes. The data are of very high quality.

      Author's response: We are grateful that the reviewer considers our replication strategy and statistical analysis to be of high quality.

      Minor comments:

      • Specific experimental issues that are easily addressable.

      None. It is good as-is. One may always suggest minor things- but this one is better published so many laboratories may rush for this new direction. I think it will be interesting studying some long-term impacts, and changes not only of neutrophils but also of other innate cells, such as DCs, Macrophages, and Eosinophils - so it is best to let laboratories that focus on these cells know of the discovery and pursue independent studies.

      Author's response: We appreciate the reviewer's assessment that our paper is already well set for the community to explore the newly proposed direction.

      • Are the text and figures clear and accurate?

      Yes.

      Author's response: We thank the reviewer for this evaluation. We have ensured that the text and figures in our manuscript are clear and accurate. Once again, we thank the reviewer for the encouraging and constructive appraisal. We are pleased that the reviewer finds the manuscript already strong and suitable for publication.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary The manuscript by Aarts et al. explores the role of GRHL2 as a regulator of the progesterone receptor (PR) in breast cancer cells. The authors show that GRHL2 and PR interact in a hormone-independent manner and based on genomic analyses, propose that they co-regulate target genes via chromatin looping. To support this model, the study integrates both newly generated and previously published datasets, including ChIP-seq, CUT&RUN, RNA-seq, and chromatin interaction assays, in breast cancer cell models (T47DS and T47D).

      Major comments: R1.1 Novelty of GRHL2 in steroid receptor biology The role of GRHL2 as a co-regulator of steroid hormone receptors has previously been described for ER (J Endocr Soc. 2021;5(Suppl 1):A819) and AR (Cancer Res. 2017;77:3417-3430). In the ER study, the authors also employed a GRHL2 ΔTAD T47D cell model. Therefore, while this manuscript extends GRHL2 involvement to PR, the contribution appears incremental rather than conceptual.

      We are fully aware of the previously described role of GRHL2 as a co-regulator of steroid hormone receptors, particularly ER and AR. As acknowledged in our introduction (lines 104-108), we explicitly state: "Grainyhead-like 2 (GRHL2) has recently emerged as a potential pioneer factor in hormone receptor-positive cancers, including breast cancer21. However, nearly all studies to date have focused on GRHL2 in the context of ER and estrogen signaling, leaving its role in PR- and progesterone-mediated regulation unexplored22-26".

      As for the specific publications that the reviewer refers to: The first refers to an abstract from an annual meeting of the Endocrine Society. As we have been unable to assess the original data underpinning the abstract - including the mentioned GRHL2 ΔTAD model - we prefer not to cite this particular reference. We do cite other work by the same authors (Reese et al. 2022, our ref. 25). We also cite the AR study mentioned by the reviewer (our ref. 55) in our discussion. As such, we think we do give credit to prior work done in this area.

      By characterizing GRHL2 as a co-regulator of the progesterone receptor (PR), we expand on the current understanding of GRHL2 as a common transcriptional regulator within the broader context of steroid hormone receptor biology. Given that ER and PR are frequently co-expressed and active within the same breast cancer cells, our findings raise the important possibility that GRHL2 may actively coordinate or modulate the balance between ER- and PR-driven transcriptional programs, as postulated in the discussion paragraph.

      Importantly, we also functionally link PR/GRHL2-bound enhancers to their target genes (Fig5), providing novel insights into the downstream regulatory networks influenced by this interaction. These results not only offer a deeper mechanistic understanding of PR signaling in breast cancer but also lay the groundwork for future comparative analyses between GRHL2's role in ER-, AR-, and PR-mediated gene regulation.

      As such, we respectfully suggest that our work offers more than an incremental advance in our knowledge and understanding of GRHL2 and steroid hormone receptor biology.

      R1.2 Mechanistic depth The study provides limited mechanistic insight into how GRHL2 functions as a PR co-regulator. Key mechanistic questions remain unaddressed, such as whether GRHL2 modulates PR activation, the sequential recruitment of co-activators/co-repressors, engages chromatin remodelers, or alters PR DNA-binding dynamics. Incorporating these analyses would considerably strengthen the mechanistic conclusions.

      Although our RNA-seq data demonstrate that GRHL2 modulates the expression of PR target genes, and our CUT&RUN experiments show that GRHL2 chromatin binding is reshaped upon R5020 exposure, we acknowledge that we have not further dissected the molecular mechanisms by which GRHL2 functions as a PR co-regulator.

      We did consider several follow-up experiments to address this, including PR CUT&RUN in GRHL2 knockdown cells, CUT&RUN for known co-activators such as KMT2C/D and P300, as well as functional studies involving GRHL2 TAD and DBD mutants. However, due to technical and logistical challenges, we were unable to carry out these experiments within the timeframe of this study.

      That said, we fully recognize that such approaches would provide deeper mechanistic insight into the interplay between PR and GRHL2. We have therefore explicitly acknowledged this limitation in our limitations of the study section (line 502-507) and mention this as an important avenue for future investigation.

      R1.3 Definition of GRHL2-PR regulatory regions (Figure 2) The 6,335 loci defined as GRHL2-PR co-regulatory regions are derived from a PR ChIP-seq performed in the presence of hormone and a GRHL2 ChIP-seq performed in its absence. This approach raises doubts about whether GRHL2 and PR actually co-occupy these regions under ligand stimulation. GRHL2 ChIP-seq experiments in both hormone-treated and untreated conditions are necessary to provide stronger support for this conclusion.

      We agree that the ChIP-seq experiments we present do not definitively show whether GRHL2 and PR co-occupy these regions under ligand stimulation; indeed, bulk ChIP-seq cannot demonstrate simultaneous binding of PR and GRHL2 at the same genomic regions. As a first step to address this, we performed CUT&RUN experiments for both GRHL2 and PR under untreated and R5020-treated conditions. These experiments revealed a subset of overlapping PR and GRHL2 binding sites (approximately 5% of the identified PR peaks under ligand stimulation).

      We specifically chose CUT&RUN to minimize artifacts from crosslinking and sonication, thereby reducing background and enabling the mapping of high-confidence direct DNA-binding events. Given that a fraction of GRHL2 physically interacts with PR (Fig1D), it is possible that ChIP-seq detects indirect binding of GRHL2 at PR-bound sites and vice versa. CUT&RUN, by contrast, allows us to identify direct binding sites with higher confidence.

      Nonetheless, although outside the scope of the current manuscript, we agree that a dedicated GRHL2 ChIP with and without ligand stimulation would provide additional insight, and we have accordingly added this suggestion to the discussion (line 502-507).
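
      The fraction of PR peaks that overlap a GRHL2 peak, as quoted in this response, can be illustrated with a simple interval comparison. This is a minimal sketch, assuming peaks are half-open (chromosome, start, end) intervals and that any overlap of at least 1 bp counts; it is not the authors' peak-calling or overlap procedure, and the function name and toy coordinates are hypothetical. A dedicated interval tool would normally be used for genome-scale data.

```python
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def overlapping_peaks(query: List[Peak], reference: List[Peak]) -> List[Peak]:
    """Return the query peaks that overlap at least one reference peak by >= 1 bp."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = []
    for chrom, start, end in query:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < ref_end and ref_start < end
               for ref_start, ref_end in by_chrom.get(chrom, [])):
            hits.append((chrom, start, end))
    return hits


# Made-up toy peaks for illustration only.
pr_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
grhl2_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]

shared = overlapping_peaks(pr_peaks, grhl2_peaks)
fraction = len(shared) / len(pr_peaks)
print(shared, round(fraction, 2))
```

      The same function applied in the other direction (GRHL2 peaks as the query) gives the fraction of GRHL2 sites shared with PR, which generally differs because the two peak sets have different sizes.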

      R1.4 Cell model considerations The manuscript relies heavily on the T47DS subclone, which expresses markedly higher PR levels than parental T47D cells (Aarts et al., J Mammary Gland Biol Neoplasia 2023; Kalkhoven et al., Int J Cancer 1995). This raises concerns about physiological relevance. Key findings, including co-IP and qPCR-ChIP experiments, should be validated in additional breast cancer models such as parental T47D, BT474, and MCF-7 cells to generalize the conclusions. Furthermore, data obtained from T47D (PR ChIP-seq, HiChIP, CTCF and Rad21 ChIP-seq) and T47DS (RNA-seq, CUT&RUN) are combined along the manuscript. Given the substantial differences in PR expression between these cell lines, this approach is problematic and should be reconsidered.

      We agree that physiological relevance is important to consider. Here, all existing model systems have some limitations. In our experience, it is technically challenging to robustly measure gene expression changes in parental T47D cells (or MCF7 cells, for that matter) in response to progesterone stimulation (Aarts et al., J Mammary Gland Biol Neoplasia 2023). As we set out to integrate PR and GRHL2 binding to downstream target gene induction, we therefore opted for the most progesterone responsive model system (T47DS cells). We agree that observations made in T47D and T47DS cells should not be overinterpreted and require further validation. We have now explicitly acknowledged this and added it to the discussion (line 507-509).

      As for the reviewer's suggestion to use MCF7 cells: apart from its suboptimal PR-responsiveness, this cell line is also known to harbor GRHL2 amplification, resulting in elevated GRHL2 levels (Reese et al., Endocrinology 2019). By that line of reasoning, the use of MCF7 cells would also introduce concerns about physiological relevance. That being said, and as noted in the discussion (line 390-391), the study by Mohammed et al., which identified GRHL2 as a PR interactor using RIME, was performed in both MCF7 and T47D cells. This further supports the notion that the PR-GRHL2 interaction is not limited to a single cell line.

      R1.5 CUT&RUN vs ChIP-seq data The CUT&RUN experiments identify fewer than 10% of the PR binding sites reported in the ChIP-seq datasets. This discrepancy likely results from methodological differences (e.g., absence of crosslinking, potential loss of weaker binding events). The overlap of only 158 sites between PR and GRHL2 under hormone treatment (Figure 3B) provides limited support for the proposed model and should be interpreted with greater caution.

      We acknowledge the discrepancy in the number of binding sites between ChIP-seq and CUT&RUN. Indeed, methodological differences likely contribute to the differences in PR binding sites reported between the ChIP-seq and CUT&RUN datasets. As the reviewer correctly notes, the absence of crosslinking and sonication in CUT&RUN reduces detection of weaker binding events. However, it also reduces the detection of indirect binding events, which could increase the reported number of peaks in ChIP-seq data (e.g. the common presence of "shadow peaks").

      As also discussed in our response to R1.3, we deliberately chose the CUT&RUN approach to enable the identification of high-confidence direct DNA-binding events. Since GRHL2 physically interacts with PR, ChIP-seq could potentially capture indirect binding of GRHL2 at PR-bound sites, and vice versa. By contrast, CUT&RUN primarily captures direct DNA-protein interactions, offering a more specific binding profile. Thus, while the number of CUT&RUN binding sites is much smaller than previously reported by ChIP-seq, we are confident that they represent true, direct binding events.

      We would also like to emphasize that the model presented in figure 6 does not represent a generic or random gene, but rather a specific gene that is co-regulated by both GRHL2 and PR. In this specific case, regulation is proposed to occur via looping interactions from either individual TF-bound sites (e.g., PR-only or GRHL2-only) or shared GRHL2/PR sites. We do not propose that only shared sites are functionally relevant, nor do we assume that GRHL2 and PR must both be directly bound to DNA at these shared sites. Therefore, overlapping sites identified by ChIP-seq, potentially reflecting indirect binding events, could indeed be missed by CUT&RUN, yet still contribute to gene regulation. To clarify this, we have revised the main text (lines 331-334) and the legend of Figure 6 to explicitly state that the model refers to a gene with established co-regulation by both GRHL2 and PR.

      R1.6 Gene expression analyses (Figure 4) The RNA-seq analysis after 24 hours of hormone treatment likely captures indirect or secondary effects rather than the direct PR-GRHL2 regulatory program. Including earlier time points (e.g., 4-hour induction) in the analysis would better capture primary transcriptional responses. The criteria used to define PR-GRHL2 co-regulated genes are not convincing and may not reflect the regulatory interactions proposed in the model. Strong basal expression changes in GRHL2-depleted cells suggest that much of the transcriptional response is PR-independent, conflicting with the model (Figure 6). A more straightforward approach would be to define hormone-regulated genes in shControl cells and then examine their response in GRHL2-depleted cells. Finally, integrating chromatin accessibility and histone modification datasets (e.g., ATAC-seq, H3K27ac ChIP-seq) would help establish whether PR-GRHL2-bound regions correspond to active enhancers, providing stronger functional support for the proposed regulatory model.

      We thank the reviewer for pointing this out. We now recognize that our criteria for selecting PR/GRHL2 co-regulated genes were not clearly described. To address this, we have revised our approach as per the reviewer's suggestion: we first identified early and sustained PR target genes based on their response at 4 and 24 hours of induction and subsequently overlaid this list with the gene expression changes observed in GRHL2-depleted cells. This revised approach reduced the number of PR-responsive, GRHL2-regulated target genes from 549 to 298 (46% reduction). We consequently updated all following analyses, resulting in revised figures 4 and 5 and supplementary figures 2, 3 and 4. As a result of this revised approach, the number of genes that are transcriptionally regulated by GRHL2 and PR (RNA-seq data) that also harbor a PR loop anchor at or near their TSS after 30 minutes of progesterone stimulation (PR HiChIP data) dropped from 114 to 79 (30% reduction). We thank the reviewer for suggesting this more straightforward approach and want to emphasize that our overall conclusions remain unaltered.
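      The revised selection described above can be sketched with simple set operations. This is only an illustration: the gene identifiers are hypothetical, and treating "early and sustained" as the intersection of the 4-hour and 24-hour hormone-responsive gene sets is an assumption about how the two time points were combined.

```python
# Toy illustration of the revised gene-selection logic.
# All gene identifiers below are hypothetical placeholders.

def sustained_pr_targets(de_4h, de_24h):
    """Genes that are hormone-regulated in shControl cells at both 4 h and 24 h.

    Assumption: "early and sustained" is modelled as a plain intersection.
    """
    return set(de_4h) & set(de_24h)


def co_regulated(pr_targets, grhl2_dependent):
    """PR target genes whose expression also changes upon GRHL2 depletion."""
    return set(pr_targets) & set(grhl2_dependent)


# Hypothetical differential-expression calls
de_4h = {"geneA", "geneB", "geneC", "geneD"}
de_24h = {"geneB", "geneC", "geneD", "geneE"}
grhl2_dependent = {"geneC", "geneD", "geneF"}

pr_targets = sustained_pr_targets(de_4h, de_24h)
shared = co_regulated(pr_targets, grhl2_dependent)
print(sorted(shared))
```

      Defining the hormone-responsive set first, and only then overlaying the knockdown data, keeps genes that change upon GRHL2 depletion but do not respond to hormone (such as the hypothetical geneF) out of the co-regulated list.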

      As above in our response to R1.3, we want to emphasize that the model presented in figure 6 does not depict a generic or randomly chosen gene, but a gene that is specifically co-regulated by both GRHL2 and PR. We also want to emphasize that the majority of GRHL2's transcriptional activity is PR-independent. This is consistent with the limited fraction of GRHL2 that co-immunoprecipitated with PR (Figure 1D), and with the well-established roles of GRHL2 beyond steroid receptor signaling. In fact, the overall importance of GRHL2 for cell viability in T47D(S) cells is underscored by our inability to generate a full knockout (multiple failed attempts of CRISPR/Cas mediated GRHL2 deletion in T47D(S) and MCF7 cells), and by the strong selection we observed against high-level GRHL2 knockdown using shRNA.

      As for the reviewer's suggestion to assess whether GRHL2/PR co-bound regions correspond to active enhancers by integrating H3K27ac and ATAC-seq data: We have re-analyzed publicly available H3K27ac and ATAC-seq datasets from T47D cells (references 42 and 43). These analyses are now added to figure 2 (F and G). The H3K27ac profile suggests that GRHL2-PR overlapping sites indeed correspond to more active enhancers (Figure 2F), with a proposed role for GRHL2 since siGRHL2 affects the accessibility of these sites (Figure 2G).

      Minor comments Page 19: The statement that "PR and GRHL2 trigger extensive chromatin reorganization" is not experimentally supported. ATAC-seq would be an appropriate method to test this directly.

      We agree with the reviewer and have removed this sentence, as it does not contribute meaningfully to the flow of the manuscript.

      Prior literature on GRHL2 as a steroid receptor co-regulator should be discussed more thoroughly.

      We have now added additional literature on GRHL2 as a steroid hormone receptor co-regulator in the discussion (lines 397-401) and we cite the papers suggested by R1 in R1.1 (references 25 and 54).

      Reviewer #1 (Significance (Required)):

      The identification of novel PR co-regulators is an important objective, as the mechanistic basis of PR signaling in breast cancer remains incompletely understood. The main strength of this study lies in highlighting GRHL2 as a factor influencing PR genomic binding and transcriptional regulation, thereby expanding the repertoire of regulators implicated in PR biology.

      That said, the novelty is limited, given the established roles of GRHL2 in ER and AR regulation. Mechanistic insight is underdeveloped, and the reliance on an engineered T47DS model with supra-physiological PR levels reduces the general impact. Without validation in physiologically relevant breast cancer models and clearer separation of direct versus indirect effects, the overall advance remains modest.

      The manuscript will be of interest to a specialized audience in the fields of nuclear receptor signaling, breast cancer genomics, and transcriptional regulation. Broader appeal, including translational or clinical relevance, is limited in its current form.

      We have addressed all of these points in our response above and agree that with our implemented changes, this study should reach (and appeal to) an audience interested in transcriptional regulation, chromatin biology, hormone receptor signaling and breast cancer.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The authors present a study investigating the role of GRHL2 in hormone receptor signaling. Previous research has primarily focused on GRHL2 interaction with estrogen receptor (ER) and androgen receptor (AR). In breast cancer, GRHL2 has been extensively studied in relation to ER, while its potential involvement with the progesterone receptor (PR) remains largely unexplored. This is the rationale of this study to investigate the relation between PR and GRHL2. The authors demonstrate an interaction between GRHL2 and PR and further explore this relationship at the level of genomic binding sites. They also perform GRHL2 knockdown experiments to identify target genes and link these transcriptional changes back to GRHL2-PR chromatin occupancy. However, several conceptual and technical aspects of the study require clarification to fully support the authors' conclusions.

      R2.1 Given the high sequence similarity among GRHL family members, this raises questions about the specificity of the antibody used for GRHL2 RIME. The authors should address whether the antibody cross-reacts with GRHL1 or GRHL3. For example, GRHL1 shows a higher log fold change than GRHL2 in the RIME data.

      Indeed, GRHL1, GRHL2, and GRHL3 are structurally related. They share a similar domain organization and are all ±70 kDa in size. Sequence similarity is primarily confined to the DNA-binding domain, with GRHL2 and GRHL3 showing 81% similarity in this region, and GRHL1 showing 63% similarity to GRHL2/3 (Ming, Nucleic Acids Res 2018).

      The antibody used, sourced from the Human Protein Atlas, is widely used in the field. It targets an epitope within the transactivation domain (TAD) of GRHL2, a region with relatively low sequence similarity to the corresponding domains in GRHL1 and GRHL3.

      We assessed the specificity of the antibody using western blotting (Supplementary Figure 2A) in T47DS wild-type and GRHL2 knockdown cells. As expected, GRHL2 protein levels were reduced in the knockdown cells, providing convincing evidence that the antibody recognizes GRHL2. The remaining signal in shGRHL2 knockdown cells could either be due to remaining GRHL2 protein or due to the antibody detecting GRHL1/3. Furthermore, the observed high log-fold enrichment of GRHL1 in our RIME may reflect known heterodimer formation between GRHL1 and GRHL2, rather than antibody cross-reactivity. Nevertheless, we cannot formally rule out cross-reactivity and have mentioned this in the limitations section (lines 497-501).

      R2.2 In addition, in RIME experiments, one would typically expect the bait protein to be among the most highly enriched proteins compared to control samples. If this is not the case, it raises questions about the efficiency of the pulldown, antibody specificity, or potential technical issues. The authors should comment on the enrichment level of the bait protein in their data to reassure readers about the quality of the experiment.

      We agree with the reviewer that this information is crucial for assessing the quality of the experiment. We have therefore added the enrichment levels (log₂ fold change of IgG control over pulldown) to the methods section (line 592).

      As the reviewer notes, GRHL2 was not among the top enriched proteins in our dataset. This is due to unexpectedly high background binding of GRHL2 to the IgG control antibody/beads, for which we currently have no explanation. As a result, although we detected many unique GRHL2 peptides, observed high sequence coverage (>70%), and GRHL2 ranked among the highest in both iBAQ and LFQ values, its relative enrichment was reduced due to the elevated background. During our RIME optimization, Coomassie blue staining of input and IP samples revealed a band at the expected molecular weight of GRHL2 in the pulldown samples that was absent in the IgG control (see figure 1 for the reviewer below, 4 right lanes), supporting the conclusion that GRHL2 is specifically enriched in our GRHL2 RIME samples. Combined with enrichment of some of the expected interacting proteins (e.g. KMT2C and KMT2D), we are convinced that the experiment is of sufficient quality to support our conclusions.

      Figure 1 for reviewer: Coomassie blue staining of input and IP GRHL2 and IgG RIME samples. NT = non-treated, T = treated.

      R2.3 The authors report log2 fold changes calculated using iBAQ values for the bait versus IgG control pulldown. While iBAQ provides an estimate of protein abundance within samples, it is not specifically designed for quantitative comparison between samples without appropriate normalization. It would be helpful to clarify the normalization strategy applied and consider using LFQ intensities.

      We understand the reviewer's concern. Due to the high background observed in the IgG control sample (see R2.2), the LFQ-based normalization did not accurately reflect the enrichment of GRHL2, which was clearly supported by other parameters such as the number of unique peptides (see rebuttal Table 1). After discussions with our Mass Spectrometry facility, we decided to consider the iBAQ values, which reflect the absolute protein abundance within each sample, as a valid and informative measure of enrichment. In the context of elevated background levels, iBAQ provides an alternative and reliable metric for assessing protein enrichment and was therefore used for our interactor analysis.

      | Protein | Unique peptides | iBAQ GRHL2 | iBAQ IgG | LFQ GRHL2 | LFQ IgG |
      |---|---|---|---|---|---|
      | GRHL2 | 52 | 1753400.00 | 155355.67 | 5948666.67 | 3085700.00 |
      | GRHL1 | 23 | 56988.33 | 199.03 | 334373.33 | 847.23 |

      *Table 1. Unique peptide, iBAQ and LFQ values of the GRHL2 and IgG pulldowns for GRHL2 and GRHL1.*
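      To make the comparison between the two metrics concrete, the log2 fold changes implied by Table 1 can be computed directly. The sketch below assumes the ratio is taken as GRHL2 pulldown over IgG control and that no pseudocount or further normalization is applied. It is not necessarily the exact procedure used for the interactor analysis.

```python
import math

# Values copied from rebuttal Table 1, stored as (GRHL2 pulldown, IgG control).
table1 = {
    "GRHL2": {"iBAQ": (1753400.00, 155355.67), "LFQ": (5948666.67, 3085700.00)},
    "GRHL1": {"iBAQ": (56988.33, 199.03), "LFQ": (334373.33, 847.23)},
}


def log2_fold_change(pulldown, control):
    """log2 ratio of pulldown over control (assumed direction, no pseudocount)."""
    return math.log2(pulldown / control)


log2fc = {
    protein: {metric: log2_fold_change(*pair) for metric, pair in metrics.items()}
    for protein, metrics in table1.items()
}

for protein, metrics in log2fc.items():
    for metric, value in metrics.items():
        print(f"{protein} {metric}: log2FC = {value:.2f}")
```

      With these numbers, GRHL2 shows a markedly higher log2 fold change on iBAQ than on LFQ intensities, and GRHL1 shows a higher log2 fold change than GRHL2 on both metrics, in line with the observation raised in R2.1.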

      R2.4 Other studies have reported PR RIME, which could be a valuable source to investigate whether GRHL proteins were detected.

      We thank the reviewer for pointing this out. We are aware of the PR RIME, generated by Mohammed et al., which we refer to in the discussion (lines 390-391). This study indeed identified GRHL2 as a PR-interacting protein in MCF7 and T47D cells. Although they do not mention this interaction in the text, the interaction is clearly indicated in one of the figures from their paper, which supports our findings. To our knowledge, no other PR RIME datasets in MCF7 or T47D cells have been published to date.

      R2.5 In line 137, the term "protein score" is mentioned. Could the authors please clarify what this means and how it was calculated.

      We agree that this point was not clearly explained in the original text. The scores presented reflect the MaxQuant protein identification confidence, specifically the sum of peptide-level scores (from Andromeda), which indicates the relative confidence in protein detection. We have now added this clarification to line 137 and to the legend of Figure 1.

      R2.6 In line 140-141. The fact that GRHL2 interacts with chromatin remodelers does not by itself prove that GRHL2 acts as a pioneer factor or chromatin modulator. Demonstrating pioneer function typically requires direct evidence of chromatin opening or binding to closed chromatin regions (e.g., ATAC-seq, nucleosome occupancy assays). I recommend revising this statement or providing supporting evidence.

      We agree that the fact that GRHL2 interacts with chromatin remodelers does not by itself prove that GRHL2 acts as a pioneer factor or chromatin modulator. However, a previous study (Jacobs et al., Nature Genetics, 2018) has shown directly that the GRHL family members (including GRHL2) have pioneering function and regulate the accessibility of enhancers. We adapted lines 140-141 to state this more clearly. In addition, our newly added data in Figure 2G also support a role for GRHL2 in regulating chromatin accessibility in T47D cells.

      R2.7 The pulldown Western blot lacks an IgG control in panel D.

      This is correct. As the co-IP in Figure 1D served as a validation of the RIME and was specifically aimed at determining the effect of hormone treatment on the observed PR/GRHL2 interaction, we did not perform this control given the scale of the experiment. However, during RIME optimization, we performed GRHL2 staining of the IgG controls by western blot, shown in figure 2 for the reviewer below. As stated above, some background GRHL2 signal was observed in the IgG samples, but a clear enrichment is seen in the GRHL2 IP.

      Taken together, we believe that the well-controlled RIME, combined with the co-IP presented, provides strong evidence that the observed signal reflects a genuine GRHL2-PR interaction.

      Figure 2 for reviewer: WB of input and IP GRHL2 and IgG RIME samples stained for GRHL2. NT = non-treated, T = treated

      R2.8 Depending on the journal and target audience, it may be helpful to briefly explain what R5020 is at its first mention (line 146).

      Thank you. We have adapted this accordingly.

      R2.9 The authors state that three technical replicates were performed for each experimental condition. It would be helpful to clarify the expected level of overlap between biological replicates of RIME experiments. This clarification is necessary, especially given the focus on uniquely enriched proteins in untreated versus treated cells, and the observation that some identified proteins in specific conditions are not chromatin-associated. Replicates or validations would strengthen the findings.

      We use the term technical rather than biological replicates because for cell lines, defining true biological replicates is challenging, as most variability arises from experimental rather than biological differences. To introduce some variation, we split our T47DS cells into three parallel dishes 5 days prior to starting the treatment. We did this purposely, to minimize the likelihood that proteins identified as uniquely enriched are artifacts. Each of the three technical replicates comes from one of these three parallel splits (so equal passage numbers but propagated in parallel dishes for 5 days before the start of the experiment).

      To generate the three technical replicates for our RIME, we plated cells from the parallel grown splits. Treatments for the three replicates were performed per replicate. Samples were crosslinked, harvested and lysed for subsequent RIME analysis. For technical and logistical reasons, the three replicates were processed in parallel. To clarify the experimental setup, we have updated the methods section accordingly (lines 566-568).

      As for the detection of non-chromatin-associated proteins: We cannot rule out that these are artifacts, as they may arise from residual cytosolic lysate during nuclear extraction. Alternatively, they could reflect a more dynamic subcellular localization of these proteins than currently annotated or appreciated.

      R2.10 The volcano plot for the RIME experiment appears to show three distinct clusters of proteins on the right, which is unusual for this type of analysis. The presence of these apparent groupings may suggest an artifact from the data processing, such as imputation. Can the authors clarify the origin of these groupings? If it is due to imputation or missing values, I recommend applying a stricter threshold, such as requiring detection in all three replicates (3/3) to improve the robustness of the enrichment analysis and increase confidence in the identified interactors.

      We thank the reviewer for pointing this out. As suggested, we re-evaluated the imputation and applied a stricter threshold, requiring detection in all three replicates. Indeed, the separate clusters were due to missing values. We have therefore revised the imputation method and now impute values based on the normal distribution. Using this revised analysis, we identify 2352 GRHL2 interactors instead of 1140, but the number of interacting proteins annotated as transcription factors or chromatin-associated/modifying proteins was still 103. Figure 1B, 1E, and Supplementary Figure 4A have been updated accordingly. We also revised the methods section to reflect this change. We think this suggestion has improved our analysis of the data.
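      For readers less familiar with this type of processing, a minimal sketch of a 3/3 detection filter followed by imputation from a normal distribution is given below. It is an illustration under stated assumptions, not the pipeline used for the revised figures: the protein names and log2 intensities are hypothetical, the filter is applied to the pulldown replicates only, and the down-shift of 1.8 standard deviations with a width of 0.3 follows commonly used proteomics defaults that are assumed here rather than taken from the methods.

```python
import random
import statistics


def passes_filter(values, n_required=3):
    """Keep a protein only if it was quantified in at least n_required replicates."""
    return sum(v is not None for v in values) >= n_required


def impute_downshifted_normal(values, observed, shift=1.8, width=0.3, rng=None):
    """Replace missing values (None) with draws from a narrow normal distribution
    centred below the mean of all observed values.

    shift and width are expressed in units of the observed standard deviation;
    both defaults are assumptions for this sketch.
    """
    rng = rng or random.Random(0)
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [
        v if v is not None else rng.gauss(mu - shift * sd, width * sd)
        for v in values
    ]


# Hypothetical log2 intensities: three pulldown (ip) and three IgG replicates
data = {
    "proteinA": {"ip": [25.1, 24.8, 25.3], "igg": [None, None, 20.2]},
    "proteinB": {"ip": [23.0, None, 22.7], "igg": [21.9, 22.1, None]},
    "proteinC": {"ip": [26.4, 26.0, 26.2], "igg": [24.0, 23.7, 24.1]},
}

# Stricter threshold: detection in all three pulldown replicates
kept = {name: reps for name, reps in data.items() if passes_filter(reps["ip"])}

# Pool all observed values of the retained proteins to parameterise the imputation
observed = [
    v for reps in kept.values() for group in reps.values() for v in group
    if v is not None
]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
imputed = {
    name: {
        group: impute_downshifted_normal(vals, observed, rng=rng)
        for group, vals in reps.items()
    }
    for name, reps in kept.items()
}
print(sorted(imputed))
```

      Drawing each missing value independently from a distribution, rather than substituting a single constant, avoids stacking proteins at identical fold changes, which is one way discrete clusters can arise in a volcano plot.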

      R2.11 The statement that "PR and GRHL2 frequently overlap" may be overstated given that only ~700 overlapping sites are reported (cut&run).

      We have replaced "frequently overlap" with "can overlap" (lines 229-230).

      R2.12 The model in Figure 6 suggests limited chromatin occupancy of PR and GRHL2 in hormone-depleted conditions, consistent with the known requirement of ligand for stable PR-DNA binding. However, Figure 1 shows no major difference in GRHL2-PR interaction between untreated and hormone-treated cells. This raises questions about where and how this interaction occurs in the absence of hormone. Since PR binding to chromatin is typically minimal without ligand, can the authors clarify this given that RIME data reflect chromatin-bound interactions.

      Indeed, the model in figure 6 suggests limited chromatin occupancy of PR and GRHL2 under hormone-depleted conditions. It is, however, important to note that the locus shown represents a gene regulated by both PR and GRHL2 - and not just any gene. We recognize that this was not sufficiently clear in the original version, and we have now clarified this in both the main text (line 331-334) and the figure legend.

      We propose that PR and GRHL2 bind or become enriched at enhancer sites associated with their target genes upon ligand stimulation. This is consistent with the known requirement of ligand for stable PR-DNA binding and with our observation that PR/GRHL2 overlapping peaks are detected only in the ligand-treated condition of the CUT&RUN experiment. Given the broader role of GRHL2, it also binds chromatin independently of progesterone and the progesterone receptor, which is why we included, but did not focus on, GRHL2-only binding events in our model.

      We would also like to clarify that, although RIME includes a nuclear enrichment step that enriches for chromatin-associated proteins, the pulldown is performed on nuclear lysates. Therefore, it captures both chromatin-bound protein complexes and freely soluble nuclear complexes, which unfortunately cannot be distinguished. GRHL2 is well established as a nuclear protein (Zeng et al., Cancers 2024; Riethdorf et al., International Journal of Cancer 2015), and although PR is classically described as translocating to the nucleus upon hormone stimulation, several studies, including our own, have shown that PR is continuously present in the nucleus (Aarts et al., J Mammary Gland Biol Neoplasia 2023; Frigo et al., Essays Biochem. 2021).

      We therefore propose that PR and GRHL2 may already interact in the nucleus without directly binding to chromatin. Given our observation that GRHL2 binding sites on the chromatin are redistributed upon R5020 mediated signaling activation, we hypothesize that such pre-formed PR-GRHL2 nuclear complexes may assist the rapid recruitment of GRHL2 to progesterone-responsive chromatin regions.

      We have expanded the discussion to include a dedicated section addressing this point (line 376-388).

      R2.13 It would be of interest to assess the overlap between the proteins identified in the RIME experiment and the motif analysis results.

      In the discussion section of our original manuscript, we highlighted some overlapping proteins in the RIME and motif analysis, including STAT6 and FOXA1. However, we had not yet systematically analyzed overlap in both analyses. To address this, we now compared all enriched motifs (so not only the top 5 as displayed in our figures) under GRHL2, PR, and GRHL2/PR shared sites from both the CUT&RUN and ChIP-seq datasets with the proteins identified as GRHL2 interactors in our RIME. Although we identified numerous GRHL2-associated proteins, relatively few of them were transcription factors whose binding motifs were also enriched under GRHL2 peaks.

      In our revised manuscript we have added a section in the discussion highlighting our systematic overlap of the results of our RIME experiment and the motif enrichment of the ChIP-seq and CUT&RUN analysis (line 415-436).

      R2.14 The authors chose CUT&RUN to assess chromatin binding of PR and GRHL2. Given that RIME is also based on chromatin immunoprecipitation - ChIP protocol, it would be helpful to clarify why CUT&RUN was selected over ChIP-seq for the DNA-binding assays. What is the overlap with published data?

      As also mentioned in our response to R1.3 and R1.5, we deliberately chose the CUT&RUN approach to minimize artifacts introduced by crosslinking and sonication, thereby reducing background and allowing the identification of high-confidence, direct DNA-binding events. Since GRHL2 physically interacts with PR, ChIP-seq could potentially capture indirect binding of GRHL2 at PR-bound sites (and vice versa). In contrast, CUT&RUN primarily detects direct DNA-protein interactions, providing a more specific and accurate binding profile. Additionally, CUT&RUN serves as an independent validation method for data obtained using ChIP-like protocols.

      Since CUT&RUN, similar to ChIP, can show limited reproducibility (Nordin et al., Nucleic Acids Research, 2024), and to our knowledge few PR CUT&RUN and no GRHL2 CUT&RUN datasets are currently available, it is challenging to directly compare our data with published datasets. Nevertheless, studies performing PR or ER CUT&RUN (Gillis et al., Cancer Research, 2024; Reese et al., Molecular and Cellular Biology, 2022) report a number of peaks comparable to that observed in our data, in the same range of thousands. This suggests that a single CUT&RUN experiment in general may detect fewer events than a single ChIP-seq experiment, but that the peaks that are found are likely to reflect direct binding events.

      Reviewer #2 (Significance (Required)):

      General Assessment: This study investigates the role of the transcription factor GRHL2 in modulating PR function, using RIME and CUT&RUN to explore protein-protein and protein-chromatin interactions. GRHL2 have been implicated in epithelial biology and transcriptional regulation and interaction with steroid hormone receptors has been reported. This study extends the field by showing a functional link between GRHL2 and PR, which has implications for understanding hormone-dependent gene regulation.

      The research will primarily interest a specialized audience in transcriptional regulation, chromatin biology, and hormone receptor signaling.

      Key words for this reviewer: chromatin biology, transcription factor function, epigenomics, and proteomics.

      We agree that with our implemented changes, this study should reach (and appeal to) an audience interested in transcriptional regulation, chromatin biology, hormone receptor signaling and breast cancer.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This study explores the important transcriptional coordination role of Grainyhead-like 2 (GRHL2) on the transcriptional regulatory function of progesterone receptor (PR). In this paper, the authors start with their recruitment characteristics, take into account their regulatory effects on downstream genes and their effects on the occurrence and development of breast cancer, and further clarify the coordination between them in three-dimensional space. The interaction between GRHL2 and PR, and the subsequent important influence on the co-regulated genes by GRHL2 and PR are analyzed. The overall framework of this study is mainly by RNA seq and CUT-TAG analysis, the molecular mechanism underlying the association between GRHL2 and PR and regulation function of two proteins in breast cancer is not clearly clarified. Some details need to be further improved:

      Major comments: R3.1 For Fig.1D, the molecular weight of each protein should be marked in the diagram, and the expression of GRHL2 in the input group should be supplemented.

      We apologize for not including molecular weights in our initial submission. We are not entirely clear what the reviewer means with their statement that "the expression of GRHL2 in the input group should be supplemented". The blot depicted in Figure 1D shows both the input signal and the IP. For the reviewer's information, the full Western blot is depicted below.

      Figure 3 for reviewer: Full WBs of input and IP GRHL2 samples stained for GRHL2 or PR. NT = non-treated, T = treated

      R3.2 In Fig.2B and Fig 5C, it should be describe well whether GRHL2 recruitment is in the absence or presence of R5020? How about the co-occupancy of PR and GRHL2 region, Promoter or enhancer region? It would be better to show histone marks such as H3K27ac and H3K4me1 to annotate the enhancer region.

      As also stated in our response to R1.3, we acknowledge that the ChIP-seq experiments cannot definitively determine whether GRHL2 and PR co-occupy genomic regions under ligand-stimulated conditions, since the GRHL2 dataset was generated in the absence of progesterone stimulation (as indicated in lines 167-169). To clarify this, we have now specified this detail in the legend of figure 2 by noting "untreated GRHL2 ChIP." To directly assess GRHL2 chromatin binding under progesterone-stimulated conditions, we performed CUT&RUN experiments for both GRHL2 and PR under untreated and R5020-treated conditions. These experiments revealed a subset of overlapping PR and GRHL2 binding sites (approximately 5% of all identified PR peaks).
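      As an illustration of how such an overlap fraction is typically derived, the sketch below counts the PR peaks that share at least one base with any GRHL2 peak. The coordinates are hypothetical, intervals are treated as half-open (BED-style), and the one-base overlap criterion is an assumption. The actual peak sets and overlap settings are those described in the methods.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(query_peaks, reference_peaks):
    """Return the fraction of query peaks that overlap any reference peak,
    together with the overlapping query peaks themselves."""
    hits = [p for p in query_peaks if any(overlaps(p, r) for r in reference_peaks)]
    return len(hits) / len(query_peaks), hits


# Hypothetical peak coordinates
pr_peaks = [
    ("chr1", 100, 300),
    ("chr1", 1000, 1200),
    ("chr2", 500, 700),
    ("chr3", 50, 150),
]
grhl2_peaks = [("chr1", 250, 400), ("chr2", 700, 900)]

fraction, shared = fraction_overlapping(pr_peaks, grhl2_peaks)
print(f"{fraction:.0%} of PR peaks overlap a GRHL2 peak: {shared}")
```

      Note that the fraction depends on which peak set is used as the denominator. Here it is expressed relative to the PR peaks, as in the statement above.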

      In our original manuscript, we performed genomic annotation of the GRHL2, PR, and GRHL2/PR overlapping peaks (Figure 2E) and found that most of these sites were located in intergenic regions, where enhancers are typically found, with ~20% located in promoter regions. We appreciate the reviewer's suggestion to further overlap the ChIP-seq peaks with histone marks such as H3K27ac and H3K4me1. We have now incorporated publicly available ATAC-seq and H3K27ac ChIP datasets in our revised manuscript (as also suggested by Reviewer 1) and find that shared GRHL2/PR sites are indeed located in active enhancer regions marked by H3K27ac (see Figure 2F). Additionally, as expected, we find that GRHL2/PR overlapping sites are enriched at open chromatin (Figure 2G).

      R3.3 What is the biological function analysis by KEGG or GO analysis for the overlapping genes from VN plots of RNA-seq with CUT-TAG peaks. The genes co-regulated by GRH2L and PR are further determined.

      For us, it is not entirely clear what reviewer 3 is asking here, but we can explain the following: as it is challenging to integrate HiChIP with CUT&RUN, due to the fundamentally different nature of the two techniques, we chose not to directly assign genes to CUT&RUN peaks. However, we did carefully link the GRHL2, PR, and GRHL2/PR ChIP-seq peaks to their target genes by integrating chromatin looping data from a PR HiChIP analysis. The result from this analysis is depicted in Figure 4B.

      As suggested by this reviewer, we also performed a GO-term analysis on the 79 genes that are regulated by both GRHL2 and PR (we now have 79 genes after the re-analysis as suggested in R1.6). The corresponding results are provided for the reviewer in figure 4 of this rebuttal (below). As this additional analysis does not provide further biological insight beyond what is already presented in Figure 4C, we decided not to include this figure in the manuscript.

      Figure 4 for reviewer: GO-term analysis on the 79 GRHL2-PR co-regulated genes that are transcriptionally regulated by GRHL2 and PR and that also harbor a PR HiChIP loop anchor at or near their TSS

      R3.4 Western blotting should be performed to determine the protein levels of downstream genes co-regulated genes by GRH2L and PR in the absence or presence of R5020.

      We agree that determining the response of co-regulated genes is important. Therefore, in Figure 4D, we present three representative examples of genes that are directly co-regulated by GRHL2 and PR, specifically genes that are differentially expressed after 4 hours of R5020 exposure. Although protein levels of the targets are of functional importance, GRHL2 and PR are transcription factors whose immediate effects are primarily exerted at the level of gene transcription. Therefore, in our opinion, changes in mRNA abundance provide the most direct and mechanistically relevant readout of their regulatory activity.

      R3.5 The author mentioned that this study positions that GRHL2 acts as a crucial modulator of steroid hormone receptor function, while the authors do not provide the evidences that how does GRHL2 regulate PR-mediated transactivation, and how about these two proteins subcellular distribution in breast cancer cells.

      We agree that while our RNA-seq data demonstrate that GRHL2 modulates the expression of PR target genes, and our CUT&RUN experiments show that GRHL2 chromatin binding is reshaped upon R5020 exposure, we have not yet further dissected the molecular mechanism by which GRHL2 functions as a PR co-regulator.

      As also mentioned in our response to R1.2, we did consider several follow-up experiments to address this, including PR CUT&RUN in GRHL2 knockdown cells, CUT&RUN for known co-activators such as KMT2C/D and P300, as well as functional studies involving GRHL2 TAD and DBD mutants. However, due to technical and logistical challenges, we were unable to carry out these experiments within the timeframe of this study.

      That said, we fully recognize that such approaches would provide deeper mechanistic insight into the interplay between PR and GRHL2. We have therefore explicitly acknowledged this limitation in our limitations of the study section (lines 502-507) and consider it an important avenue for future investigation.

      Regarding the subcellular distribution in breast cancer cells: As also mentioned in our response to R2.12, GRHL2 is well established as a nuclear protein (Zeng et al., Cancers 2024; Riethdorf et al., International Journal of Cancer 2015), and although PR is classically described as translocating to the nucleus upon hormone stimulation, several studies, including our own, have shown that PR is continuously present in the nucleus (Aarts et al., J Mammary Gland Biol Neoplasia 2023; Frigo et al., Essays Biochem. 2021). Thus, both proteins mostly reside in the nucleus in breast (cancer) cells both in the absence and presence of hormone stimulation, but dynamic subcellular shuttling is likely to occur.

      Minor comments: Please describe in more detail the relationship between PR and GRHL2 binding independent of the hormone in the discussion section.

      As also mentioned in our response to R2.12, we have expanded the discussion to include a dedicated section addressing this point (lines 376-388).

      Reviewer #3 (Significance (Required)):

      Advance: Compared with existing published knowledge, the study fills a gap. The authors provide RNA-seq and CUT-TAG sequence analysis to show the recruitment of GRHL2 and PR and the co-regulated genes in the absence or presence of progesterone.

      Audience: breast surgeons will be interested, and the audience will span clinical and basic research.

      My expertise is focused on the epigenetic modulation of steroid hormone receptors in the related cancers, such as breast cancer, prostate cancer, and endometrial carcinoma.

      We agree that with our implemented changes, this study should reach (and appeal to) an audience interested in transcriptional regulation, chromatin biology, hormone receptor signaling and breast cancer.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      The medicinal leech preparation is an amenable system in which to understand how the underlying cellular networks for locomotion function. A previously identified non-spiking neuron (NS) was studied and found to alter the mean firing frequency of a crawl-related motoneuron (DE-3), which fires during the contraction phase of crawling. The data are mostly solid. Identifying upstream neurons responsible for crawl motor patterning is essential for understanding how rhythmic behavior is controlled.

      Review of Revision: 

      On a positive note, the rationale for the study is clearer to me now after reading the authors' responses to both reviewers, but that information, as described in the authors' responses, is minimally incorporated into the current revised paper. Incorporating a discussion of previous work on the NS cell has, indeed, improved the paper. 

      I suggested earlier that the paper be edited for clarity but not much text has been changed since the first draft. I will provide an example of the types of sentences that are confusing. The title of the paper is: "Phase-specific premotor inhibition modulates leech rhythmic motor output". Are the authors referring to the inhibition created by premotor neurons (e.g., onto the motoneurons) or the inhibition that the premotor neurons receive? 

      In this case, this is an interesting ambiguity: NS is inhibited and that inhibition is directly transmitted to the motoneurons because both cells are electrically coupled.  We believe that the title does not disguise the findings conveyed by the manuscript.

      I also find the paper still confusing with regard to the suggested "functional homology" with the vertebrate Renshaw cells. When the authors set up this expectation of homology (should be analogy) in the introduction and other sections of the paper, one would assume that the NS cell would be directly receiving excitation from a motoneuron (like DE-3) and, in turn, the motoneuron would then receive some sort of inhibitory input to regulate its firing frequency. Essentially, I have always viewed the Renshaw cells as nature's clever way to monitor the ongoing activity of a motoneuron while also providing recurrent feedback or "recurrent inhibition" to modify that cell's excitatory state. The authors present their initial idea below on line 62. Authors write: "These neurons are present as bilateral pairs in each segmental ganglion and are functional homologs of the mammalian Renshaw cells (Szczupak, 2014). These spinal cord cells receive excitatory inputs from motoneurons and, in turn, transmit inhibitory signals to the motoneurons (Alvarez and Fyffe, 2007)." 

      We agree with Reviewer #2: the correct term is "analogous," not "homologous." Thanks for pointing this out. We changed the term throughout the text.

      The Reviewer is also right in their appreciation of the role of Renshaw cells. NS plays exactly the role that the Reviewer describes. The ONLY difference is that NS is inhibited by the motoneurons, and in turn transmits this inhibition to the motoneurons via the rectifying electrical junctions. To address the confusion that our description caused, we have modified the cited sentence accordingly (now lines 65-67).

      Minor note:

      I suggest re-writing this last sentence as "these" is confusing. Change to: 'In the spinal cord, Renshaw interneurons receive excitatory inputs from motoneurons and, in turn, transmit inhibitory signals to them (Alvarez and Fyffe, 2007).' 

      Please see the changes mentioned above.

      Furthermore, the authors note that (line 69 on): "In the context of this circuit the activity of excitatory motoneurons evokes chemically mediated inhibitory synaptic potentials in NS. Additionally, the NS neurons are electrically coupled......In physiological conditions this coupling favors the transmission of inhibitory signals from NS to motoneurons." Based on what is being conveyed here, I see a disconnect with the "functional homology" being presented earlier. I may be missing something, but the Renshaw analogy seems to be quite different compared to what looks like reciprocal inhibition in the leech. If the authors want to make the analogy to Renshaw cells clearer, then they should make a simple ball and stick diagram of the leech system and visually compare it to the Renshaw/motoneuron circuit with regard to functionality. This simple addition would help many readers. 

      We have simplified the description regarding the Renshaw cell (lines 65-67) to avoid the “details” of the connectivity between the two circuits.

      This report focuses on NS neurons and their role in crawling; we mention the analogy with Renshaw cells to widen the interest of the results. We do not think that making a special diagram to compare how the two neurons play a similar role via different connections among the players is useful in the context of this manuscript.

      The Abstract, Authors write (line 19), "Specifically, we analyzed how electrophysiological manipulation of a premotor nonspiking (NS) neuron, that forms a recurrent inhibitory circuit (homologous to vertebrate Renshaw cells)...."

      First, a circuit would not be homologous to a cell, and the term homology implies a strict developmental/evolutionary commonality. At best, I would use the term functionally analogous but even then I am still not sure that they are functionally that similar (see comments above). 

      Reviewer #2 is right. We changed the sentence in line 20.

      Line 22: "The study included a quantitative analysis of motor units active throughout the fictive crawling cycle that shows that the rhythmic motor output in isolated ganglia mirrors the phase relationships observed in vivo." This sentence must be revised to indicate that not all of the extracellular units were demonstrated to be motor units. Revise to: 'The study included a quantitative analysis of identified and putative motor units active throughout the fictive crawling cycle that shows.....' 

      Line 187 regarding identifying units as motoneurons: Authors write, "While multiple extracellular recordings have been performed previously (Eisenhart et al., 2000), these results (Figure 4) present the first quantitative analysis of motor units activated throughout the crawling cycle in this type of recordings." The authors cannot assume that the units in the recorded nerves belong only to motoneurons. Based on their first rebuttal, the authors seem to be reluctant to accept the idea that the extracellularly recorded units might represent a different class of neurons. They admit that some sensory neurons (with somata located centrally) do, indeed, travel out the same nerves recorded, but go on to explain why they would not be active. 

      The leech has a variety of sensory organs that are located in the periphery, and some of these sensory neurons do show rhythmic activity correlated with locomotor activity (see Blackshaw's early work). The numerous stretch receptors, in fact, have very large axons that pass through all the nerves recorded in the current paper. 

      In Fig. 4, it is interesting that the waveforms of all the units recorded in the PP nerve exhibit a reversal in waveform as compared to those in the DP nerve, which might indicate (based on bipolar differential recording) that the units in the PP nerve are being propagated in the opposite direction (i.e., are perhaps afferent). Rhythmic presynaptic inhibition and excitation is commonly seen for stretch receptors within the CNS (see the work of Burrows) and many such cells are under modulatory control. 

      Most likely, the majority of the units are from motoneurons, but we do not really know at this point. The authors should reframe their statements throughout the paper as: 'While multiple extracellular recordings have been performed previously (Eisenhart et al., 2000), these results (Figure 4) present the first quantitative analysis of multiple extracellular units, using spike sorting methods, which are activated throughout the crawling cycle.' In cases where the identity of the unit is known, then it is fine to state that, but when the identity of the unit is not known, then there should be some qualification and stated as 'putative motor units' 

      We understand the concern of Reviewer #2 regarding the type of neurons active during dopamine-induced crawling in isolated ganglia. However, we believe there is sufficient evidence to support that the recorded spikes originate from motoneurons. As readers may share the same concern, we have added a paragraph explaining why spikes from somatic sensory neurons such as P or T cells, or from stretch receptors, are unlikely to contribute (lines 206-214). We included the term putative in the abstract.

      The Methods section:

      Needs to include the full parameters that were used to assess whether bursting activity was qualified in ways to be considered crawling activity or not. Typically, crawl-like burst periods of no more than 25 seconds have been the limit for their qualification as crawling activity. In Fig 2F, for example, the inter-burst period is over 35 seconds; that coupled with an average 5 second burst duration would bring the burst period to 40 seconds, which is substantially out of range for there to be bursting relevant to crawl activity. Simply put, long DE-3 burst periods are often observed but may not be indicative of a crawl state as the CV motoneurons are no longer out of phase with DE-3. A number of papers have adopted this criterion. 

      We now indicate in the methods the range of period values measured in our experiments. For the Reviewer's information, we show here histograms depicting the variability of period and duty cycle values recorded in our experiments (control conditions). The Reviewer can see that the bursting activity of DE-3 falls within what has been published.

      Author response image 1.

      Crawling in isolated ganglia. A. Histogram of end-to-end periods during crawling in isolated ganglia. The dotted line indicates the mean obtained from the averages of all experiments. The solid black line represents the mean of all cycles across all experiments. B. As in A, for the duty cycle calculated using end-to-end periods. (n = 210 cycles from 45 ganglia obtained from 32 leeches in all cases.)

      Reviewer #1 (Recommendations for the authors): 

      Minor comments-

      Line 100: "In the frame of the recurrent inhibitory circuit, NS is the target of inhibitory signals". Suggestion: 'Within the framework of the recurrent inhibitory circuit, NS is the target of inhibitory signals.' 

      Changed as suggested (line 107).

      Line 163: "This series of experiments proves that, as predicted based on the known circuit (Figure 1C), inhibitory signals onto NS premotor neurons were transmitted to DE-3 motoneurons and counteracted their excitatory drive during crawling, limiting their firing frequency". I think this sentence is too strong plus needs some editing. Suggestion: 'As predicted based on the known circuit (Figure 1C), this series of experiments indicates that inhibitory signals onto NS premotor neurons are transmitted to DE-3 motoneurons, thus limiting their firing frequency and counteracting their excitatory drive during crawling.'

      Changed as suggested.

      Lines 86, 292 and 304 and Fig 4 legend: "Different from DE-3, In-Phase units showed a marked decrease in the maximum bFF along time." Suggestion: Replace the word "along" with 'across' time. Also replace those words in the Fig 4 legend and Line 80...."along" (replace with 'across') the different stages of crawling. 

      Changed as suggested.

      Line 311: "bursts and a concurrent inhibitory input via NS (Figure 7). Coherent with this interpretation, the activity level of the Anti-Phase units was not influenced by these inhibitory signals". Suggestion: Replace the word "coherent" with 'consistent'. 

      Changed as suggested.

      Line 332: "...offer the particular advantage of allowing electrical manipulation of individual neurons in wildtype adults," I am unsure what the authors are attempting to convey. Not sure what they mean by "wildtype" in this context and why that would matter. 

      The term “wildtype” was removed.

      We thank Reviewer #2 for the suggested edits to the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study advances the lab's growing body of evidence exploring higher-order learning and its neural mechanisms. They recently found that NMDA receptor activity in the perirhinal cortex was necessary for integrating stimulus-stimulus associations with stimulus-shock associations (mediated learning) to produce preconditioned fear, but it was not necessary for forming stimulus-shock associations. On the other hand, basolateral amygdala NMDA receptor activity is required for forming stimulus-shock memories. Based on these facts, the authors assessed: (1) why the perirhinal cortex is necessary for mediated learning but not direct fear learning, and (2) the determinants of perirhinal cortex versus basolateral amygdala necessity for forming direct versus indirect fear memories. The authors used standard sensory preconditioning and variants designed to manipulate the novelty and temporal relationship between stimuli and shock and, therefore, the attentional state under which associative information might be processed. Under experimental conditions where information would presumably be processed primarily in the periphery of attention (temporal distance between stimulus/shock or stimulus pre-exposure), perirhinal cortex NMDA receptor activation was required for learning indirect associations. On the other hand, when information would likely be processed in focal attention (novel stimulus contiguous with shock), basolateral amygdala NMDA activity was required for learning direct associations. Together, the findings indicate that the perirhinal cortex and basolateral amygdala subserve peripheral and focal attention, respectively. The authors provide support for their conclusions using careful, hypothesis-driven experimental design, rigorous methods, and integrating their findings with the relevant literature on learning theory, information processing, and neurobiology. Therefore, this work will be highly interesting to several fields.

      Strengths:

      (1) The experiments were carefully constructed and designed to test hypotheses that were rooted in the lab's previous work, in addition to established learning theory and information processing background literature.

      (2) There are clear predictions and alternative outcomes. The provided table does an excellent job of condensing and enhancing the readability of a large amount of data.

      (3) In a broad sense, attention states are a component of nearly every behavioral experiment. Therefore, identifying their engagement by dissociable brain areas and under different learning conditions is an important area of research.

      (4) The authors clearly note where they replicated their own findings, report full statistical measures, effect sizes, and confidence intervals, indicating the level of scientific rigor.

      (5) The findings raise questions for future experiments that will further test the authors' hypotheses; this is well discussed.

      Weaknesses:

      As a reader, it is difficult to interpret how first-order fear could be impaired while preconditioned fear is intact; it requires a bit of "reading between the lines".

      We appreciate the Reviewer’s point and have attempted to address it on lines 55-63 of the revised paper: “In a recent pair of studies, we extended these findings in two ways. First, we showed that S1 does not just form an association with shock in stage 2; it also mediates an association between S2 and the shock. Thus, S2 enters testing in stage 3 already conditioned, able to elicit fear responses (Wong et al., 2019). Second, we showed that this mediated S2-shock association requires NMDAR-activation in the PRh, as well as communication between the PRh and BLA (Wong et al., 2025). These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      Reviewer #2 (Public review):

      Summary:

      This paper continues the authors' research on the roles of the basolateral amygdala (BLA) and the perirhinal cortex (PRh) in sensory preconditioning (SPC) and second-order conditioning (SOC). In this manuscript, the authors explore how prior exposure to stimuli may influence which regions are necessary for conditioning to the second-order cue (S2). The authors perform a series of experiments which first confirm prior results shown by the authors - that NMDA receptors in the PRh are necessary in SPC during conditioning of the first-order cue (S1) with shock to allow for freezing to S2 at test; and that NMDA receptors in the BLA are necessary for S1 conditioning during the S1-shock pairings. The authors then set out to test the hypothesis that the PRh encodes associations in a peripheral state of attention, whereas the BLA encodes associations in a focal state of attention, similar to the A1 and A2 states in Wagner's theory of SOP. To do this, they show that BLA is necessary for conditioning to S2 when the S2 is first exposed during a serial compound procedure - S2-S1-shock. To determine whether pre-exposure of S2 will shift S2 to a peripheral focal state, the authors run a design in which S2-S1 presentations are given prior to the serial compound phase. The authors show that this restores NMDA receptor activity within the PRh as necessary for the fear response to S2 at test. They then test whether the presence of S1 during the serial compound conditioning allows the PRh to support the fear responses to S2 by introducing a delay conditioning paradigm in which S1 is no longer present. The authors find that PRh is no longer required and suggest that this is due to S2 remaining in the primary focal state.

      Strengths:

      As with their earlier work, the authors have performed a rigorous series of experiments to better understand the roles of the BLA and PRh in the learning of first- and second-order stimuli. The experiments are well-designed and clearly presented, and the results show definitive differences in functionality between the PRh and BLA. The first experiment confirms earlier findings from the lab (and others), and the authors then build on their previous work to more deeply reveal how these regions differ in how they encode associations between stimuli. The authors have done a commendable job of pursuing these questions.

      Table 1 is an excellent way to highlight the results and provide the reader with a quick look-up table of the findings.

      Weaknesses:

      The authors have attempted to resolve the question of the roles of the PRh and BLA in SPC and SOC, which the authors have explored in previous papers. Laudably, the authors have produced substantial results indicating how these two regions function in the learning of first- and second-order cues, providing an opportunity to narrow in on possible theories for their functionality. Yet the authors have framed this experiment in terms of an attentional framework and have argued that the results support this particular framework and hypothesis - that the PRh encodes peripheral and the BLA encodes focal states of learning. This certainly seems like a viable and exciting hypothesis, yet I don't see why the results have been completely framed and interpreted this way. It seems to me that there are still some alternative interpretations that are plausible and should be included in the paper.

      We appreciate the Reviewer’s point and have attempted to address it on lines 566-594 of the Discussion: “An additional point to consider in relation to Experiments 3A, 3B, 4A and 4B is the level of surprise that rats experienced following presentations of the familiar S2 in stage 2. Specifically, in Experiments 3A and 3B, S2 was followed by the expected S1 (low surprise) and its conditioning required activation of NMDA receptors in the PRh and not the BLA. By contrast, in Experiments 4A and 4B, S2 was followed by omission of the expected S1 (high surprise) and its conditioning required activation of NMDA receptors in the BLA and not the PRh. This raises the possibility that surprise, or prediction error, also influences the way that S2 is processed in focal and peripheral states of attention. When prediction error is low, S2 is processed in the peripheral state of attention: hence, learning under these circumstances requires NMDA receptor activation in the PRh and not the BLA. By contrast, when prediction error is high, S2 is preserved in the focal state of attention: hence, learning under these circumstances requires NMDA receptor activation in the BLA and not the PRh. The impact of prediction error on the processing of S2 could be assessed using two types of designs. In the first design, rats are pre-exposed to S2-S1 pairings in stage 1 and this is followed by S2-S3-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is followed by surprise in omission of S1 and presentation of S3. Thus, if a large prediction error maintains processing of the familiar S2 in the BLA, we might expect that its conditioning in this design would require NMDA receptor activation in the BLA (in contrast to the results of Experiment 3B) and no longer require NMDA receptor activation in the PRh (in contrast to the results of Experiment 3A). 
      In the second design, rats are pre-exposed to S2 alone in stage 1 and this is followed by S2-[trace]-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is not followed by the surprising omission of any stimulus. Thus, if a small prediction error shifts processing of the familiar S2 to the PRh, we might expect that its conditioning in this design would no longer require NMDA receptor activation in the BLA (in contrast to the results of Experiment 4B) but, instead, require NMDA receptor activation in the PRh (in contrast to the results of Experiment 4A). Future studies will use both designs to determine whether prediction error influences the processing of S2 in the focus versus periphery of attention and, thereby, whether learning about this stimulus requires NMDA receptor activation in the BLA or PRh.”

      Reviewer #3 (Public review):

      Summary:

      This manuscript presents a series of experiments that further investigate the roles of the BLA and PRh in sensory preconditioning, with a particular focus on understanding their differential involvement in the association of S1 and S2 with shock.

      Strengths:

      The motivation for the study is clearly articulated, and the experimental designs are thoughtfully constructed. I especially appreciate the inclusion of Table 1, which makes the designs easy to follow. The results are clearly presented, and the statistical analyses are rigorous. My comments below mainly concern areas where the writing could be improved to help readers more easily grasp the logic behind the experiments.

      Weaknesses:

      (1) Lines 56-58: The two previous findings should be more clearly summarized. Specifically, it's unclear whether the "mediated S2-shock" association occurred during Stage 2 or Stage 3. I assume the authors mean Stage 2, but Stage 2 alone would not yet involve "fear of S2," making this expression a bit confusing.

      We apologise for the confusion and have revised the summary of our previous findings on lines 55-63. The revised text now states: “In a recent pair of studies, we extended these findings in two ways. First, we showed that S1 does not just form an association with shock in stage 2; it also mediates an association between S2 and the shock. Thus, S2 enters testing in stage 3 already conditioned, able to elicit fear responses (Wong et al., 2019). Second, we showed that this mediated S2-shock association requires NMDAR-activation in the PRh, as well as communication between the PRh and BLA (Wong et al., 2025). These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      (2) Line 61: The phrase "Pavlovian fear conditioning" is ambiguous in this context. I assume it refers to S1-shock or S2-shock conditioning. If so, it would be clearer to state this explicitly.

      Apologies for the ambiguity. We have omitted the term “Pavlovian”, which may have been the source of confusion. The revised text on lines 60-63 now states: “These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      (3) Regarding the distinction between having or not having Stage 1 S2-S1 pairings, is "novel vs. familiar" the most accurate way to frame this? This terminology could be misleading, especially since one might wonder why S2 couldn't just be presented alone on Stage 1 if novelty is the critical factor. Would "outcome relevance" or "predictability" be more appropriate descriptors? If the authors choose to retain the "novel vs. familiar" framing, I suggest providing a clear explanation of this rationale before introducing the predictions around Line 118.

      We have incorporated the suggestion regarding “predictability” while also retaining “novelty” as follows. 

      L76-85: “For example, different types of arrangements may influence the substrates of conditioning to S2 by influencing its novelty and/or its predictive value at the time of the shock, on the supposition that familiar stimuli are processed in the periphery of attention and, thereby, the PRh (Bogacz & Brown, 2003; Brown & Banks, 2015; Brown & Bashir, 2002; Martin et al., 2013; McClelland et al., 2014; Morillas et al., 2017; Murray & Wise, 2012; Robinson et al., 2010; Suzuki & Naya, 2014; Voss et al., 2009; Yang et al., 2023) whereas novel stimuli are processed in the focus of attention and, thereby, the amygdala (Holmes et al., 2018; Qureshi et al., 2023; Roozendaal et al., 2006; Rutishauser et al., 2006; Schomaker & Meeter, 2015; Wright et al., 2003).”

      L116-120: “Subsequent experiments then used variations of this protocol to examine whether the engagement of NMDAR in the PRh or BLA for Pavlovian fear conditioning is influenced by the novelty/predictive value of the stimuli at the time of the shock (second implication of theory) as well as their distance or separation from the shock (third implication of theory; Table 1).”

      (4) Line 121: This statement should refer to S1, not S2.

      (5) Line 124: This one should refer to S2, not S1.

      We have checked the text on these lines for errors and confirmed that the statements are correct. The lines encompassing this text (L121-130) are reproduced here for convenience:

      (1) When rats are exposed to novel S2-S1-shock sequences, conditioning of S2 and S1 will be disrupted by a DAP5 infusion into the BLA but not into the PRh (Experiments 2A and 2B);

      (2) When rats are exposed to S2-S1 pairings and then to S2-S1-shock sequences, conditioning of S2 will be disrupted by a DAP5 infusion into the PRh but not the BLA whereas conditioning of S1 will be disrupted by a DAP5 infusion into the BLA but not the PRh (Experiments 3A and 3B);

      (3) When rats are exposed to S2-S1 pairings and then to S2 (trace)-shock pairings, conditioning of S2 will be disrupted by a DAP5 infusion into the BLA but not the PRh (Experiments 4A and 4B).

      (6) Additionally, the rationale for Experiment 4 is not introduced before the Results section. While it is understandable that Experiment 4 functions as a follow-up to Experiment 3, it would be helpful to briefly explain the reasoning behind its inclusion.

      Experiment 4 follows from the results obtained in Experiment 3; and, as noted, the reasoning for its inclusion is provided locally in its introduction. We attempted to flag this experiment earlier in the general introduction to the paper; but this came at the cost of clarity to the overall story. As such, our revised paper retains the local introduction to this experiment. It is reproduced here for convenience:

      “In Experiments 3A and 3B, conditioning of the pre-exposed S1 required NMDAR-activation in the BLA and not the PRh; whereas conditioning of the pre-exposed S2 required NMDAR-activation in the PRh and not the BLA. We attributed these findings to the fact that the pre-exposed S2 was separated from the shock by S1 during conditioning of the S2-S1-shock sequences in stage 2: hence, at the time of the shock, S2 was no longer processed in the focal state of attention supported by the BLA; instead, it was processed in the peripheral state of attention supported by the PRh.

      “Experiments 4A and 4B employed a modification of the protocol used in Experiments 3A and 3B to examine whether a pre-exposed S1 influences the processing of a pre-exposed S2 across conditioning with S2-S1-shock sequences. The design of these experiments is shown in Figure 4A. Briefly, in each experiment, two groups of rats were exposed to a session of S2-S1 pairings in stage 1 and, 24 hours later, a session of S2-[trace]-shock pairings in stage 2, where the duration of the trace interval was equivalent to that of S1 in the preceding experiments. Immediately prior to the trace conditioning session in stage 2, one group in each experiment received an infusion of DAP5 or vehicle only into either the PRh (Experiment 4A) or BLA (Experiment 4B). Finally, all rats were tested with presentations of the S2 alone in stage 3. If the substrates of conditioning to S2 are determined only by the amount of time between presentations of this stimulus and foot shock in stage 2, the results obtained in Experiments 4A and 4B should be the same as those obtained in Experiments 3A and 3B: acquisition of freezing to S2 will require activation of NMDARs in the PRh and not the BLA. If, however, the presence of S1 in the preceding experiments (Experiments 3A and 3B) accelerated the rate at which processing of S2 transitioned from the focus of attention to its periphery, the results obtained in Experiments 4A and 4B will differ from those obtained in Experiments 3A and 3B. That is, in contrast to the preceding experiments where acquisition of freezing to S2 required NMDAR-activation in the PRh and not the BLA, here acquisition of freezing to S2 should require NMDAR-activation in the BLA but not the PRh.”

      Reviewer #1 (Recommendations for the authors):

      I greatly enjoyed reading and reviewing this manuscript, and so I only have boilerplate recommendations.

      (1) I might add a couple of sentences discussing how/why preconditioned fear could be intact while first-order fear is impaired. Of course, if I am interpreting the provided interpretation correctly, the reason is that peripheral processing is still intact even when BLA NMDA receptors are blocked, and so mediated conditioning still occurs. Does this mean that mediated conditioning does not require learning the first-order relationship, and that they occur in parallel? Perhaps I just missed this, but I cannot help but wonder whether/how the psychological processes at play might change when first-order learning is impaired, so this would be greatly appreciated.

      As noted above, we have revised the general introduction (around lines 55-59) to clarify that the direct S1-shock and mediated S2-shock associations form in parallel. Hence, manipulations that disrupt first-order fear to the S1 (such as a BLA infusion of the NMDA receptor antagonist, DAP5) do not automatically disrupt the expression of sensory preconditioned fear to the S2.

      (2) Adding to the above - does the SOP or another theory predict serial vs parallel information flow from focal state to peripheral, or perhaps it is both to some extent?

      SOP predicts both serial and parallel processing of information in its focal and peripheral states. That is, some proportion of the elements that comprise a stimulus may decay from the focal state of attention to the periphery (serial processing); hence, at any given moment, the elements that comprise a stimulus can be represented in both focal and peripheral states (parallel processing).
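
The serial and parallel aspects described above can be illustrated with a toy version of SOP's state dynamics. This is a minimal sketch, not the authors' model: it assumes the standard three-state scheme (inactive, A1 as the focal state, A2 as the peripheral state), and the transition probabilities `p1`, `pd1` and `pd2` are hypothetical values chosen only for illustration.

```python
def sop_step(state, stimulus_on, p1=0.6, pd1=0.1, pd2=0.02):
    """Advance the proportions of stimulus elements by one time step.

    state is (inactive, a1, a2). All parameter values are illustrative assumptions.
    """
    inactive, a1, a2 = state
    to_a1 = p1 * inactive if stimulus_on else 0.0  # inactive -> A1 only while the stimulus is on
    to_a2 = pd1 * a1                               # serial decay: focal (A1) -> peripheral (A2)
    to_inactive = pd2 * a2                         # peripheral (A2) -> inactive
    return (inactive - to_a1 + to_inactive,
            a1 + to_a1 - to_a2,
            a2 + to_a2 - to_inactive)


def simulate_sop(on_steps, total_steps):
    """Return the state trajectory for a stimulus present during the first on_steps steps."""
    state = (1.0, 0.0, 0.0)
    trajectory = [state]
    for t in range(total_steps):
        state = sop_step(state, stimulus_on=(t < on_steps))
        trajectory.append(state)
    return trajectory


trajectory = simulate_sop(on_steps=10, total_steps=40)
```

While the stimulus is on, elements occupy A1 and A2 at the same time (parallel representation), and each element that reaches A2 got there by decaying from A1 (serial transition).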

      Given the nature of the designs and tools used in the present study (between-subject assessment of a DAP5 effect in the BLA or PRh), we selected parameters that would maximize the processing of the S2 and S1 stimuli in one or the other state of activation; hence the results of the present study. We are currently examining the joint processing of stimulus elements across focal and peripheral states using simultaneous recordings of activity in the BLA and PRh. These recordings are collected from rats trained in the different stages of a within-subject sensory preconditioning protocol. The present study created the basis for this work, which will be published separately in due course.

      (3) The organization of PRh vs BLA is nice and consistent across each figure, but I would suggest adding any kind of additional demarcation beyond the colors and text, maybe just more space between AB / CD. The figure text indicating PRh/BLA is a bit small.

      Thank you for the suggestion – we have added more space between the top and bottom panels of the figure.

      (4) Line 496 typo ..."in the BLA but not the BLA".

      Apologies for the typo - this has been corrected.

      Reviewer #2 (Recommendations for the authors):

      I found the experiments to be extremely well-designed and the results convincing and exciting. The hypothesis of the focal and peripheral states of attention being encoded by BLA and PRh respectively, is enticing, yet as indicated in the public review, this does not seem to be the only possible interpretation. This is my only serious comment for the authors.

      (1) I think it would be worth reframing the article slightly to give credence to alternative hypotheses. Not to say that the authors' intriguing hypothesis shouldn't be an integral part of the introduction, but no alternatives are mentioned. In experiment 2, could the fact that S2 is already a predictor of S1 not block new learning to S2? In the framework of stimulus-stimulus associations, there would be no surprise in the serial-compound stage of conditioning at the onset of S1. This may prevent direct learning of the S2-shock association within the BLA. This type of association may as well fall under the peripheral/focal theory, but I don't think it's necessary to frame this possibility in terms of a peripheral/focal theory. To build on this alternative interpretation, the absence of S1 in experiment 4 may induce a prediction error (S2 predicts S1, but it's omitted), which could support learning by S2. The peripheral and focal states appear to correspond to A2 and A1 in SOP extremely well, and I think it would potentially add interest and support. If the authors do intend to make the paper a strong argument for their hypothesis, perhaps a few additional experiments may be introduced. If the novelty of S2 is critical for S2 not to be processed in a focal state during the serial compound stage, could pre-exposure of S2 alone allow for dependence of S2-shock on the PRh? Assuming this is what the authors would predict, this might disentangle the S-S theory mentioned above from the peripheral/focal theory. Or perhaps run an experiment S2-X in stage 1 and S2-S1-shock in stage 2? This said, I think the experiments are more than sufficient for an exciting paper as is, and I don't think running additional experiments is necessary. I would only argue for this if the authors make a hard claim about the peripheral/focal theory, as is the case for the way the paper is currently written.

      We appreciate the reviewer’s excellent point and suggestions. We have included an additional paragraph in the Discussion on page 24 (lines 566-594).  “An additional point to consider in relation to Experiments 3A, 3B, 4A and 4B is the level of surprise that rats experienced following presentations of the familiar S2 in stage 2. Specifically, in Experiments 3A and 3B, S2 was followed by the expected S1 (low surprise) and its conditioning required activation of NMDA receptors in the PRh and not the BLA. By contrast, in Experiments 4A and 4B, S2 was followed by omission of the expected S1 (high surprise) and its conditioning required activation of NMDA receptors in the BLA and not the PRh. This raises the possibility that surprise, or prediction error, also influences the way that S2 is processed in focal and peripheral states of attention. When prediction error is low, S2 is processed in the peripheral state of attention: hence, learning under these circumstances requires NMDA receptor activation in the PRh and not the BLA. By contrast, when prediction error is high, S2 is preserved in the focal state of attention: hence, learning under these circumstances requires NMDA receptor activation in the BLA and not the PRh. The impact of prediction error on the processing of S2 could be assessed using two types of designs. In the first design, rats are pre-exposed to S2-S1 pairings in stage 1 and this is followed by S2-S3-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is followed by surprise in omission of S1 and presentation of S3. Thus, if a large prediction error maintains processing of the familiar S2 in the BLA, we might expect that its conditioning in this design would require NMDA receptor activation in the BLA (in contrast to the results of Experiment 3B) and no longer require NMDA receptor activation in the PRh (in contrast to the results of Experiment 3A). 
In the second design, rats are pre-exposed to S2 alone in stage 1 and this is followed by S2-[trace]-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is not followed by the surprising omission of any stimulus. Thus, if a small prediction error shifts processing of the familiar S2 to the PRh, we might expect that its conditioning in this design would no longer require NMDA receptor activation in the BLA (in contrast to the results of Experiment 4B) but, instead, require NMDA receptor activation in the PRh (in contrast to the results of Experiment 4A). Future studies will use both designs to determine whether prediction error influences the processing of S2 in the focus versus periphery of attention and, thereby, whether learning about this stimulus requires NMDA receptor activation in the BLA or PRh.”

      (3) I was surprised the authors didn't frame their hypothesis more in terms of Wagner's SOP model. It was only minimally mentioned in the introduction, and I think the authors' theory would benefit if it were included more in the introduction. I was wondering whether the authors may have avoided this framing to avoid an expectation for modeling SOP in their design. If this were the case, I think the paper stands on its own without modeling, and at least for myself, a comparison to SOP would not require modeling of SOP. If this was the authors' concern for avoiding it, I would suggest to the authors that they need not be concerned about it.

      We appreciate the endorsement of Wagner’s SOP theory as a nice way of framing our results. We are currently working on a paper in which we use simulations to show how Wagner’s theory can accommodate the present findings as well as others in the literature on sensory preconditioning. For this reason, we have not changed the current paper in relation to this point.

    1. Author response:

      Reviewer #1 (Public review)

      I have to preface my evaluation with a disclosure that I lack the mathematical expertise to fully assess what seems to be the authors' main theoretical contribution. I am providing this assessment to the best of my ability, but I cannot substitute for a reviewer with more advanced mathematical/physical training.

      Summary:

      This paper describes a new theoretical framework for measuring parsimony preferences in human judgments. The authors derive four metrics that they associate with parsimony (dimensionality, boundary, volume, and robustness) and measure whether human adults are sensitive to these metrics. In two tasks, adults had to choose one of two flower beds which a statistical sample was generated from, with or without explicit instruction to choose the flower bed perceptually closest to the sample. The authors conduct extensive statistical analyses showing that humans are sensitive to most of the derived quantities, even when the instructions encouraged participants to choose only based on perceptual distance. The authors complement their study with a computational neural network model that learns to make judgments about the same stimuli with feedback. They show that the computational model is sensitive to the tasks communicated by feedback and only uses the parsimony-associated metrics when feedback trains it to do so.

      Strengths:

      (1)  The paper derives and applies new mathematical quantities associated with parsimony. The mathematical rigor is very impressive and is much more extensive than in most other work in the field, where studies often adopt only one metric (such as the number of causes or parameters). These formal metrics can be very useful for the field.

      (2)  The studies are preregistered, and the statistical analyses are strong.

      (3)  The computational model complements the behavioral findings, showing that the derived quantities are not simply equivalent to maximum-likelihood inference in the task.

      (4)  The speculations in the discussion section (e.g., the idea that human sensitivity is driven by the computational demands each metric requires) are intriguing and could usefully guide future work.

      Weaknesses:

      (1) The paper is very hard to understand. Many of the key details of the derived metrics are in the appendix, with very little accessible explanation in the main text. The figures helped me understand the metrics somewhat, although I am still not sure how some of them (such as boundary or robustness as measured here) are linked to parsimony. I understand that this is addressed by the derivations in the appendix, but as a computational cognitive scientist, I would have benefited from more accessible explanations. Important aspects of the human studies are also missing from the main text, such as the sample size for Experiment 2.

      (2) It is not fully clear whether the sensitivity of human participants to some of the quantities convincingly reported here actually means that participants preferred shapes according to the corresponding aspect of parsimony. The title and framing suggest that parsimony "guides" human decision-making, which may lead readers to conclude that humans prefer more parsimonious shapes. I am not sure the sensitivity findings alone support this framing, but it might just be my misunderstanding of the analyses.

      (3) The stimulus set included only four combinations of shapes, each designed to diagnostically target one of the theoretical quantities. It is unclear whether the results are robust or specific to these particular 4 stimuli.

      (4) The study is framed as measuring "decision-making," but the task resembles statistical inference (e.g., which shape generated the data) or perceptual judgment. This is a minor point since "decision-making" is not well defined in the literature, yet the current framing in the title gave me the initial impression that humans would be making preference choices and learning about them over time with feedback.

      We are grateful for the supportive comments highlighting the rigor of our experimental design and data analysis. The Reviewer lists four points under “weaknesses”, to which we reply below. 

      (1)  The paper is very hard to understand

      In the revised version of the paper, we will expand the main text to include a more detailed and intuitive description of the terms of the Fisher Information Approximation, in particular clarifying the interpretation of robustness and boundary as parsimony. We also will include more details that are now given only in Methods, such as the sample size for the second experiment. 

      (2) Sensitivity of human participants 

      We do argue, and believe, that our data show that people tend to prefer simpler shapes. However, giving a well-posed definition of "preference" in this context turns out to be nontrivial.

      At the very least, any statement such as "people prefer shape A over B" should be qualified with something like “when the distance of the data from both shapes is the same.” In other words, one should control for goodness-of-fit. Even before making any reference to our behavioral model, this phenomenon (a preference for the simpler model when goodness of fit is matched between models) is visible in Figure 3a, where the effective decision boundary used by human participants is closer to the more complex model than the cyan line representing the locus of points with equal goodness of fit under the two models (or equivalently, with the same Euclidean distance from the two shapes). The goal of our theory and our behavioral model is precisely to systematize this sort of control, extending it beyond just goodness-of-fit and allowing us to control simultaneously for multiple features of model complexity that may affect human behavior in different ways. In other words, it allows us not only to ask whether people prefer shape A over B after controlling for the distance of the data to the shapes, but also to understand to what extent this preference is driven by important geometrical features such as dimensionality, volume, curvature, and boundaries of the shapes. More specifically, and importantly, our theory makes it possible to measure the strength of the preference, rather than merely asserting its existence. In our modeling framework, the existence of a preference for simpler shapes is captured by the fact that the estimated sensitivities to the complexity penalties are positive (and although they differ in magnitude, all are statistically reliable).
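
The idea of measuring a preference while controlling for goodness-of-fit can be sketched with a simple choice rule. This is a hypothetical illustration, not the authors' behavioral model: it assumes a logistic link, and the function name, argument names and numerical sensitivities are all invented for the example.

```python
import math


def p_choose_simpler(delta_fit, delta_penalties, beta_fit=1.0, sensitivities=None):
    """Probability of choosing the simpler model under an assumed logistic choice rule.

    delta_fit: goodness-of-fit of the simpler model minus that of the complex
        model (0 when fit is matched).
    delta_penalties: dict mapping feature name to
        penalty(complex model) - penalty(simpler model).
    sensitivities: dict of weights; a positive weight favours the simpler model.
    """
    if sensitivities is None:
        sensitivities = {name: 1.0 for name in delta_penalties}
    evidence = beta_fit * delta_fit
    for name, delta in delta_penalties.items():
        evidence += sensitivities[name] * delta
    return 1.0 / (1.0 + math.exp(-evidence))


# Goodness-of-fit matched (delta_fit = 0); illustrative penalty differences.
matched_fit = p_choose_simpler(
    0.0,
    {"dimensionality": 0.5, "volume": 0.2},
    sensitivities={"dimensionality": 1.0, "volume": 0.3},
)
```

Under this assumed rule, positive sensitivities push the choice probability above 0.5 when fit is matched, and their magnitudes quantify how strongly each feature contributes.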

      (3) Generalization to different shapes  

      Thank you for bringing up this important topic. First, note that while dimensionality and volume are global properties of models and only take two possible values in our human tasks, the boundary and robustness penalties depend on the model and on the data and therefore assume a continuum of values through the tasks (note also that the boundary penalty is relevant for all task types, not just the one designed specifically to study it, because all models except the zero-dimensional dot have boundaries). Therefore, our experimental setting is less restrictive than it may seem, because it explores a range of possible values for two of the four model features. However, we agree that it would be interesting to repeat our experiment with a broader range of models, perhaps allowing their dimensionality and volume to vary more. In the same spirit, it would be interesting to study the dependence of human behavior on the amount of available data. We believe that these are all excellent ideas for further study that exceed the scope of the present paper. We will include these important points in a revised Discussion.

      (4) Usage of “decision making” vs “perceptual judgment”

      Thank you. We will clarify better in the text that our usage of “decision making” overlaps with the idea of a perceptual judgment and that our experiments do not tackle sequential aspects of repeated decisions. 

      Reviewer #2 (Public review):

      This manuscript presents a sophisticated investigation into the computational mechanisms underlying human decision-making, and it presents evidence for a preference for simpler explanations (Occam's razor). The authors dissect the simplicity bias into four different components, and they design experiments to target each of them by presenting choices whose underlying models differ only in one of these components. In the learning tasks, participants must infer a "law" (a logical rule) from observed data in a way that operationalizes the process of scientific reasoning in a controlled laboratory setting. The tasks are complex enough to be engaging but simple enough to allow for precise computational modeling.

      As a further novel feature, the authors derive an additional term in the expansion of the log-evidence, which arises from boundary terms. This is combined with a choice model, which is the one that is tested in experiments. Experiments are run both with humans and with artificial intelligence agents, showing that humans have an enhanced preference for simplicity as compared to artificial neural networks.

      Overall, the work is well written, interesting, and timely, bridging concepts in statistical inference and human decision making. Although technical details are rather elaborate, my understanding is that they represent the state of the art.

      I have only one main comment that I think deserves more comments. Computing the complexity penalty of models may be hard. It is unlikely that humans can perform such a calculation on the fly. As authors discuss in the final section, while the dimensionality term may be easier to compute, others (e.g., the volume term, which requires an integral) may be considerably harder to compute (it is true that they should be computed once and for all for each task, but still...). I wonder whether the sensitivity of human decision making with reference to the different terms is so different, and in particular whether it aligns with computational simplicity, or with the possibility of approximating each term by simple heuristics. Indeed, the sensitivity to the volume term is significantly and systematically lower than that of other terms. I wonder whether this relation could be made more quantitative using neural networks, using as a proxy of computational hardness the number of samples needed to reach a given error level in learning each of these terms.

      Thank you. The computational complexity associated with calculating the different terms and its potential connection to human sensitivity to the terms is an intriguing topic. As we hinted at in the discussion, we agree with the reviewer that this is a natural candidate for further research, which likely deserves its own study and exceeds the scope of the present paper. 

      As a minor aside, at least for the present task the volume term may not be that hard to compute, because it can be expressed with the number of distinguishable probability distributions in the model (Balasubramanian 1996). Given the nature of our task, where noise is Gaussian, isotropic and with known variance, the geometry of the model is actually the Euclidean geometry of the plane, and the volume is simply the (log of the) length of the line that represents the one-dimensional models, measured in units of the standard deviation of the noise.
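
The calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the one-dimensional model is represented as a polyline in the plane (the helper name `log_volume_1d` and the example coordinates are hypothetical), and any additive constants or sample-size factors that a particular definition of the volume term may carry are left out.

```python
import numpy as np


def log_volume_1d(points, sigma):
    """Log of the length of a polyline, measured in units of the noise s.d.

    With isotropic Gaussian noise of known standard deviation sigma, the Fisher
    information metric is the Euclidean metric scaled by 1 / sigma**2, so the
    integral of sqrt(det I) along a one-dimensional model is length / sigma.
    """
    pts = np.asarray(points, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    return float(np.log(segment_lengths.sum() / sigma))


# A straight segment of length 5 with unit noise.
example = log_volume_1d([(0.0, 0.0), (3.0, 4.0)], sigma=1.0)
```

Doubling the noise standard deviation halves the length in noise units, so the log-volume decreases by log 2.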

      Reviewer #3 (Public review):

      Summary:

      This is a very interesting paper that documents how humans use a variety of factors that penalize model complexity and integrate over a possible set of parameters within each model. By comparison, trained neural networks also use these biases, but only on tasks where model selection was part of the reward structure. In the situation where training emphasizes maximum-likelihood decisions, only neural networks, but not humans, were able to adapt their decision-making. Humans continue to use model integration simplicity biases.

      Strengths:

      This study used a pre-registered plan for analyzing human data, which exceeds the standards compared to other current studies.

      The results are technically correct.

      Weaknesses:

      The presentation of the results could be improved.

      We thank the reviewer for their appreciation of our experimental design and methodology, and for pointing out (in the separate "recommendations to authors") a few passages of the paper where the presentation could be improved. We will clarify these passages in the revision.

    1. Reviewer #1 (Public review):

      Summary:

      This is a careful, well-powered treatment of age effects in resting-state MEG. Rather than extracting (say) complex connectivity measures, the authors look at the 'simplest possible thing': changes in the overall power spectrum across age.

      Strengths:

      They find significant age-related changes at different frequency bands: broadly, attenuation at low-frequency (alpha) and increased beta. These patterns are identified in a large dataset (CamCAN) and then verified in other public data.

      Weaknesses:

      Some secondary interpretations (what is "unique" to age vs global anatomy) may go beyond what the statistics strictly warrant in the current form, but these can be tightened with (I think, fairly quick) additions already foreshadowed by the authors' own analyses.

      Aims:

      The authors set out to replace piecemeal, band-by-band ageing claims with t-maps and Cohen's f² over sensors × frequency ("GLM-Spectrum").

      On CamCAN, six spatio-spectral peaks survive relatively strict statistical controls. The larger effects are in low-frequency and upper-alpha/beta ranges (f² approx 0.2-0.3), while lower-alpha and gamma reach significance but with small practical impact (f² < 0.075). A nice finding is that the same qualitative profile appears in three additional independent datasets.
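
For readers less familiar with the effect-size measure quoted above, Cohen's f² for a regressor compares the variance explained with and without it: f² = (R²_full - R²_reduced) / (1 - R²_full). The sketch below applies this standard definition to a tiny made-up dataset. The helper names and the numbers are hypothetical, and it is not the GLM-Spectrum implementation.

```python
import numpy as np


def r_squared(X, y):
    """R² of an ordinary least-squares fit with an intercept (X=None: intercept only)."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y))]
    if X is not None:
        columns.append(np.asarray(X, dtype=float))
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)


def cohens_f2(X_full, X_reduced, y):
    """Cohen's f² for the regressors present in X_full but not in X_reduced."""
    r2_full = r_squared(X_full, y)
    r2_reduced = r_squared(X_reduced, y)
    return (r2_full - r2_reduced) / (1.0 - r2_full)


# Made-up example: "age" explains 90% of the variance in "power".
age = np.array([0.0, 1.0, 2.0, 3.0])
power = np.array([0.0, 1.0, 1.0, 2.0])
f2_age = cohens_f2(age, None, power)  # approximately 9.0, since R² = 0.9
```

By Cohen's conventional benchmarks (0.02 small, 0.15 medium, 0.35 large), values of 0.2-0.3 are medium-sized effects, while values below 0.075 are small.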

      Two analyses are especially interesting. First, the authors show a difference between absolute and relative spectral magnitude (basically, within-subject normalization). Relative scaling sharpens the spectral specificity of the spatial maps, while absolute magnitude is dominated by a broad spatial mode that correlates positively across frequencies, likely reflecting head-position/field-spread factors. The replication of the main age profile is robust to preprocessing decisions (e.g., SSS movement compensation choices) - the bigger determinant of the effect is whether they apply sensor normalization (relative vs absolute).

      Second, lots of brain-related things might be related to age, and the authors spend some time trying to back out confounds/covariates. This section is handled transparently (in general, I found the writing style very clear throughout) - they examine single covariates (sex, BP, GGMV, etc.) and compare simple vs partial age effects. For example, aging is correlated with reductions in global grey-matter volume (GGMV), but it would be nice to find a measure that is independent of this: controlling for GGMV (via a linear model) reduces age-related effect sizes heterogeneously across space/frequency but does not eliminate them, a nuance the authors treat carefully.

      This is a nice paper, and I have only a few concrete suggestions:

      (1) High-gamma:

      There can be a lot of EMG / eye movement contamination (I know these were RS eyes closed data, but still..) above 30-40 Hz, and these effects are the weakest anyway. Could you add an analysis (e.g., ICA/label-based muscle component removal) and show the gamma band's sensitivity to that step? Or just note this point more clearly?

      (2) GGMV confound control:

      Controlling for GGMV reduces, but does not eliminate, age effects. I have a few questions about this: a) Could we see the residuals as a function of age? I wonder if there are non-linear effects or something else that the regression is not accounting for. Also, b) GGMV and age are highly collinear - is this an issue? Can regression really split them apart robustly? I think by some cunning orthogonalisation, you can compute the effect of age independent of GGMV. I don't think this is the same as the effect 'adjusted' for GGMV (which is what is shown here if I'm reading it correctly). Finally, of course, GGMV might actually be the thing you want to look at (because it might more accurately reflect clinical issues) - so strong correlations are not really a problem: I think really the focus might even be on using MEG to predict GGMV and controlling for age.
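
One way to make question (b) concrete is the Frisch-Waugh-Lovell result for ordinary least squares: the coefficient on age in a model that also contains GGMV equals the coefficient obtained by first orthogonalising age against GGMV and then regressing the outcome on the residualised age. The sketch below checks this on synthetic data. All variable names and the simulated relationships are assumptions for illustration, not CamCAN values or the authors' pipeline.

```python
import numpy as np


def ols_coefs(X, y):
    """Ordinary least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def residualise(target, covariate):
    """Remove the part of target that is linearly predictable from covariate."""
    beta = ols_coefs(covariate, target)
    return target - (beta[0] + beta[1] * covariate)


# Synthetic data in which age and GGMV are strongly (negatively) correlated.
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 88, n)
age_z = (age - age.mean()) / age.std()
ggmv = -0.8 * age_z + 0.6 * rng.standard_normal(n)
power = 0.5 * age_z + 0.3 * ggmv + rng.standard_normal(n)

beta_simple = ols_coefs(age, power)[1]                              # age alone
beta_adjusted = ols_coefs(np.column_stack([age, ggmv]), power)[1]   # age with GGMV in the model
beta_orth = ols_coefs(residualise(age, ggmv), power)[1]             # age orthogonalised to GGMV
```

In this linear setting `beta_adjusted` and `beta_orth` coincide, while `beta_simple` differs because it also absorbs variance shared with GGMV. Strong collinearity does not bias the adjusted estimate, but it does inflate its standard error.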

    1. experiments and hardware design have a certain “latency” and need to be iterated upon a certain “irreducible” number of times in order to learn things that can’t be deduced logically. But massive parallelism may be possible on top of that

      If it ends up developing this far, I think the range of what could be discovered is endless, but the driving factor will be parallelism. As mentioned, experiments take so long that while one is running, many others could be run and tested as well; we just don't have the people or resources to do so right now.

  2. Dec 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) How is this simplified model representative of what is observed biologically? A bump model does not naturally produce oscillations. How would the dynamics of a rhythm generator interact with this simplistic model?

      Bump models naturally produce sequential activity, and can be engineered to repeat this sequential activity periodically (Zhang, 1996; Samsonovich and McNaughton, 1997; Murray and Escola, 2017). This is the basis for the oscillatory behavior in the model presented here. As we describe in our paper, such a model is consistent with numerous neurobiological observations about cell-type-specific connectivity patterns. The reviewer is, however, correct to point out that our model does not incorporate other key neurobiological features--in particular, intracellular dynamical properties--that have been shown to play important roles in rhythm generation. Our aim in this work is to establish a circuit-level mechanism for rhythm generation, complementary to classical models that rely on intracellular dynamics for rhythm generation. Whether and how these mechanisms work together is something that we plan to explore in future work, and we have added a sentence to the Discussion to this effect.

      (2) Would this theoretical construct survive being expressed in a biophysical model? It seems that it should, but even a simple biological model with the basic patterns of connectivity shown here would greatly increase confidence in the biological plausibility of the theory.

      We thank the reviewer for pointing out this way to strengthen our paper. We implemented the connectivity developed in the rate models in a spiking neuron model which used EI-balanced Poisson noise as input drive. We found that we could reproduce all the main results of our analysis. In particular, with a realistic number of neurons, we observed swimming activity characterized by (i) left-right alternation, (ii) rostral-caudal propagation, and (iii) variable speed control with constant phase lag. The spiking model demonstrates that the connectivity-motif-based mechanisms for rhythmogenesis that we propose are robust in a biophysical setting.

      We included these results in the updated manuscript in a new Results subsection titled “Robustness in a biophysical model.”

      (3) How stable is this model in its output patterns? Is it robust to noise? Does noise, in fact, smooth out the abrupt transitions in frequency in the middle range?

      The newly added spiking model implementation of the network demonstrates that the core mechanisms of our models are robust to noise, since the connectivity is randomly chosen and the input drive is Poisson noise.

      To test the effect of noise as it is parametrically varied, we also added noise directly to the rate models in the form of white noise input to each unit. Namely, the rate model was adapted to obey the stochastic differential equation

      \[

      \tau_i \frac{dr_i(t)}{dt} = -r_i(t) + \left[ \sum_j W_{ij} r_j(t - \Delta_{ij}) + D_i + \sigma\xi_t \right]_+

      \]

      Here $\xi_t$ is a standard Gaussian white noise and $\sigma$ sets the strength of the noise. We found that the swimming patterns were robust at all frequencies up to $\sigma =  0.05$. Above this level, coherent oscillations started to break down for some swim frequencies. To investigate whether the noise smoothed out abrupt transitions, we swept through different values of noise and modularity of excitatory connections. The results showed very minor improvement in controllability (see figure below), but this was not significant enough to include in the manuscript.

      Author response image 1.
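
The noisy rate equation above can be integrated numerically with a simple Euler-type scheme. The following is a minimal sketch under stated assumptions, not the authors' code: the noise is taken to be independent across units, all delays are collapsed into a single uniform delay expressed in time steps, and the function name, step size and test values are hypothetical.

```python
import numpy as np


def simulate_rate_model(W, D, tau, sigma, dt=1e-3, steps=2000, delay_steps=0, seed=0):
    """Integrate tau * dr/dt = -r + [W r(t - delay) + D + sigma * xi]_+ .

    W: (n, n) connectivity, D: (n,) constant drive, tau: scalar or (n,) time constants.
    Returns an array of shape (steps + 1, n) holding the rates over time.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    r = np.zeros((steps + 1, n))
    for t in range(steps):
        # Delayed rates; before the delay has elapsed, fall back to the initial state.
        r_delayed = r[max(t - delay_steps, 0)]
        # Discretised white noise: variance 1 / dt per step (assumed independent per unit).
        xi = rng.standard_normal(n) / np.sqrt(dt)
        drive = W @ r_delayed + D + sigma * xi
        r[t + 1] = r[t] + (dt / tau) * (-r[t] + np.maximum(drive, 0.0))
    return r


# Sanity check without coupling or noise: each rate relaxes to its drive.
rates = simulate_rate_model(W=np.zeros((3, 3)), D=np.ones(3), tau=0.01, sigma=0.0)
```

Because the noise enters inside the rectification, rates remain non-negative as long as `dt` does not exceed `tau`. Sweeping `sigma` in such a simulation is one way to probe where coherent oscillations begin to degrade.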

      (4) All figure captions are inadequate. They should have enough information for the reader to understand the figure and the point that was meant to be conveyed. For example, Figure 1 does not explain what the red dot is, what is black, what is white, or what the gradations of gray are. Or even if this is a representative connectivity of one node, or if this shows all the connections? The authors should not leave the reader guessing.

      All figure captions have been updated to enhance clarity and address these concerns.

      Reviewer #2 (Public review):

      (1) Figure 1A, if I interpret Figure 1B correctly, should there not be long descending projections as well that don't seem to be illustrated?

      Thank you for highlighting this potential point of confusion. The diagram in question was only intended to be a rough schematic of the types of connections present in the model. We have added additional descending connections as requested.

      (2) Page 5, It would be good to define what is meant by slow and fast here, as this definition changes with age in zebrafish (what developmental age)?

      We have updated the manuscript to include the sentence: “These values were chosen to coincide with observed ranges from larval zebrafish.” with appropriate citation.

      Reviewer #3 (Public review):

      (1) The authors describe a single unit as a neuron, be it excitatory or inhibitory, and the output of the simulation is the firing rate of these neurons. Experimentally and in other modeling studies, motor neurons are incorporated in the model, and the output of the network is based on motor neuron firing rate, not the interneurons themselves. Why did the authors choose to build the model this way?

      We chose to leave out the motor neurons from our models for a few reasons. While motor neurons read out the rhythmic activity generated by the interneurons and may provide some feedback, they are not required for rhythmogenesis. In fact, interneuron activity (especially in the excitatory V2a neurons (Agha et al., 2024)) is highly correlated with the ventral root bursts within the same segment. This suggests that motor neurons are primarily a local readout of the rhythmic activity of interneurons; therefore, the rhythmic swimming activity can be deduced directly from the interneurons themselves.

      Moreover, there is a lack of experimental observation of the connectivity between all the cell types considered in our model and motor neurons. Hence, it was unclear how we should include them in the model. To address this, we are currently developing a data-driven approach that will determine the proper connectivity between the motor neurons and the interneurons, including intrasegmental connections.

      (2) In the single population model (Figure 1), the authors use ipsilateral inhibitory connections that are long-range in an ascending direction. Experimentally, these connections have been shown to be local, while long-range ipsilateral connections have been shown to be descending. What were the reasons the authors chose this connectivity? Do the authors think local ascending inhibitions contribute to rostrocaudal propagation, and how?

      The long-range ascending ipsilateral inhibitory connections arise from a limitation of our modeling framework. The V1 neurons that provide these connections have been shown experimentally to fire later than other neurons (especially descending V2a neurons) within the same hemisegment (Jay et al., J Neurosci, 2023); however, our model can only produce synchronized local activity. Hence, we replace local phase offsets with spatial offsets to produce correctly structured recurrent phasic inputs. We are currently investigating a data-driven method for determining intrasegmental connectivity, which should be able to produce the local phase offset and address this concern; however, this is beyond the scope of the current paper.

      (3) In the two-population model, the authors show independent control of frequency and rhythm, as has been reported experimentally. However, in these previous experimental studies, frequency and amplitude are regulated by different neurons, suggesting different networks dedicated to frequency and amplitude control. However, in the current model, the same population with the same connections can contribute to frequency or amplitude depending on relative tonic drive. Can the authors please address these differences either by changes in the model or by adding to the Discussion?

      Our prior experimental results that suggested a separation of frequency and amplitude control circuits focused on motor neuron recruitment rather than interneuron activity (Jay et al., J Neurosci 2023; Menelaou and McLean, Nat Commun 2019). To avoid potential confusion about amplitudes of interneurons vs. of motor neurons, we have removed the results from Figure 3 about control of amplitude in the 2-population model, instead focusing this figure on the control of frequency via speed-module recruitment. For the same reason, we have removed the panel showing the effects of targeted ablations on interneuron amplitudes in Figure 7. We have kept the result about amplitude control in our Supplemental Figure S2 for the 8-population model, but we try to make it clear in the text that any relationship between interneuron amplitude and motor neuron amplitude would depend on how motor neurons are modeled, which we do not pursue in this work.

      (4) It would be helpful to add a paragraph in the Discussion on how these results could be applicable to other model systems beyond zebrafish. Cell intrinsic rhythmogenesis is a popular concept in the field, and these results show an interesting and novel alternative. It would help to know if there is any experimental evidence suggesting such network-based propagation in other systems, invertebrates, or vertebrates.

      We have expanded a paragraph in the Discussion to address these questions. In particular, we highlight how a recent study of mouse locomotor circuits produced a model with similar key features (Komi et al., 2024). These authors made direct use of experimentally determined connectivity structure and cell-type distributions, which informed a model that produced purely network-based rhythmogenesis. We also point out that inhibition-dominated connectivity has been used for understanding oscillatory behavior in neural circuits outside the context of motor control (Zhang, 1996; Samsonovich and McNaughton, 1997; Murray and Escola, 2017). Finally, we address a study that used the cell-type-specific connectivity within the C. elegans locomotor circuit as the architecture for an artificial motor control system and found that the resulting system could learn motor control tasks more efficiently than general machine learning architectures (Bhattasali et al., 2022). Like our model, the Komi et al. and Bhattasali et al. models generate rhythm via structured connectivity motifs rather than via intracellular dynamical properties, suggesting that these may be a key mechanism underlying locomotion across species.

      Reviewer #1 (Recommendations for the authors):

      (1) Express this modeling construct in a simple biophysical model.

      See the new Results subsection titled “Robustness in a biophysical model.”

      (2) Please cite the classic models of Kopell, Ermentrout, Williams, Sigvardt etc., especially where you say "classic models".

      We have added relevant citations including the mentioned authors.

      (3) "Rhythmogenesis remain incompletely understood" changed to "Rhythmogenesis remains incompletely understood".

      We chose not to make this change since the ‘remain’ refers to the plural ‘core mechanisms’ not the singular ‘rhythmogenesis’.

      Reviewer #3 (Recommendations for the authors):

      (1) The figures are well made; however, it would help to add more details to the figure legends. For example, what neuron's firing rate is shown in Figure 1C? What is the red dot in 1B? Figures 3E,F,G: what is being plotted? Mean and SD? Blue dot in Figure 5C?

      All figure captions have been updated to enhance clarity and address these concerns.

      (2) A, B text missing in Figure 7.

      We have revised this figure and its caption; please see our response to Comment 3 above.

      (3) It would be nice to see the tonic drive pattern that is fed to the model for each case, along with the different firing rates in the figures. It would help understand how the tonic drive is changed to rhythmic activity.

      The tonic drive in the rate models is implemented as a constant excitatory input that is uniform across all units within the same speed-population. There is no patterning in time or location to this drive.
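
      As a minimal sketch of what this means in practice, the drive can be written as an array that is constant over time and takes a single value per speed-population. The function name `tonic_drive`, the population labels, and the drive values below are hypothetical and chosen only for illustration.

```python
import numpy as np

def tonic_drive(unit_population, drive_by_population, n_steps):
    """Return an (n_steps, n_units) array of tonic input.

    Every unit receives the drive assigned to its speed-population, and the
    same row is repeated at every time step (no temporal or spatial pattern)."""
    per_unit = np.array([drive_by_population[p] for p in unit_population], dtype=float)
    return np.tile(per_unit, (n_steps, 1))

# Hypothetical example: two "slow" units driven, one "fast" unit silent.
drive = tonic_drive(["slow", "slow", "fast"], {"slow": 1.0, "fast": 0.0}, n_steps=4)
```

      Any rhythmicity in the output therefore has to come from the network connectivity, since each column of `drive` is flat in time.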

      References

      (1) Moneeza A Agha, Sandeep Kishore, and David L McLean. Cell-type-specific origins of locomotor rhythmicity at different speeds in larval zebrafish. eLife, July 2024

      (2) Nikhil Bhattasali, Anthony M Zador, and Tatiana Engel. Neural circuit architectural priors for embodied control. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 12744–12759. Curran Associates, Inc., 2022.

      (3) Salif Komi, August Winther, Grace A. Houser, Roar Jakob Sørensen, Silas Dalum Larsen, Madelaine C. Adamssom Bonfils, Guanghui Li, and Rune W. Berg. Spatial and network principles behind neural generation of locomotion. bioRxiv, 2024

      (4) James M Murray and G Sean Escola. Learning multiple variable-speed sequences in striatum via cortical tutoring. eLife, 6:e26084, May 2017.

      (5) Alexei Samsonovich and Bruce L McNaughton. Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15):5900–5920, 1997.

      (6) K Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. Journal of Neuroscience, 16(6):2112–2126, 1996.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      We thank the Reviewers for their thorough attention to our paper and the interesting discussion about the findings. Before responding to more specific comments, here are some general points we would like to clarify:

      (1) Ecological niche models are indeed correlative models, and we used them to highlight environmental factors associated with HPAI outbreaks within two host groups. We will further revise the terminology that could still unintentionally suggest causal inference. The few remaining ambiguities were mainly in the Discussion section, where our intent was to interpret the results in light of the broader scientific literature. Particularly, we will change the following expressions:

      -  “Which factors can explain…” to “Which factors are associated with…” (line 75);

      -  “the environmental and anthropogenic factors influencing” to “the environmental and anthropogenic factors that are correlated with” (line 273);

      -  “underscoring the influence” to “underscoring the strong association” (line 282).

      (2) We respectfully disagree with the suggestion that an ecological niche modelling (ENM) approach is not appropriate for this work and the research question addressed therein. Ecological niche models are specifically designed to estimate the spatial distribution of the environmental suitability of species and pathogens, making them well suited to our research questions. In our study, we have also explicitly detailed the known limitations of ecological niche models in the Discussion section, in line with prior literature, to ensure their appropriate interpretation in the context of HPAI.

      (3) The environmental layers used in our models were restricted to those available at a global scale, as listed in Supplementary Information Resources S1 (https://github.com/sdellicour/h5nx_risk_mapping/blob/master/Scripts_%26_data/SI_Resource_S1.xlsx). Naturally, not all potentially relevant environmental factors could be included, but the selected layers are explicitly documented and only these were assessed for their importance. Despite this limitation, the performance metrics indicate that the models performed well, suggesting that the chosen covariates capture meaningful associations with HPAI occurrence at a global scale.

      Reviewer #1 (Public review):

      The authors aim to predict ecological suitability for transmission of highly pathogenic avian influenza (HPAI) using ecological niche models. This class of models identifies correlations between the locations of species or disease detections and the environment. These correlations are then used to predict habitat suitability (in this work, ecological suitability for disease transmission) in locations where surveillance of the species or disease has not been conducted. The authors fit separate models for HPAI detections in wild birds and farmed birds, for two strains of HPAI (H5N1 and H5Nx) and for two time periods, pre- and post-2020. The authors also validate models fitted to disease occurrence data from pre-2020 using post-2020 occurrence data. I thank the authors for taking the time to respond to my initial review and I provide some follow-up below.

      Detailed comments:

      In my review, I asked the authors to clarify the meaning of "spillover" within the HPAI transmission cycle. This term is still not entirely clear: at lines 409-410, the authors use the term with reference to transmission between wild birds and farmed birds, as distinct to transmission between farmed birds. It is implied but not explicitly stated that "spillover" is relevant to the transmission cycle in farmed birds only. The sentence, "we developed separate ecological niche models for wild and domestic bird HPAI occurrences ..." could have been supported by a clear sentence describing the transmission cycle, to prime the reader for why two separate models were necessary.

      We respectfully disagree that the term “spillover” is unclear in the manuscript. In both the Methods and Discussion sections (lines 387-391 and 409-414), we explicitly define “spillover” as the introduction of HPAI viruses from wild birds into domestic poultry, and we distinguish this from secondary farm-to-farm transmission. Our use of separate ecological niche models for wild and domestic outbreaks reflects not only the distinction between primary spillover and secondary transmission, but also the fundamentally different ecological processes, surveillance systems, and management implications that shape outbreaks in these two groups. We will clarify this choice in the revised manuscript when introducing the separate models. Furthermore, on line 83, we will add “as these two groups are influenced by different ecological processes, surveillance biases, and management contexts”.

      I also queried the importance of (dead-end) mammalian infections to a model of the HPAI transmission risk, to which the authors responded: "While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds." I would argue that any infections, whether they are in dead-end or competent hosts, represent the presence of environmental conditions to support transmission so are certainly relevant to a niche model and therefore within scope. It is certainly understandable if the authors have not been able to access data of mammalian infections, but it is an oversight to dismiss these infections as irrelevant.

      We understand the Reviewer’s point, but our study was designed to model HPAI occurrence in avian hosts only. We therefore restricted our analysis to wild birds and domestic poultry, which represent the primary hosts for HPAI circulation and the focus of surveillance and control measures. While mammalian detections have been reported, they are outside the scope of this work.

      Correlative ecological niche models, including BRTs, learn relationships between occurrence data and covariate data to make predictions, irrespective of correlations between covariates. I am not convinced that the authors can make any "interpretation" (line 298) that the covariates that are most informative to their models have any "influence" (line 282) on their response variable. Indeed, the observation that "land-use and climatic predictors do not play an important role in the niche ecological models" (line 286), while "intensive chicken population density emerges as a significant predictor" (line 282) begs the question: from an operational perspective, is the best (e.g., most interpretable and quickest to generate) model of HPAI risk a map of poultry farming intensity?

      We agree that poultry density may partly reflect reporting bias, but we also consider it a meaningful predictor of HPAI risk, so its importance in our models is expected. Importantly, our BRT framework does more than reproduce poultry distribution: it captures non-linear relationships and interactions with other covariates, allowing a more nuanced characterisation of risk than a simple poultry density map. Note also that our models distinguish intensive chicken density, extensive chicken density, and duck density; the result is therefore not simply a “map of poultry farming intensity”.

      At line 282, we used the word “influence” while fully recognising that correlative models cannot establish causality. Indeed, in our analyses, “relative influence” refers to the importance metric produced by the BRT algorithm (Ridgeway, 2020), which measures correlative associations between environmental factors and outbreak occurrences. These scores are interpreted in light of the broader scientific literature; our interpretations therefore build on both our results and existing evidence, rather than on our models alone. However, in the next version of the paper, we will revise the sentence as: “underscoring the strong association of poultry farming practices with HPAI spread (Dhingra et al., 2016)”.

      I have more significant concerns about the authors' treatment of sampling bias: "We agree with the Reviewer's comment that poultry density could have potentially been considered to guide the sampling effort of the pseudo-absences to consider when training domestic bird models. We however prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability to sample pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, considering this environmental layer for defining the relative probability to sample pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models." The authors have elected to ignore a fundamental feature of distribution modelling with occurrence-only data: if we include a source of sampling bias as a covariate and do not include it when we sample background data, then that covariate would appear to be correlated with presence. They acknowledge this later in their response to my review: "...assuming a sampling bias correlated with poultry density would result in reducing its effect as a risk factor." In other words, the apparent predictive capacity of poultry density is a function of how the authors have constructed the sampling bias for their models. A reader of the manuscript can reasonably ask the question: to what degree is the model a model of HPAI transmission risk, and to what degree is the model a model of the observation process?

      The sentence at lines 474-477 is a helpful addition; however, the preceding sentence, "Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry," (line 474) is included without acknowledgement of the flow-on consequence to one of the key findings of the manuscript, that "...intensive chicken population density emerges as a significant predictor..." (line 282). The additional context on the EMPRES-i dataset at lines 475-476 ("the locations of outbreaks ... are often georeferenced using place name nomenclatures") is in conflict with the description of the dataset at line 407 ("precise location coordinates"). Ultimately, the choices that the authors have made are entirely defensible through a clear, concise description of model features and assumptions, and precise language to guide the reader through interpretation of results. I am not satisfied that this is provided in the revised manuscript.

      We thank the Reviewer for this important point. To address it, we compared model predictive performance and covariate relative influences obtained when pseudo-absences were weighted by poultry density versus human population density (Author response table 1). The results show that differences between the two approaches are marginal, both in predictive performance (ΔAUC ranging from -0.013 to +0.002) and in the ranking of key predictors (see Author response images 1 and 2 below). For instance, intensive chicken density consistently emerged as an important predictor regardless of the bias layer used.

      Note: the comparison was conducted using a simplified BRT configuration for computational efficiency (fewer trees, fixed 5-fold random cross-validation, and standardised parameters). Therefore, absolute values of AUC and variable importance may differ slightly from those in the manuscript, but the relative ranking of predictors and the overall conclusions remain consistent.

      Given these small differences, we retained the approach using human population density. We agree that poultry density partly reflects surveillance bias as well as true epidemiological risk, and we will clarify this in the revised manuscript by noting that the predictive role of poultry density reflects both biological processes and surveillance systems. Furthermore, on line 289, we will add “We note, however, that intensive poultry density may reflect both surveillance intensity and epidemiological risk, and its predictive role in our models should be interpreted in light of both processes”.
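
      To make the comparison concrete, the sketch below shows one way bias-weighted pseudo-absence sampling can be implemented: background pixels are drawn with probability proportional to a bias layer, which can be swapped between human population density and poultry density. This is an illustration only, not the code used for the analyses; the function name `sample_pseudo_absences`, the use of NaN to mark pixels outside the background area, and sampling without replacement are assumptions.

```python
import numpy as np

def sample_pseudo_absences(bias_layer, n_points, seed=0):
    """Draw pseudo-absence pixels with probability proportional to a bias layer.

    bias_layer: 2D array of non-negative weights (e.g. human population density);
                NaN marks pixels outside the background area (assumed convention).
    Returns an (n_points, 2) array of (row, col) pixel indices."""
    rng = np.random.default_rng(seed)
    layer = np.asarray(bias_layer, dtype=float)
    weights = np.nan_to_num(layer, nan=0.0).ravel()   # excluded pixels get zero weight
    prob = weights / weights.sum()
    flat_idx = rng.choice(weights.size, size=n_points, replace=False, p=prob)
    rows, cols = np.unravel_index(flat_idx, layer.shape)
    return np.column_stack([rows, cols])
```

      Running the same model-fitting pipeline on pseudo-absences drawn from each candidate bias layer, and comparing AUC and relative influence between the two runs, corresponds to the kind of sensitivity check summarised in Author response table 1.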

      Author response table 1.

      Comparison of model predictive performance (AUC) between models in which pseudo-absence sampling was weighted by poultry density and by human population density, across host groups, virus types, and time periods. Differences in AUC values are shown as the value for poultry-weighted minus human-weighted pseudo-absences.

      Author response image 1.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for domestic bird outbreaks. Results are shown for four datasets: H5N1 (<2020), H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      Author response image 2.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for wild bird outbreaks. Results are shown for three datasets: H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      The authors have slightly misunderstood my comment on "extrapolation": I referred to "environmental extrapolation" in my review without being particularly explicit about my meaning. By "environmental extrapolation", I meant to ask whether the models were predicting to environments that are outside the extent of environments included in the occurrence data used in the manuscript. The authors appear to have understood this to be a comment on geographic extrapolation, or predicting to areas outside the geographic extent included in occurrence data, e.g.: "For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data" (lines 195-197). Is the model extrapolating in environmental space in these regions? This is unclear. I do not suggest that the authors should carry out further analysis, but the multivariate environmental similarity surface (MESS; see Elith et al., 2010) is a useful tool to visualise environmental extrapolation and aid model interpretation.

      On the subject of "extrapolation", I am also concerned by the additions at lines 362-370: "...our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions." The "discrepancy" cited here is a feature of the input dataset, a function of the observation distribution that should be captured in pseudo-absence data. The authors state that Kazakhstan and Central Asia are areas of interest, and that the environments in this region are outside the extent of environments captured in the occurrence dataset, although it is unclear whether "extrapolation" is informed by a quantitative tool like a MESS or judged by some other qualitative test. The authors then cite Australia as an example of a region with some predicted suitability but no HPAI outbreaks to date, however this discussion point is not linked to the idea that the presence of environmental conditions to support transmission need not imply the occurrence of transmission (as in the addition, "...spatial isolation may imply a lower risk of actual occurrences..." at line 214). Ultimately, the authors have not added any clear comment on model uncertainty (e.g., variation between replicated BRTs) as I suggested might be helpful to support their description of model predictions.

      Many thanks for the clarification and for these observations; we had indeed interpreted your previous comments in terms of geographic extrapolation. We will adjust the wording to further clarify that predictions of ecological suitability in areas with few or no reported outbreaks (e.g., Central Asia, Australia) are not model errors but expected extrapolations, since ecological suitability does not imply confirmed transmission (for instance, on line 362, “our models extrapolate environmental suitability” will be changed to “Interestingly, our models extrapolate geographical”). These predictions indicate potential environments favorable to circulation if the virus were introduced.

      In our study, model uncertainty is formally assessed when comparing the predictive performances of our models (Fig. S3, Table S1), the relative influence (Table S3) and response curves (Fig. 2) associated with each environmental factor (Table S2). All these results confirm good convergence between replicates. Finally, we indeed did not use a quantitative tool such as a MESS to assess extrapolation but relied on qualitative interpretation of model outputs.
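
      For reference, the MESS mentioned by the Reviewer can be computed directly from covariate values. The sketch below follows the per-variable similarity definition given by Elith et al. (2010) and takes the minimum across variables; it was not part of our analyses, and the function name `mess` and the array layout are assumptions made for illustration.

```python
import numpy as np

def mess(reference, targets):
    """Multivariate environmental similarity surface (after Elith et al., 2010).

    reference: (n_ref, n_vars) covariate values at the training points.
    targets:   (n_tgt, n_vars) covariate values at the prediction pixels.
    Returns (n_tgt,) scores; negative values flag environmental extrapolation.
    Assumes every covariate varies in the reference set (max > min)."""
    reference = np.asarray(reference, dtype=float)
    targets = np.asarray(targets, dtype=float)
    ref_min = reference.min(axis=0)
    ref_max = reference.max(axis=0)
    span = ref_max - ref_min
    # f: percentage of reference points strictly below each target value
    f = 100.0 * (reference[None, :, :] < targets[:, None, :]).mean(axis=1)
    similarity = np.where(
        f == 0, 100.0 * (targets - ref_min) / span,           # at or below the reference minimum
        np.where(f <= 50, 2.0 * f,                              # lower half of the reference range
                 np.where(f < 100, 2.0 * (100.0 - f),           # upper half of the reference range
                          100.0 * (ref_max - targets) / span)))  # above the reference maximum
    return similarity.min(axis=1)                               # most dissimilar variable wins
```

      Mapping these scores over the prediction area would show where suitability estimates rest on covariate values outside the range seen in the training data.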

      All of my criticisms are, of course, applied with the understanding that niche modelling is imperfect for a disease like HPAI, and that data may be biased/incomplete, etc.: these caveats are common across the niche modelling literature. However, if language around the transmission cycle, the niche, and the interpretation of any of the models is imprecise, which I find it to be in the revised manuscript, it undermines all of the science that is presented in this work.

      We respectfully disagree with this comment. The scope of our study and the methods employed are clearly defined in the manuscript, and the limitations of ecological niche modelling in this context are explicitly acknowledged in the Discussion section. While we appreciate the Reviewer’s concern, the comment does not provide specific examples of unclear or imprecise language regarding the transmission cycle, niche, or interpretation of the models. Without such examples, it is difficult to identify further revisions that would improve clarity.

      Reviewer #2 (Public review):

      The geographic range of highly pathogenic avian influenza cases changed substantially around the period 2020, and there is much interest in understanding why. Since 2020 the pathogen irrupted in the Americas and the distribution in Asia changed dramatically. This study aimed to determine which spatial factors (environmental, agronomic and socio-economic) explain the change in numbers and locations of cases reported since 2020 (2020-2023). That's a causal question which they address by applying a correlative environmental niche modelling (ENM) approach to the avian influenza case data before (2015-2020) and after 2020 (2020-2023) and separately for confirmed cases in wild and domestic birds. To address their questions they compare the outputs of the respective models, and those of the first global model of the HPAI niche published by Dhingra et al 2016.

      We do not agree with this comment. The manuscript makes clear that we quantitatively assess factors that are associated with occurrence data before and after 2020. We do not claim to determine causality. One sentence of the Introduction section (lines 75-76) could be confusing, so we intend to modify it in the final revision of our manuscript.

      ENM is a correlative approach useful for extrapolating understandings based on sparse geographically referenced observational data over un- or under-sampled areas with similar environmental characteristics in the form of a continuous map. In this case, because the selected covariates about land cover, use, population and environment are broadly available over the entire world, modelled associations between the response and those covariates can be projected (predicted) back to space in the form of a continuous map of the HPAI niche for the entire world.

      We fully agree with this assessment of ENM approaches.

      Strengths:

      The authors are clear about expected bias in the detection of cases, such as geographic variation in surveillance effort (testing of symptomatic or dead wildlife, testing domestic flocks) and in general more detections near areas of higher human population density (because if a tree falls in a forest and there is no-one there, etc), and take steps to ameliorate those. The authors use boosted regression trees to implement the ENM, which typically feature among the best performing models for this application (also known as habitat suitability models). They ran replicate sets of the analysis for each of their model targets (wild/domestic x pathogen variant), which can help produce stable predictions. Their code and data are provided, though I did not verify that the work was reproducible.

      The paper can be read as a partial update to the first global model of H5Nx transmission by Dhingra and others published in 2016 and explicitly follows many methodological elements. Because they use the same covariate sets as used by Dhingra et al 2016 (including the comparisons of the performance of the sets in spatial cross-validation) and for both time periods of interest in the current work, comparison of model outputs is possible. The authors further facilitate those comparisons with clear graphics and supplementary analyses and presentation. The models can also be explored interactively at a weblink provided in text, though it would be good to see the model training data there too.

      The authors' comparison of ENM model outputs generated from the distinct HPAI case datasets is interesting and worthwhile, though for me, only as a response to differently framed research questions.

      Weaknesses:

      This well-presented and technically well-executed paper has one major weakness to my mind. I don't believe that ENM models were an appropriate tool to address their stated goal, which was to identify the factors that "explain" changing HPAI epidemiology.

      Here is how I understand and unpack that weakness:

      (1) Because of their fundamentally correlative nature, ENMs are not a strong candidate for exploring or inferring causal relationships.

      (2) Generating ENMs for a species whose distribution is undergoing broad-scale range change is complicated and requires particular caution and nuance in interpretation (e.g., Elith et al., 2010). An important general assumption of environmental niche models is that the target species is at some kind of distributional equilibrium (at time scales relevant to the model application). In practice, that means the species has had an opportunity to reach all suitable habitats and therefore its absence from some can be interpreted as either unfavourable environment or interactions with other species. Here data sets for the response (H5N1 or H5Nx case data in domestic or wild birds) were divided into two periods, 2015-2020 and 2020-2023, based on the rationale that the geographic locations and host-species profile of cases detected in the latter period were suggestive of changed epidemiology. In comparing outputs from multiple ENMs for the same target from distinct time periods the authors are expertly working in, or even dancing around, what is a known grey area, and they need to make the necessary assumptions and caveats obvious to readers.

      We thank the Reviewer for this observation. First, we constrained pseudo-absence sampling to countries and regions where outbreaks had been reported, reducing the risk of interpreting non-affected areas as environmentally unsuitable. Second, we deliberately split the outbreak data into two periods (2015-2020 and 2020-2023) because we do not assume a single stable equilibrium across the full study timeframe. This division reflects known epidemiological changes around 2020 and allows each period to be modeled independently. Within each period, ENM outputs are interpreted as associations between outbreaks and covariates, not as equilibrium distributions. Finally, by testing prediction across periods, we assessed both niche stability and potential niche shifts. These clarifications will be added to the manuscript to make our assumptions and limitations explicit.

      Line 66, we will add: “Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution. To account for this, we analysed two distinct time periods (2015-2020 and 2020-2023).”

      Line 123, we will revise “These findings underscore the ability of pre-2020 models in forecasting the recent geographic distribution of ecological suitability for H5Nx and H5N1 occurrences” to “These results suggest that pre-2020 models captured broad patterns of suitability for H5Nx and H5N1 outbreaks, while post-2020 models provided a closer fit to the more recent epidemiological situation”.

      (3) To generate global prediction maps via ENM, only variables that exist at appropriate resolution over the desired area can be supplied as covariates. What processes could influence changing epidemiology of a pathogen and are there covariates that represent them? Introduction to a new geographic area (continent) with naive population, immunity in previously exposed populations, control measures to limit spread such as vaccination or destruction of vulnerable populations or flocks? Might those control measures be more or less likely depending on the country as a function of its resources and governance? There aren't globally available datasets that speak to those factors, so the question is not why they were omitted but rather whether the authors' decision to choose ENMs was justified given their question. How valuable are insights based on patterns of correlation change when considering different temporal sets of HPAI cases in relation to a common and somewhat anachronistic set of covariates?

      We agree that the ecological niche models trained in our study are limited to environmental and host factors, as described in the Methods section with the selection of predictors. While such models cannot capture causality or represent processes such as immunity, control measures, or governance, they remain a useful tool for identifying broad associations between outbreak occurrence and environmental context. Our study cannot infer the full mechanisms driving changes in HPAI epidemiology, but it does provide a globally consistent framework to examine how associations with available covariates vary across time periods.

      (4) In general the study is somewhat incoherent with respect to time. Though the case data come from different time periods, each response dataset was modelled separately using exactly the same covariate dataset that predated both sets. That decision should be understood as a strong assumption on the part of the authors that conditions the interpretation: the world (as represented by the covariate set) is immutable, so the model has to return different correlative associations between the case data and the covariates to explain the new data. While the world represented by the selected covariates *may* be relatively stable (could be statistically confirmed), what about the world not represented by the covariates (see point 3)?

      We used the same covariate layers for both periods, which indeed assumes that these environmental and host factors are relatively stable at the global scale over the short timeframe considered. We believe this assumption is reasonable, as poultry density, land cover, and climate baselines do not change drastically between 2015 and 2023 at the resolution of our analysis. We agree, however, that unmeasured processes such as control measures, immunity, or governance may have changed during this time and are not captured by our covariates.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      - Line 400-401: "over the 2003-2016 periods" has an extra "s"; "two host species" (with reference to wild and domestic birds) would be more precise as "two host groups".

      - Remove comma line 404

      Many thanks for these comments, we have modified the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      Most of my work this round is encapsulated in the public part of the review.

      The authors responded positively to the review efforts from the previous round, but I was underwhelmed with the changes to the text that resulted. Particularly in regard to limiting assumptions - the way that they augmented the text to refer to limitations raised in review downplayed the importance of the assumptions they've made. So they acknowledge the significance of the limitation in their rejoinder, but in the amended text merely note the limitation without giving any sense of what it means for their interpretation of the findings of this study.

      The abstract and findings are essentially unchanged from the previous draft.

      I still feel the near causal statements of interpretation about the covariates are concerning. These models really are not a good candidate for supporting the inference that they are making and there seem to be very strong arguments in favour of adding covariates that are not globally available.

      We never claimed causal interpretation, and we have consistently framed our analyses in terms of associations rather than mechanisms. We acknowledge that one phrasing in the research questions (“Which factors can explain…”) could be misinterpreted, and we are correcting this in the revised version to read “Which factors are associated with…”. Our approach follows standard ecological niche modelling practice, which identifies statistical associations between occurrence data and covariates. As noted in the Discussion section, these associations should not be interpreted as direct causal mechanisms. Finally, all interpretive points in the manuscript are supported by published literature, and we consider this framing both appropriate and consistent with best practice in ecological niche modelling (ENM) studies.

      We assessed predictor contributions using the “relative influence” metric, the terminology reported by the R package “gbm” (Ridgeway, 2020). This metric quantifies the contribution of each variable to model fit across all trees, rescaled to sum to 100%, and should be interpreted as an association rather than a causal effect.
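
      The rescaling described above can be illustrated with a minimal Python sketch. It is not the gbm implementation: the function name, the predictor names, and the per-tree improvement values are hypothetical, and the sketch only shows the idea of summing each predictor's contribution over all trees and rescaling the totals to 100%.

```python
def relative_influence(improvements_per_tree):
    """Sum each predictor's split improvements over all trees, rescaled to 100%."""
    totals = {}
    for tree in improvements_per_tree:
        for predictor, gain in tree.items():
            totals[predictor] = totals.get(predictor, 0.0) + gain
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no split improvements recorded")
    return {name: 100.0 * value / grand_total for name, value in totals.items()}


# Hypothetical improvements in model fit for three trees (made-up predictors).
trees = [
    {"poultry_density": 4.0, "temperature": 1.0},
    {"poultry_density": 2.0, "open_water": 2.0},
    {"temperature": 1.0},
]
influence = relative_influence(trees)
for name, share in sorted(influence.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1f}%")
```

      A large share in such a table says only that the predictor was used often and effectively in the fitted trees; it carries no information about the direction or the causal nature of the relationship.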

      L65-66 The general difficulty of interpreting ENM output with range-shifting species should be cited here to alert readers that they should not blithely attempt what follows at home.

      I believe that their analysis is interesting and technically very well executed, so it has been a disappointment and hard work to write this assessment. My rough-cut last paragraph of a reframed intro would go something like - there are many reasons in the literature not to do what we are about to do, but here's why we think it can be instructive and informative, within certain guardrails.

      To acknowledge this comment and the previous one, we revised lines 65-66 to: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses. Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution.”

      We respectfully disagree with the Reviewer’s statement that “there are many reasons in the literature not to do what we are about to do”. All modeling approaches, including mechanistic ones, have limitations, and the literature is clear on both the strengths and constraints of ecological niche models. Our manuscript openly acknowledges these limits and frames our findings accordingly. We therefore believe that our use of an ENM approach is justified and contributes valuable insights within these well-defined boundaries.

      Reference: Ridgeway, G. (2007). Generalized Boosted Models: A guide to the gbm package. Update, 1(1), 2007.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public review):

      Summary:

      Using a gerbil model, the authors tested the hypothesis that loss of synapses between sensory hair cells and auditory nerve fibers (which may occur due to noise exposure or aging) affects behavioral discrimination of the rapid temporal fluctuations of sounds. In contrast to previous suggestions in the literature, their results do not support this hypothesis; young animals treated with a compound that reduces the number of synapses did not show impaired discrimination compared to controls. Additionally, their results from older animals showing impaired discrimination suggest that age-related changes aside from synaptopathy are responsible for the age-related decline in discrimination.

      Strengths:

      (1) The rationale and hypothesis are well-motivated and clearly presented.

      (2) The study was well conducted with strong methodology for the most part, and good experimental control. The combination of physiological and behavioral techniques is powerful and informative. Reducing synapse counts fairly directly using ouabain is a cleaner design than using noise exposure or age (as in other studies), since these latter modifiers have additional effects on auditory function.

      (3) The study may have a considerable impact on the field. The findings could have important implications for our understanding of cochlear synaptopathy, one of the most highly researched and potentially impactful developments in hearing science in the past fifteen years.

      Weaknesses:

      (1) I have concerns that the gerbils may not have been performing the behavioral task using temporal fine structure information.

      Human studies using the same task employed a filter center frequency that was (at least) 11 times the fundamental frequency (Marmel et al., 2015; Moore and Sek, 2009). Moore and Sek wrote: "the default (recommended) value of the centre frequency is 11F0." Here, the center frequency was only 4 or 8 times the fundamental frequency (4F0 or 8F0). Hence, relative to harmonic frequency, the harmonic spacing was considerably greater in the present study. However, gerbil auditory filters are thought to be broader than those in humans. In the revised version of the manuscript, the authors provide modelling results suggesting that the excitation patterns were discriminable for the 4F0 conditions, but may not have been for the 8F0 conditions. These results provide some reassurance that the 8F0 discriminations were dependent on temporal cues, but the description of the model lacks detail. Also, the authors state that "thus, for these two conditions with harmonic number N of 8 the gerbils cannot rely on differences in the excitation patterns but must solve the task by comparing the temporal fine structure." This is too strong. Pulsed tone intensity difference limens (the reference used for establishing whether or not the excitation pattern cues were usable) may not be directly comparable to profile-analysis-like conditions, and it has been argued that frequency discrimination may be more sensitive to excitation pattern cues than predicted from a simple comparison to intensity difference limens (Micheyl et al. 2013, https://doi.org/10.1371/journal.pcbi.1003336).
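
      The point about harmonic spacing can be made concrete with a short Python sketch. It assumes only that neighbouring components are separated by F0 and that the passband is centered at N times F0, so the spacing relative to the center frequency is 1/N; the function name is illustrative.

```python
def relative_spacing(harmonic_number):
    """Component spacing (F0) as a fraction of the center frequency (N * F0)."""
    if harmonic_number <= 0:
        raise ValueError("harmonic number must be positive")
    return 1.0 / harmonic_number


# N = 4 and N = 8 are the conditions discussed here; N = 11 is the value
# recommended by Moore and Sek (2009) for human listeners.
spacings = {n: relative_spacing(n) for n in (4, 8, 11)}
for n, fraction in spacings.items():
    print(f"N = {n:2d}: spacing is {100 * fraction:.1f}% of the center frequency")
```

      Whether components at a given relative spacing are resolved depends on the auditory-filter bandwidth of the species, which is why the comparison with human data is not straightforward.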

      We consider our conclusions based on the excitation patterns to be adequate when gerbil auditory-filter data, frequency-difference limens, and intensity-difference limens are viewed together. Kittel et al. (2002) observed auditory-filter bandwidths in the gerbil that are larger than those in humans by a factor of about 2, which reduces the number of independent frequency channels available for the analysis of excitation patterns. The gerbil frequency-difference limen for pure tones, an indicator of the sensitivity to excitation-pattern cues, is more than an order of magnitude larger than the corresponding human frequency-difference limen (Klinge and Klump 2009, https://doi.org/10.1121/1.3021315). Finally, the gerbil intensity-difference limen of 2.8 dB observed for 1-kHz pure tones is considerably larger than the 0.75 dB observed for humans in the same study (Sinnott et al. 1992). Thus, taken together, these lines of evidence indicate that our conclusions regarding the potential use of excitation patterns are not too strong.

      I'm also somewhat concerned that the masking noise used in the present study was too low in level to mask cochlear distortion products. Based on their excitation pattern modelling, the authors state (without citation) that "since the level of excitation produced by the pink noise is less than 30 dB below that produced by the complex tones, distortion products will be masked." The basis for this claim is not clear. In human, distortion products may be only ~20 dB below the levels of the primaries (referenced to an external sound masker / canceller, which is appropriate, assuming that the modelling reported in the present paper did not include middle-ear effects; see Norman-Haignere and McDermott, 2016, doi: 10.1016/j.neuroimage.2016.01.050). Oxenham et al. (2009, doi: 10.1121/1.3089220) provide further cautionary evidence on the potential use of distortion product cues when the background noise level is too low (in their case the relative level of the noise in the compromised condition was only a little below that used in the present study). The masking level used in the present study may have been sufficient, but it would be useful to have some further reassurance on this point.

      In the Methods section, we provide the citation for estimating the size of the distortion products and the estimated signal-to-noise ratio, making the basis for our estimates clear.

      We consulted Oxenham et al. (2009, doi: 10.1121/1.3089220), who suggested that distortion products may have been used by human subjects. However, in Fig. 1 of their paper, they convincingly demonstrate that even humans, who have narrower auditory filters than gerbils, cannot use spectral cues to evaluate the frequency shift in harmonic complex tones. We are confident that the same limitation applies to gerbils, which have wider auditory filters than humans and a lower ability to use spectral cues, as indicated by their higher frequency-difference limens and intensity-difference limens.

      (2) The synapse reductions in the high ouabain and old groups were relatively small (mean of 19 synapses per hair cell compared to 23 in the young untreated group). In contrast, in some mouse models of the effects of noise exposure or age, a 50% reduction in synapses is observed, and in the human temporal bone study of Wu et al. (2021, https://doi.org/10.1523/JNEUROSCI.3238-20.2021) the age-related reduction in auditory nerve fibres was ~50% or greater for the highest age group across cochlear location. It could be simply that the synapse loss in the present study was too small to produce significant behavioral effects. Hence, although the authors provide evidence that in the gerbil model the age-related behavioral effects are not due to synaptopathy, this may not translate to other species (including human).
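
      The size of the synapse loss quoted above can be put on the same percentage scale as the roughly 50% losses cited for other models. The Python sketch below uses only the two group means given in the comment (23 and 19 synapses per hair cell); the function name is illustrative.

```python
def percent_loss(control_mean, treated_mean):
    """Relative reduction of the treated mean with respect to the control mean."""
    if control_mean <= 0:
        raise ValueError("control mean must be positive")
    return 100.0 * (control_mean - treated_mean) / control_mean


# Mean synapses per hair cell: 23 (young untreated) vs. 19 (high ouabain and old).
loss = percent_loss(23, 19)
print(f"Synapse loss: {loss:.1f}%")
```

      On these means the loss is a little over 17%, about one third of the 50% reduction reported in some mouse and human studies.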

      (3) The study was not pre-registered, and there was no a priori power calculation, so there is less confidence in replicability than could have been the case. Only three old animals were used in the behavioral study, which raises concerns about the reliability of comparisons involving this group.

      Reviewer #3 (Public review):

      This study is a part of the ongoing series of rigorous work from this group exploring neural coding deficits in the auditory nerve, and dissociating the effects of cochlear synaptopathy from other age-related deficits. They have previously shown no evidence of phase-locking deficits in the remaining auditory nerve fibers in quiet-aged gerbils. Here, they study the effects of aging on the perception and neural coding of temporal fine structure cues in the same Mongolian gerbil model.

      They measure TFS coding in the auditory nerve using the TFS1 task which uses a combination of harmonic and tone-shifted inharmonic tones which differ primarily in their TFS cues (and not the envelope). They then follow this up with a behavioral paradigm using the TFS1 task in these gerbils. They test young normal hearing gerbils, aged gerbils, and young gerbils with cochlear synaptopathy induced using the neurotoxin ouabain to mimic synapse losses seen with age.

      In the behavioral paradigm, they find that aging is associated with decreased performance compared to the young gerbils, whereas young gerbils with similar levels of synapse loss do not show these deficits. When looking at the auditory nerve responses, they find no differences in neural coding of TFS cues across any of the groups. However, aged gerbils show an increase in the representation of periodicity envelope cues (around f0) compared to young gerbils or those with induced synapse loss. The authors hence conclude that synapse loss by itself doesn't seem to be important for distinguishing TFS cues, and rather the behavioral deficits with age are likely having to do with the misrepresented envelope cues instead.

      The manuscript is well written, and the data presented are robust. Some of the points below will need to be considered while interpreting the results of the study, in its current form. These considerations are addressable if deemed necessary, with some additional analysis in future versions of the manuscript.

      Spontaneous rates - Figure S2 shows no differences in median spontaneous rates across groups. But taking the median glosses over some of the nuances there. Ouabain (in the Bourien study) famously affects low spont rates first, and at a higher degree than median or high spont rates. It seems to be the case (qualitatively) in figure S2 as well, with almost no units in the low spont region in the ouabain group, compared to the other groups. Looking at distributions within each spont rate category and comparing differences across the groups might reveal some of the underlying causes for these changes. Given that overall, the study reports that low-SR fibers had a higher ENV/TFS log-z-ratio, the distribution of these fibers across groups may reveal specific effects of TFS coding by group.

      [Update: The revised manuscript has addressed these issues]

      Threshold shifts - It is unclear from the current version if the older gerbils have changes in hearing thresholds, and whether those changes may be affecting behavioral thresholds. The behavioral stimuli appear to have been presented at a fixed sound level for both young and aged gerbils, similar to the single unit recordings. Hence, age-related differences in behavior may have been due to changes in relative sensation level. Approaches such as using hearing thresholds as covariates in the analysis will help explore if older gerbils still show behavioral deficits.

      [Update: The issue of threshold shifts with aging gerbils is still unresolved in my opinion. From the revised manuscript, it appears that aged gerbils have a 36 dB shift in thresholds. While the revised manuscript provides convincing evidence that these threshold shifts do not affect the auditory nerve tuning properties, the behavioral paradigm was still presented at the same sound level for young and aged animals. But a potential 36 dB change in sensation level may affect behavioral results. The authors may consider adding thresholds as covariates in analyses or present any evidence that behavioral thresholds are plateaued along that 30 dB range].
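
      The sensation-level concern can be sketched numerically. The presentation level (68 dB SPL) and the 36 dB threshold shift are taken from the text; the young-animal threshold of 10 dB SPL is a hypothetical placeholder, since no absolute behavioral thresholds are given, and the function name is illustrative.

```python
def sensation_level(presentation_db_spl, threshold_db_spl):
    """Level of the stimulus above the listener's own detection threshold (dB SL)."""
    return presentation_db_spl - threshold_db_spl


PRESENTATION_LEVEL = 68.0  # dB SPL, fixed for all groups (from the text)
THRESHOLD_SHIFT = 36.0     # dB, aged relative to young (from the text)
young_threshold = 10.0     # dB SPL, hypothetical value for illustration only

young_sl = sensation_level(PRESENTATION_LEVEL, young_threshold)
aged_sl = sensation_level(PRESENTATION_LEVEL, young_threshold + THRESHOLD_SHIFT)
print(f"young: {young_sl:.0f} dB SL, aged: {aged_sl:.0f} dB SL")
```

      Whatever the true young threshold is, the difference in sensation level equals the threshold shift. This simple subtraction ignores the masking noise and assumes that the 36 dB shift applies to behavioral thresholds.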

      Since we do not have behavioural detection thresholds from our individual animals, only CAP thresholds that represent the auditory-nerve data and cannot be translated to behavioural thresholds directly, we want to refrain from using these indirect measures as covariates in the present analysis. In addition, the study by Hamann et al. (2002, https://doi.org/10.1016/S0378-5955(02)00454-9) indicates that age-related behavioural threshold increases are smaller than threshold increases obtained from auditory brainstem response measurements. Finally, statistical analyses on very small samples can be unreliable due to problems of power, generalisability, and susceptibility to outliers.

      Moore and Sek (2009), in their paper on the TFS1 test, pointed out that the effect of signal level on the TFS1 threshold in normal-hearing human subjects was small when the signal-to-noise ratio between the broadband masking noise and the complex tone was kept constant. Furthermore, the masking noise will raise the thresholds of normal-hearing gerbils and of old gerbils with an audibility threshold increase to about the same signal-to-noise ratio. Thus, as long as the signal remains audible to the behaviourally tested gerbil, which can be expected at an overall signal level of 68 dB SPL, we expect little effect of raised audibility thresholds on the TFS1 threshold. The lack of temporal processing deficits in the auditory-nerve fibers of old, mildly hearing-impaired gerbils compared to those in normal-hearing young adult gerbils further strengthens this argument.

      Task learning in aged gerbils - It is unclear if the aged gerbils really learn the task well in two of the three TFS1 test conditions. The d' of 1, which is usually used as the criterion for learning, was not reached by the aged gerbils even at the easiest stimulus level in all but one test condition (Fig. 5H), and in that condition there do not seem to be any age-related deficits in behavioral performance (Fig. 6B). Hence, dissociating the inability to learn the task from the inability to perceive TFS1 cues in those animals becomes challenging.
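
      For readers unfamiliar with the d' = 1 criterion, the Python sketch below shows the standard signal-detection definition, d' = z(hit rate) - z(false-alarm rate). It assumes a yes/no (go/no-go) design; the source does not state how d' was computed in the study, and the rates used here are hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) minus z(false-alarm rate)."""
    for rate in (hit_rate, false_alarm_rate):
        if not 0.0 < rate < 1.0:
            raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical response rates, for illustration only.
chance = d_prime(0.5, 0.5)          # responding equally often on both trial types
above_criterion = d_prime(0.8, 0.2)
print(f"chance: {chance:.2f}, example animal: {above_criterion:.2f}")
```

      An animal that responds equally often on target and reference trials has d' = 0, whereas the hypothetical rates of 0.8 and 0.2 give a d' above the criterion of 1.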

      [Update: The revised manuscript sufficiently addresses these issues, with the caveat of hearing threshold changes affecting behavioral thresholds mentioned above].

      As we argued above, an audibility threshold increase in the old gerbils is unlikely to explain the raised TFS1 thresholds in the old gerbils.

      Increased representation of periodicity envelope in the AN - the mechanisms for the increased representation of periodicity envelope cues are unclear. The authors point to some potential central mechanisms, but given that these are recordings from the auditory nerve, what central mechanisms these may be is unclear. If the authors are suggesting some form of efferent modulation only at the f0 frequency, no evidence for this is presented. It appears more likely that the enhancement may be due to outer hair cell dysfunction (widened tuning, distorted tonotopy). Given this increased envelope coding, the potential change in sensation level for the behavior (from the comment above), and no change in neural coding of TFS cues across any of the groups, a simpler interpretation may be that TFS coding is not affected in remaining auditory nerve fibers after age-related or ouabain-induced synapse loss, but behavioral performance is affected by outer hair cell dysfunction with age.

      [Update: The revised manuscript has addressed these issues]

      Emerging evidence seems to suggest that cochlear synaptopathy and/or TFS encoding abilities might be reflected in listening effort rather than behavioral performance. Measuring some proxy of listening effort in these gerbils (like reaction time) to see if that has changed with synapse loss, especially in the young animals with induced synaptopathy, would make an interesting addition to explore perceptual deficits of TFS coding with synapse loss.

      [Update: The revised manuscript has addressed these issues]

      Reviewer #3 (Recommendations for the authors):

      Thank you for your revisions. They largely address most of my initial concerns. The issue of threshold shifts potentially affecting behavioral thresholds still remains unresolved in my opinion. The new data about unaltered tuning curves is convincing that the auditory nerve fiber recordings are unaffected by threshold shifts. But am I correct in my understanding that the threshold shift with age was 36 dB relative to the young (L168)? If so, wouldn't the fact that behavior was performed at 68 dB SPL regardless of group affect the behavioral thresholds with age? Is there any additional evidence that suggests that behavioral performance plateaus along that ~30dB range that the authors could include to strengthen this claim?

      In our response above to reviewer #3 and to reviewer #2 we provided additional arguments why we think that an audibility threshold increase in old gerbils cannot explain their compromised TFS1 thresholds.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:  

      The authors investigate the effects of aging on auditory system performance in understanding temporal fine structure (TFS), using both behavioral assessments and physiological recordings from the auditory periphery, specifically at the level of the auditory nerve. This dual approach aims to enhance understanding of the mechanisms underlying observed behavioral outcomes. The results indicate that aged animals exhibit deficits in behavioral tasks for distinguishing between harmonic and inharmonic sounds, which is a standard test for TFS coding. However, neural responses at the auditory nerve level do not show significant differences when compared to those in young, normal-hearing animals. The authors suggest that these behavioral deficits in aged animals are likely attributable to dysfunctions in the central auditory system, potentially as a consequence of aging. To further investigate this hypothesis, the study includes an animal group with selective synaptic loss between inner hair cells and auditory nerve fibers, a condition known as cochlear synaptopathy (CS). CS is a pathology associated with aging and is thought to be an early indicator of hearing impairment. Interestingly, animals with selective CS showed physiological and behavioral TFS coding similar to that of the young normal-hearing group, contrasting with the aged group's deficits. Despite histological evidence of significant synaptic loss in the CS group, the study concludes that CS does not appear to affect TFS coding, either behaviorally or physiologically.

      We agree with the reviewer’s summary.

      Strengths:  

      This study addresses a critical health concern, enhancing our understanding of mechanisms underlying age-related difficulties in speech intelligibility, even when audiometric thresholds are within normal limits. A major strength of this work is the comprehensive approach, integrating behavioral assessments, auditory nerve (AN) physiology, and histology within the same animal subjects. This approach enhances understanding of the mechanisms underlying the behavioral outcomes and provides confidence in the actual occurrence of synapse loss and its effects. The study carefully manages controlled conditions by including five distinct groups: young normal-hearing animals, aged animals, animals with CS induced through low and high doses, and a sham surgery group. This careful setup strengthens the study's reliability and allows for meaningful comparisons across conditions. Overall, the manuscript is well-structured, with clear and accessible writing that facilitates comprehension of complex concepts.

      Weaknesses:

      The stimulus and task employed in this study are very helpful for behavioral research, and using the same stimulus setup for physiology is advantageous for mechanistic comparisons. However, I have some concerns about the limitations in auditory nerve (AN) physiology. Due to practical constraints, it is not feasible to record from a large enough population of fibers that covers a full range of best frequencies (BFs) and spontaneous rates (SRs) within each animal. This raises questions about how representative the physiological data are for understanding the mechanism in behavioral data. I am curious about the authors' interpretation of how this stimulus setup might influence results compared to methods used by Kale and Heinz (2010), who adjusted harmonic frequencies based on the characteristic frequency (CF) of recorded units. In contrast, the harmonic frequencies in this study are fixed across all CFs, meaning that many AN fibers may not be tuned closely to the stimulus frequencies, which adds sampling variability to the results. If units are not responsive to the stimulus, further clarification on detecting mistuning and phase locking to TFS effects within this setup would be valuable.

      We chose the stimuli for the AN recordings to be identical to the stimuli used in the behavioral evaluation of the perceptual sensitivity. Only with this approach can we directly compare the response of the population of AN fibers with perception measured in behavior.

      The stimuli are complex, i.e., they comprise many frequency components, and were presented at 68 dB SPL. Thus, the stimuli excite a given fiber within a large portion of the fiber’s receptive field. Furthermore, during recordings, we verified by audiovisual control that fibers responded to the stimuli. Otherwise, it would have cost valuable recording time to record from a nonresponsive AN fiber.

      Given the limited number of units per condition (sometimes as few as three for certain conditions), I wonder if CF-dependent variability might impact the results of the AN data in this study; discussing this factor could help readers better understand the results. While the use of the same stimuli for both behavioral and physiological recordings is understandable, a discussion on how this choice affects interpretation would be beneficial. In addition, a 60 dB stimulus could saturate high spontaneous rate (HSR) AN fibers, influencing neural coding and phase-locking to TFS. Separating SR groups could help address these issues and improve interpretive clarity.

      A deeper discussion on the role of fiber spontaneous rate could also enhance the study. How might considering SR groups affect AN results related to TFS coding? While some statistical measures are included in the supplement, a more detailed discussion in the main text could help in interpretation.

      We do not think that it will be necessary to conduct any statistical analysis in addition to that already reported in the supplement.

      We considered moving some supplementary information back into the main manuscript but decided against it. Our single-unit sample was not sufficient, i.e. not all subpopulations of auditory-nerve fibers were sufficiently sampled for all animal treatment groups, to conclusively resolve every aspect that may be interesting to explore. The power of our approach lies in the direct linkage of several levels of investigation – cochlear synaptic morphology, single-unit representation and behavioral performance – and, in the main manuscript, we focus on the core question of synaptopathy and its relation to temporal fine structure perception. This is now spelled out clearly in lines 197 - 203 of the main manuscript.  

      Although Figure S2 indicates no change in median SR, the high-dose treatment group lacks LSR fibers, suggesting a different distribution based on SR for different animal groups, as seen in similar studies on other species. A histogram of these results would be informative, as LSR fiber loss with CS, whether induced by ouabain in gerbils or noise in other animals, is well documented (e.g., Furman et al., 2013).

      Figure S2 was revised to avoid overlap of data points and show the distributions more clearly. Furthermore, the sample sizes for LSR and HSR fibers are now provided separately.

      Although ouabain effects on gerbils have been explored in previous studies, these data already seem to have been recorded for the animals in this study, so a brief description of changes in auditory brainstem response (ABR) thresholds, wave 1 amplitudes, and tuning curves for animals with cochlear synaptopathy (CS) would be beneficial. This would confirm that ouabain selectively affects synapses without impacting outer hair cells (OHCs). For aged animals, since ABR measurements were taken, comparing hearing differences between normal and aged groups could provide insights into the pathologies besides CS in aged animals. Additionally, examining subject variability in treatment effects on hearing and how this correlates with behavior and physiology would yield valuable insights. If space is limited, a brief clarification or inclusion in the supplement may be sufficient.

      We thank the reviewer for this constructive suggestion. The requested data were added in a new section of the Results, entitled “Threshold sensitivity and frequency tuning were not affected by the synapse loss.” (lines 150 – 174). Our young-adult, ouabain-treated gerbils showed no significant elevations of CAP thresholds and their neural tuning was normal. Old gerbils showed the typical threshold losses for individuals of comparable age, and normal neural tuning, confirming previous reports. Thus, there was no evidence for relevant OHC impairments in any of our animal groups.   

      Another suggestion is to discuss the potential role of MOC efferent system and effect of anesthesia in reducing efferent effects in AN recordings. This is particularly relevant for aged animals, as CS might affect LSR fibers, potentially disrupting the medial olivocochlear (MOC) efferent pathway. Anesthesia could lessen MOC activity in both young and aged animals, potentially masking efferent effects that might be present in behavioral tasks. Young gerbils with functional efferent systems might perform better behaviorally, while aged gerbils with impaired MOC function due to CS might lack this advantage. A brief discussion on this aspect could potentially enhance mechanistic insights.  

      Thank you for this suggestion. The potential role of olivocochlear efferents is now discussed in lines 597 - 613.

      Lastly, although synapse counts did not differ between the low-dose treatment and NH I sham groups, separating these groups rather than combining them with the sham might reveal differences in behavior or AN results, particularly regarding the significance of differences between aged/treatment groups and the young normal-hearing group.  

      For maximizing statistical power, we combined those groups in the statistical analysis. These two groups did not differ in synapse number, threshold sensitivity or neural tuning bandwidths.

      Reviewer #2 (Public review):

      Summary:  

      Using a gerbil model, the authors tested the hypothesis that loss of synapses between sensory hair cells and auditory nerve fibers (which may occur due to noise exposure or aging) affects behavioral discrimination of the rapid temporal fluctuations of sounds. In contrast to previous suggestions in the literature, their results do not support this hypothesis; young animals treated with a compound that reduces the number of synapses did not show impaired discrimination compared to controls. Additionally, their results from older animals showing impaired discrimination suggest that age-related changes aside from synaptopathy are responsible for the age-related decline in discrimination.

      We agree with the reviewer’s summary.

      Strengths: 

      (1) The rationale and hypothesis are well-motivated and clearly presented. 

      (2) The study was well conducted with strong methodology for the most part, and good experimental control. The combination of physiological and behavioral techniques is powerful and informative. Reducing synapse counts fairly directly using ouabain is a cleaner design than using noise exposure or age (as in other studies), since these latter modifiers have additional effects on auditory function. 

      (3) The study may have a considerable impact on the field. The findings could have important implications for our understanding of cochlear synaptopathy, one of the most highly researched and potentially impactful developments in hearing science in the past fifteen years.  

      Weaknesses: 

      (1) My main concern is that the stimuli may not have been appropriate for assessing neural temporal coding behaviorally. Human studies using the same task employed a filter center frequency that was (at least) 11 times the fundamental frequency (Marmel et al., 2015; Moore and Sek, 2009). Moore and Sek wrote: "the default (recommended) value of the centre frequency is 11F0." Here, the center frequency was only 4 or 8 times the fundamental frequency (4F0 or 8F0). Hence, relative to harmonic frequency, the harmonic spacing was considerably greater in the present study. By my calculations, the masking noise used in the present study was also considerably lower in level relative to the harmonic complex than that used in the human studies. These factors may have allowed the animals to perform the task using cues based on the pattern of activity across the neural array (excitation pattern cues), rather than cues related to temporal neural coding. The authors show that mean neural driven rate did not change with frequency shift, but I don't understand the relevance of this. It is the change in response of individual fibers with characteristic frequencies near the lowest audible harmonic that is important here.  

      The auditory filter bandwidth of the gerbil is about double that of human subjects. Because of this, the masking noise has a larger overall level within the filter than in the human studies, which precludes the use of distortion products. The larger auditory filter bandwidth also prevents the gerbils from using excitation patterns, especially in the condition with a center frequency of 1600 Hz and a fundamental of 200 Hz and in the condition with a center frequency of 3200 Hz and a fundamental of 400 Hz. In the condition with a center frequency of 1600 Hz and a fundamental of 400 Hz, it is possible that excitation patterns are exploited. We have now added modeling of the excitation patterns, and a new figure showing their change at the gerbils’ perception threshold, in the discussion of the revised version (lines 440 - 446 and Fig. 8).

      The case against excitation pattern cues needs to be better made in the Discussion. It could be that gerbil frequency selectivity is broad enough for this not to be an issue, but more detail needs to be provided to make this argument. The authors should consider what is the lowest audible harmonic in each case for their stimuli, given the level of each harmonic and the level of the pink noise. Even for the 8F0 center frequency, the lowest audible harmonic may be as low as the 4th (possibly even the 3rd). In human, harmonics are thought to be resolvable by the cochlea up to at least the 8th.  

      This issue is now covered in the discussion, see response to the previous point.

      (2) The synapse reductions in the high ouabain and old groups were relatively small (mean of 19 synapses per hair cell compared to 23 in the young untreated group). In contrast, in some mouse models of the effects of noise exposure or age, a 50% reduction in synapses is observed, and in the human temporal bone study of Wu et al. (2021, https://doi.org/10.1523/JNEUROSCI.3238-20.2021) the age-related reduction in auditory nerve fibres was ~50% or greater for the highest age group across cochlear location. It could be simply that the synapse loss in the present study was too small to produce significant behavioral effects. Hence, although the authors provide evidence that in the gerbil model the age-related behavioral effects are not due to synaptopathy, this may not translate to other species (including human). This should be discussed in the manuscript. 

      We agree that our results apply to moderate synaptopathy, which predominantly characterizes early stages of hearing loss or aged individuals without confounding noise-induced cochlear damage. This is now discussed in lines 486 – 498.

      It would be informative to provide synapse counts separately for the animals who were tested behaviorally, to confirm that the pattern of loss across the group was the same as for the larger sample.  

      Yes, the pattern was the same for the subgroup of behaviorally tested animals. We have added this information to the revised version of the manuscript (lines 137 – 141).

      (3) The study was not pre-registered, and there was no a priori power calculation, so there is less confidence in replicability than could have been the case. Only three old animals were used in the behavioral study, which raises concerns about the reliability of comparisons involving this group.  

      The results for the three old subjects differed significantly from those of young subjects and young ouabain-treated subjects. This indicates a sufficient statistical power, since otherwise no significant differences would be observed.

      Reviewer #3 (Public review):

      This study is a part of the ongoing series of rigorous work from this group exploring neural coding deficits in the auditory nerve, and dissociating the effects of cochlear synaptopathy from other age-related deficits. They have previously shown no evidence of phase-locking deficits in the remaining auditory nerve fibers in quiet-aged gerbils. Here, they study the effects of aging on the perception and neural coding of temporal fine structure cues in the same Mongolian gerbil model.

      They measure TFS coding in the auditory nerve using the TFS1 task which uses a combination of harmonic and tone-shifted inharmonic tones which differ primarily in their TFS cues (and not the envelope). They then follow this up with a behavioral paradigm using the TFS1 task in these gerbils. They test young normal hearing gerbils, aged gerbils, and young gerbils with cochlear synaptopathy induced using the neurotoxin ouabain to mimic synapse losses seen with age. 

      In the behavioral paradigm, they find that aging is associated with decreased performance compared to the young gerbils, whereas young gerbils with similar levels of synapse loss do not show these deficits. When looking at the auditory nerve responses, they find no differences in neural coding of TFS cues across any of the groups. However, aged gerbils show an increase in the representation of periodicity envelope cues (around f0) compared to young gerbils or those with induced synapse loss. The authors hence conclude that synapse loss by itself doesn't seem to be important for distinguishing TFS cues, and rather the behavioral deficits with age are likely having to do with the misrepresented envelope cues instead.  

      We agree with the reviewer’s summary.

      The manuscript is well written, and the data presented are robust. Some of the points below will need to be considered while interpreting the results of the study, in its current form. These considerations are addressable if deemed necessary, with some additional analysis in future versions of the manuscript. 

      Spontaneous rates - Figure S2 shows no differences in median spontaneous rates across groups. But taking the median glosses over some of the nuances there. Ouabain (in the Bourien study) famously affects low spont rates first, and at a higher degree than median or high spont rates. It seems to be the case (qualitatively) in Figure S2 as well, with almost no units in the low spont region in the ouabain group, compared to the other groups. Looking at distributions within each spont rate category and comparing differences across the groups might reveal some of the underlying causes for these changes. Given that overall, the study reports that low-SR fibers had a higher ENV/TFS log-zratio, the distribution of these fibers across groups may reveal specific effects of TFS coding by group.  

      As the reviewer points out, our sample from the group treated with a high concentration of ouabain showed very few low-spontaneous-rate auditory-nerve fibers, as expected from previous work. However, this was also true, e.g., for our sample from sham-operated animals, and may thus well reflect a sampling bias. We are therefore reluctant to attach much significance to these data distributions. We now point out more clearly the limitations of our auditory-nerve sample for the exploration of  interesting questions beyond our core research aim (see also response to Reviewer 1 above).  

      Threshold shifts - It is unclear from the current version if the older gerbils have changes in hearing thresholds, and whether those changes may be affecting behavioral thresholds. The behavioral stimuli appear to have been presented at a fixed sound level for both young and aged gerbils, similar to the single unit recordings. Hence, age-related differences in behavior may have been due to changes in relative sensation level. Approaches such as using hearing thresholds as covariates in the analysis will help explore if older gerbils still show behavioral deficits.  

      Unfortunately, we did not obtain behavioral thresholds that could be used here. We want to point out that the TFS 1 stimuli had an overall level of 68 dB SPL, and the pink noise masker would have increased the threshold more than expected from the moderate, age-related hearing loss in quiet. Thus, the masked thresholds for all gerbil groups are likely similar and should have no effect on the behavioral results.

      Task learning in aged gerbils - It is unclear if the aged gerbils really learn the task well in two of the three TFS1 test conditions. The d' of 1, which is usually used as the criterion for learning, was not reached by the aged gerbils, even for the easiest stimuli, in all but one condition (Fig. 5H), and in that condition there do not seem to be any age-related deficits in behavioral performance (Fig. 6B). Hence, dissociating the inability to learn the task from the inability to perceive TFS1 cues in those animals becomes challenging.

      Even in the group of gerbils with the lowest sensitivity, the animals achieved an average d’ above 1 in the 400/1600 condition. Furthermore, stimuli were well above threshold and audible, even when no discrimination could be observed. Finally, as explained in the methods, different stimulus conditions were interleaved in each session, so that stimuli that were easy to discriminate were presented together with those that were difficult to discriminate. This approach ensures that the gerbils were under stimulus control, meaning properly trained to perform the task. Thus, an inability to discriminate does not indicate a lack of proper training.
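
      For readers less familiar with the sensitivity index, d' is conventionally computed from hit and false-alarm rates as z(hit) - z(false alarm). The sketch below illustrates only that standard formula; the function name and the example rates are hypothetical and are not data from the study.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Standard signal-detection sensitivity index: z(hit) - z(false alarm).

    Rates must lie strictly between 0 and 1; rates of exactly 0 or 1
    have no finite z-score and make inv_cdf raise an error.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical rates, chosen only for illustration (not data from the study).
example = d_prime(hit_rate=0.80, false_alarm_rate=0.20)
print(f"d' = {example:.2f}")
```

      With symmetric rates such as these, d' is simply twice the z-score of the hit rate, so a criterion of d' = 1 corresponds to a hit rate of roughly 69% paired with a false-alarm rate of roughly 31%.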

      Increased representation of periodicity envelope in the AN - the mechanisms for increased representation of periodicity envelope cues are unclear. The authors point to some potential central mechanisms, but given that these are recordings from the auditory nerve, it is unclear what central mechanisms these may be. If the authors are suggesting some form of efferent modulation only at the f0 frequency, no evidence for this is presented. It appears more likely that the enhancement may be due to outer hair cell dysfunction (widened tuning, distorted tonotopy). Given this increased envelope coding, the potential change in sensation level for the behavior (from the comment above), and no change in neural coding of TFS cues across any of the groups, a simpler interpretation may be that TFS coding is not affected in remaining auditory nerve fibers after age-related or ouabain-induced synapse loss, but behavioral performance is affected by outer hair cell dysfunction with age.

      A similar point was made by Reviewer #1. As indicated above, new data on threshold sensitivity and neural tuning were added in a new section of the Results which indirectly suggest that significant OHC pathologies were not a concern, neither in our young-adult, synaptopathic gerbils nor in the old gerbils.  

      Emerging evidence seems to suggest that cochlear synaptopathy and/or TFS encoding abilities might be reflected in listening effort rather than behavioral performance. Measuring some proxy of listening effort in these gerbils (like reaction time) to see if that has changed with synapse loss, especially in the young animals with induced synaptopathy, would make an interesting addition to explore perceptual deficits of TFS coding with synapse loss.  

      This is an interesting suggestion that we now explore in the revision of the manuscript. Reaction times can be used as a proxy for listening effort and were recorded for all responses. The new analysis, now reported in lines 378 - 396, compared young-adult control gerbils with young-adult gerbils that had been treated with the high concentration of ouabain. No difference in response latencies was found, indicating that listening effort did not change with synapse loss.

      Reviewer #1 (Recommendations for the authors): 

      Figure 2: The y-axis labeled as "Frequency" is potentially misleading since there are additional frequency values on the right side of the panels. It would be helpful to clarify more in the caption what these right-side frequency values represent. Additionally, the legend could be positioned more effectively for clarity.

      Thank you for your suggestion. The axis label was rephrased.

      Figure 7: This figure is a bit unclear, as it appears to show two sets of gerbil data at 1500 Hz, yet the difference between them is not explained.  

      We added the following text to the figure legend: “The higher and lower thresholds shown for the gerbil data reflect thresholds at fc of 1600 Hz for fundamentals f0 of 200 Hz and 400 Hz, respectively.”

      Maybe a short description of fmax that is used in Figure 4 could help or at least point to supplementary for finding the definition.  

      We thank the reviewer for pointing out this typo/inaccuracy. The correct terminology in line with the remainder of the manuscript is “fmaxpeak”. We corrected the caption of figure 5 (previously figure 4) and added the reference pointing to figure 11 (previously figure 9), which explains the terms.

      I couldn't find information about the possible availability of data. 

      The auditory-nerve recordings reported in this paper are part of a larger study of single-unit auditory-nerve responses in gerbils, formally described and published by Heeringa (2024) Single-unit data for sensory neuroscience: Responses from the auditory nerve of young-adult and aging gerbils. Scientific Data 11:411, https://doi.org/10.1038/s41597-024-03259-3. As soon as the Version of Record is submitted, the raw single-unit data can be accessed directly through the following link: https://doi.org/10.5061/dryad.qv9s4mwn4. The data that are presented in the figures of the present manuscript and were statistically analyzed are uploaded to the Zenodo repository (https://doi.org/10.5281/zenodo.15546625).

      Reviewer #2 (Recommendations for the authors): 

      L22. The term "hidden hearing loss" is used in many different ways in the literature, from being synonymous with cochlear synaptopathy, to being a description of any listening difficulties that are not accounted for by the audiogram (for which there are many other / older terms). The original usage was much more narrow than your definition here. It is not correct that Schaette and McAlpine defined HHL in the broad sense, as you imply. I suggest you avoid the term to prevent further confusion.  

      We eliminated the term hidden hearing loss.

      L43. SNHL is undefined.

      Thank you for catching that. The term is now spelled out.

      L64. "whether" -> "that"  

      We corrected this issue.

      L102. It would be informative to see the synapse counts (across groups) for the animals tested in the behavioral part of the study. Did these vary between groups in the same way?  

      Yes, the pattern was the same for the subgroup of behaviorally tested animals. We have added this information to the revised version of the manuscript (lines 137 – 141).

      L108. How many tests were considered in the Bonferroni correction? Did this cover all reported tests in the paper?  

      The comparisons of synapse numbers between treatment groups were done with full Bonferroni correction, as in the other tests involving posthoc pair-wise comparisons after an ANOVA.

      Figure 1 and 6 captions. Explain meaning of * and ** (criteria values).  

      The information was added to the figure legends of now Figs. 1 and 7. 

      L139. I don't follow the argument - the mean driven rate is not important. It is the rate at individual CFs and how that changes with frequency shift that provides the cue.

      L142. I don't follow - individual driven rates might have been a cue (some going up, some down, as frequency was shifted).  

      Yes, theoretically it is possible that the spectral pattern of driven rates (i.e., the excitation pattern) can be used for profile analysis and thus serve as a strong cue for discriminating the TFS1 stimuli. To shed some light on this question for the actual stimuli used in this study, we added a comprehensive figure showing simulated excitation patterns (figure 8). The excitation patterns were generated with a gammatone filter bank and auditory filter bandwidths appropriate for gerbils (Kittel et al. 2002). The simulated excitation patterns allow us to draw at least semi-quantitative conclusions about the possibility of profile analysis: 1. In the 200/1600 Hz and 400/3200 Hz conditions (i.e., harmonic number of fc is 8), the difference between all inharmonic excitation patterns and the harmonic reference excitation pattern is far below the threshold for intensity discrimination (Sinnott et al. 1992). 2. In the same conditions, the statistics of the pink noise make excitation pattern differences at or beyond the filter slopes (at both the high- and low-frequency limits) useless for frequency shift discrimination. 3. In the 400/1600 Hz condition (i.e., harmonic number of fc is 4), there is a non-negligible possibility that excitation pattern differences were a main cue for discrimination. All of these conclusions are compatible with the results of our study.

      L193. Is this p-value Bonferroni corrected across the whole study? If not, the finding could well be spurious given the number of tests reported.  

      Yes, it is Bonferroni corrected.

      L330. TFS is already defined.  

      L346. AN is already defined.  

      L408. "temporal fine structure" -> "TFS"  

      It was a deliberate decision to define these terms again in the Discussion, for readers who prefer to skip most of the detailed Results. 

      L364-366. This argument is somewhat misleading. Cochlear resolvability largely depends on the harmonic spacing (i.e., F0) relative to harmonic frequency (in other words, on harmonic rank). Marmel et al. (2015) and Moore and Sek (2009) used a center frequency (at least) 11 times F0. Here, the center frequency was only 4 or 8 times F0. In human, this would not be sufficient to eliminate excitation pattern cues.  

      We have now included results from modeling the excitation patterns in the discussion with a new figure demonstrating that at a center frequency of 8 times F0, excitation patterns provide no useful cue, while this is a possibility at a center frequency of 4 times F0 (Fig. 8, lines 440 - 446).

      L541. Was that a spectrum level of 20 dB SPL (level per 1-Hz wide band) at 1 kHz? Need to clarify.  

      The power spectral density of the pink noise at 1 kHz (i.e., the level in a 1 Hz wide band centered at 1 kHz) was 13.3 dB SPL. The total level of the pink noise (including edge filters at 100 Hz and 11 kHz) was 50 dB SPL.
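
      The two levels quoted above can be cross-checked under a simplifying assumption: an ideal 1/f power spectrum with brick-wall band edges at 100 Hz and 11 kHz (the actual edge filters are not described here, so this is an idealization). Integrating S(f) = S(1 kHz) · (1 kHz / f) over the band gives a total power of S(1 kHz) · 1000 Hz · ln(f_hi / f_lo). The function name below is hypothetical.

```python
import math


def pink_noise_spectrum_level(total_level_db: float, f_lo: float, f_hi: float,
                              f_ref: float = 1000.0) -> float:
    """Level in a 1-Hz band at f_ref for ideal pink noise of a given total level.

    Assumes a power density S(f) = S(f_ref) * f_ref / f between f_lo and f_hi
    and zero outside (brick-wall edges), so the total power equals
    S(f_ref) * f_ref * ln(f_hi / f_lo).
    """
    bandwidth_term_db = 10.0 * math.log10(f_ref * math.log(f_hi / f_lo))
    return total_level_db - bandwidth_term_db


level_1k = pink_noise_spectrum_level(50.0, f_lo=100.0, f_hi=11000.0)
print(f"{level_1k:.1f} dB SPL in a 1-Hz band at 1 kHz")  # prints 13.3
```

      Under these assumptions, a total level of 50 dB SPL corresponds to about 13.3 dB SPL per Hz at 1 kHz, in agreement with the value stated in the response.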

      L919. So was the correction applied across only the tests within each ANOVA? Don't you need to control the study-wise error rate (across all primary tests) to avoid spurious findings?  

      We added information about the family-wise error rate (line 1077 - 1078). Since the ANOVAs tested different specific research questions, we do not think that we need to control the study-wise error rate.
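
      To make the distinction between the two error rates concrete: a Bonferroni correction multiplies each p-value by the number of tests in the family, so the choice of family (the pairwise comparisons following one ANOVA versus all primary tests in a study) determines how strict the correction is. The p-values and the study-wide test count below are hypothetical and serve only as an illustration.

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-adjust p-values: multiply by the family size and cap at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Hypothetical post-hoc p-values from one ANOVA with three pairwise comparisons.
posthoc = [0.004, 0.020, 0.300]

per_anova = bonferroni(posthoc)               # family = the 3 comparisons
study_wide = bonferroni(posthoc, n_tests=30)  # family = 30 tests (assumed count)

print([round(p, 3) for p in per_anova])
print([round(p, 3) for p in study_wide])
```

      In this made-up case the smallest p-value stays below 0.05 with the per-ANOVA family but not with the study-wide family, which is the trade-off between the two positions expressed above.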

      Reviewer #3 (Recommendations for the authors): 

      There was no difference in TFS sensitivity in the AN fiber activity across all the groups. Potential deficits with age were only found in the behavioral paradigm. Given that, it might make it clearer to specify that the deficits or lack thereof are in behavior, in multiple instances in the manuscript where it says synaptopathy showed no decline in TFS sensitivity (for example, lines 342-344).

      We carefully went through the entire text and clarified a couple more instances.

      L353 - this statement is a bit too strong. It implies causality when there is only a co-occurrence of increased f0 representation and age-related behavioral deficits in TFS1 task.  

      The statement was rephrased as “Thus, cue representation may be associated with the perceptual deficits, but not reduced synapse numbers, as originally proposed.”

      L465-467 - while this may be true, I think it is hard to say this with the current dataset where only AN fibers are being recorded from. I don't think we can say anything about afferent central mechanisms with this data set.  

      We agree. However, we refer here to published data on central inhibition to provide a possible explanation. 

      Hearing thresholds with ABRs are mentioned in the methods, but that data is not presented anywhere. Would be nice to see hearing thresholds across the various groups to account or discount outer hair cell dysfunction. 

      This important point was made repeatedly and we thank the Reviewers for it. As indicated above, new data on threshold sensitivity and neural tuning were added in a new section of the Results which indirectly suggest that significant OHC pathologies were not a concern, neither in our young-adult, synaptopathic gerbils nor in the old gerbils.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This valuable study presents a theoretical model of how punctuated mutations influence multistep adaptation, supported by empirical evidence from some TCGA cancer cohorts. This solid model is noteworthy for cancer researchers as it points to the case for possible punctuated evolution rather than gradual genomic change. However, the parametrization and systematic evaluation of the theoretical framework in the context of tumor evolution remain incomplete, and alternative explanations for the empirical observations are still plausible.

      We thank the editor and the reviewers for their thorough engagement with our work. The reviewers’ comments have drawn our attention to several important points that we have addressed in the updated version. We believe that these modifications have substantially improved our paper.

      There were two major themes in the reviewers’ suggestions for improvement. The first was that we should demonstrate more concretely how the results in the theoretical/stylized modelling parts of our paper quantitatively relate to dynamics in cancer.

      To this end, we have now included a comprehensive quantification of the effect sizes of our results across large and biologically-relevant parameter ranges. Specifically, following reviewer 1’s suggestion to give more prominence to the branching process, we have added two figures (Fig S3-S4) quantifying the likelihood of multi-step adaptation in a branching process for a large range of mutation rates and birth-death ratios. Formulating our results in terms of birth-death ratios also allowed us to provide better intuition regarding how our results manifest in models with constant population size vs models of growing populations. In particular, the added figure (Fig S3) highlights that the effect size of temporal clustering on the probability of successful 2-step adaptation is very sensitive to the probability that the lineage of the first mutant would go extinct if it did not acquire a second mutation. As a result, the phenomenon we describe is biologically likely to be most effective in those phases during tumor evolution in which tumor growth is constrained. This important pattern had not been described sufficiently clearly in the initial version of our manuscript, and we thank both reviewers for their suggestions to make these improvements.
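
      As a rough illustration of why the birth-death ratio matters here, the sketch below simulates a single-type linear birth-death process started from one mutant cell and compares the simulated extinction probability with the classical result min(1, d/b). It is not the model from the manuscript: second mutations are omitted, and the rates, the survival cap, and the function names are assumptions chosen for illustration.

```python
import random


def extinction_probability(birth: float, death: float) -> float:
    """Classical extinction probability of a linear birth-death process from one cell."""
    return min(1.0, death / birth)


def simulate_extinction(birth: float, death: float, trials: int = 20000,
                        survival_cap: int = 50, seed: int = 1) -> float:
    """Monte Carlo estimate of the extinction probability via the embedded jump chain.

    Each event is a birth with probability birth / (birth + death), else a death.
    A lineage that reaches survival_cap cells is counted as having escaped
    extinction, which is a good approximation only when birth > death.
    """
    rng = random.Random(seed)
    p_birth = birth / (birth + death)
    extinct = 0
    for _ in range(trials):
        n = 1  # start from a single mutant cell
        while 0 < n < survival_cap:
            n += 1 if rng.random() < p_birth else -1
        extinct += (n == 0)
    return extinct / trials


for ratio in (0.2, 0.5, 0.9):  # hypothetical death/birth ratios
    print(ratio, extinction_probability(1.0, ratio), simulate_extinction(1.0, ratio))
```

      The closer the death rate is to the birth rate, the more likely the lineage of a first mutant is to be lost before anything else happens to it, which is the regime in which the timing of a second mutation would be expected to matter most.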

      The second major theme in the reviewers’ suggestions was focused on how we relate our theoretical findings to readouts in genomic data, with both reviewers pointing to potential alternative explanations for the empirical patterns we describe.

      We have now extended our empirical analyses following some of the reviewers’ suggestions. Specifically, we have included analyses investigating how the contribution of reactive oxygen species (ROS)-related mutation signatures correlates with our proxies for multi-step adaptation; and we have included robustness checks in which we use Spearman instead of Pearson correlations. Moreover, we have included more discussion on potential confounds and the assumptions going into our empirical analyses as well as the challenges in empirically identifying the phenomena we describe.
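
      The Spearman robustness check mentioned above amounts to computing a Pearson correlation on ranks, which makes the result invariant under monotone transformations and less sensitive to extreme values. The sketch below uses only the Python standard library; the data are made up, and the variable names are hypothetical stand-ins rather than the TCGA-derived quantities.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Ranks starting at 1, with tied values given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data: a monotone relation with one extreme value on the x-axis.
signature_share = [0.01, 0.02, 0.05, 0.10, 0.20, 0.90]
adaptation_proxy = [1.0, 1.5, 2.0, 2.4, 2.6, 2.7]

print(round(pearson(signature_share, adaptation_proxy), 3))
print(round(spearman(signature_share, adaptation_proxy), 3))
```

      In this toy example the relation is perfectly monotone, so the rank-based coefficient is 1 while the Pearson coefficient is pulled down by the single extreme value; agreement between the two measures on real data is therefore a useful sign that a correlation is not driven by a few outlying samples.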

      Below, we respond in detail to the individual comments made by each reviewer.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Grasper et al. present a combined analysis of the role of temporal mutagenesis in cancer, which includes both theoretical investigation and empirical analysis of point mutations in TCGA cancer patient cohorts. They find that temporally elevated mutation rates contribute to cancer fitness by allowing fast adaptation when the fitness drops (due to previous deleterious mutations). This may be relevant in the case of tumor suppressor genes (TSG), which follow the 2-hit hypothesis (i.e., biallelic 2 mutations are necessary to deactivate TS), and in cases where temporal mutagenesis occurs (e.g., high APOBEC, ROS). They provide evidence that this scenario is likely to occur in patients with some cancer types. This is an interesting and potentially important result that merits the attention of the target audience. Nonetheless, I have some questions (detailed below) regarding the design of the study, the tools and parametrization of the theoretical analysis, and the empirical analysis, which I think, if addressed, would make the paper more solid and the conclusion more substantiated.

      Strengths:

      Combined theoretical investigation with empirical analysis of cancer patients.

      Weaknesses:

      Parametrization and systematic investigation of theoretical tools and their relevance to tumor evolution.

      We sincerely thank Reviewer 1 for their comments. As communicated in more detail in the point-by-point replies to the “Recommendations for the authors”, we have revised the paper to address these comments in various ways. To summarize, Reviewer 1 asked for (1) more comprehensive analyses of the parameter space, especially in ranges of small fitness effects and low mutation rates; (2) additional clarifications on details of mechanisms described in the manuscript; and (3) suggested further robustness checks to our empirical analyses. We have addressed these points as follows: we have added detailed analyses of dynamics and effect sizes for branching processes (see Sections SI2 and SI3 in the Supplementary Information, as well as Figures S3 and S4). As suggested, these additions provide characterizations of effect sizes in biologically relevant parameter ranges (low mutation rates and smaller fitness effect sizes), and extend our descriptions to processes with dynamically changing population sizes. Moreover, we have added further clarifications at suggested points in the manuscript, e.g. to elaborate on the non-monotonicities in Fig 3. Lastly, we have undertaken robustness checks using Spearman rather than Pearson correlation coefficients to quantify relations between TSG deactivation and APOBEC signature contribution, and have performed analyses investigating dynamics of reactive oxygen species-associated mutagenesis instead of APOBEC.

      Reviewer #2 (Public review):

      This work presents theoretical results concerning the effect of punctuated mutation on multistep adaptation and empirical evidence for that effect in cancer. The empirical results seem to agree with the theoretical predictions. However, it is not clear how strong the effect should be on theoretical grounds, and there are other plausible explanations for the empirical observations.

      Thank you very much for these comments. We have now substantially expanded our investigations of the parameter space as outlined in the response to the “eLife Assessment” above and in the detailed comments below (A(1)-A(3)) to convey more quantitative intuition for the magnitude of the effects we describe for different phases of tumor evolution. We agree that there could be potential additional confounders to our empirical investigations besides the challenges regarding quantification that we already described in our initial version of the manuscript. We have thus included further discussion of these in our manuscript (see replies to B(1)-B(3)), and we have expanded our empirical analyses as outlined in the response to the “eLife Assessment”.

      For various reasons, the effect of punctuated mutation may be weaker than suggested by the theoretical and empirical analyses:

      (A1) The effect of punctuated mutation is much stronger when the first mutation of a two-step adaptation is deleterious (Figure 2). For double inactivation of a TSG, the first mutation--inactivation of one copy--would be expected to be neutral or slightly advantageous. The simulations depicted in Figure 4, which are supposed to demonstrate the expected effect for TSGs, assume that the first mutation is quite deleterious. This assumption seems inappropriate for TSGs, and perhaps the other synergistic pairs considered, and exaggerates the expected effects.

      Thank you for highlighting this discrepancy between Figure 2 and Figure 4. For computational efficiency and for illustration purposes, we had opted for high mutation rates and large fitness effects in Figure 2; however, our results are valid even in the setting of lower mutation rates and fitness effects. To improve the connection to Figure 4, and to address other related comments regarding parameter dependencies, we have now added more detailed quantification of the effects we describe (Figures SF3 and SF4) to the revised manuscript. These additions show that the effects illustrated in Figure 2 retain large effect sizes when going to much lower mutation rates and much smaller fitness effects. Indeed, while under high mutation rates we only see the large relative effects if the first mutation is highly deleterious, these large effects become more universal when going to low mutation rates.

      In general, it is correct that the selective disadvantage (or advantage) conveyed by the first mutation affects the likelihood of successful 2-step adaptations. It is also correct that the magnitude of the ‘relative effect’ of temporal clustering on valley-crossing is highest if the lineage with only the first of the two mutations is vanishingly unlikely to produce a second mutant before going extinct. If the first mutation is strongly deleterious, the lineage of such a first mutant is likely to quickly go extinct – and therefore also more likely to do so before producing a second mutant.

      However, this likelihood of producing the second mutant is also low if the mutation rate is low. As our added figure (Figure SF3) illustrates, at low mutation rates appropriate for cancer cells, the relative effect is insensitive to the magnitude of the fitness disadvantage for large parts of the parameter space. Especially in populations of constant size (approximated by a birth/death ratio of 1), the relative effects for first mutations that reduce the birth rate by 0.5 or by 0.05 are indistinguishable (Figure SF3f).

      Moreover, the absolute effect, as we discuss in the paper (Figures SF2 and SF3), is largest not in regions of the parameter space in which the first mutant is infinitesimally unlikely to produce a second mutant (where 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> would be infinitesimally small), but rather in regions in which this first mutant has a non-negligible chance to produce a second mutant. The absolute effect therefore peaks around fitness-neutral first mutations. While the next comment (below) says that our empirical investigations more closely resemble comparisons of relative effects and not absolute effects, we would expect that the observations in our data come preferentially from multi-step adaptations with a large absolute effect, since the absolute effect is maximal when both 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> are relatively high.

      In summary, we believe that Figure 2, although its parameters are exaggerated for defensible reasons, is not a misleading illustration of the general phenomenon or of its applicability in biological settings, as effect sizes remain large when moving to biologically realistic parameter ranges. To clarify this issue, we have largely rewritten the relevant paragraphs in the results section and have added two additional figures (Figures SF3 and SF4) as well as a section in the SI with detailed discussion (SI2).
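
      The dependence of the second-mutant probability on the fitness cost of the first mutation and on the mutation rate, as discussed in this reply, can be illustrated with a minimal sketch. The toy birth-death process below is an assumption made purely for illustration and is not the model used in the manuscript; the function name `second_mutant_probability` and its parameters `birth`, `death`, and `u` are hypothetical.

```python
import math


def second_mutant_probability(birth: float, death: float, u: float) -> float:
    """Probability that a lineage founded by one single-mutant cell ever
    produces a double mutant, in a toy birth-death process.

    Illustrative assumptions: each event is a division with probability
    birth / (birth + death) and a death otherwise; at each division, one
    daughter acquires the second mutation with probability u.
    """
    p = birth / (birth + death)   # chance that the next event is a division
    a = p * (1.0 - u)             # division without a second mutation
    c = 1.0 - p                   # death of the cell
    if a == 0.0:
        return 1.0 - c
    # q is the probability that the lineage dies out without ever producing
    # a double mutant; it is the smallest root of q = c + a * q**2.
    discriminant = max(0.0, 1.0 - 4.0 * a * c)
    q = (1.0 - math.sqrt(discriminant)) / (2.0 * a)
    return 1.0 - q


if __name__ == "__main__":
    for u in (1e-3, 1e-5):
        for birth in (1.0, 0.95, 0.5):
            prob = second_mutant_probability(birth, death=1.0, u=u)
            print(f"u={u:g}  birth={birth:g}  P(second mutant)={prob:.3g}")
```

      In this toy setting, a neutral first mutation (birth equal to death) yields a probability of roughly the square root of `u`, whereas halving the birth rate yields a probability of roughly `u`, so both a fitness cost and a low mutation rate make a second mutant unlikely.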

      (A2) More generally, parameter values affect the magnitude of the effect. The authors note, for example, that the relative effect decreases with mutation rate. They suggest that the absolute effect, which increases, is more important, but the relative effect seems more relevant and is what is assessed empirically.

      Thank you for this comment. As noted in the replies to the above comments, we have now included extensive investigations of how sensitive effect sizes are to different parameter choices. We also apologize for insufficiently clearly communicating how the quantities in Figure 4 relate to the findings of our theoretical models.

      The challenge in relating our results to single-timepoint sequencing data is that we only observe the mutations that a tumor has acquired, but we do not directly observe the mutation rate histories that brought about these mutations. As an alternative readout, we therefore consider (through rough proxies: TSGs and APOBEC signatures) the amount of 2-step adaptations per acquired/retained mutation. While we unfortunately cannot control for the average mutation rate in a sample, we motivate using this “TSG-deactivation score” by the hypothesis that for any given mutation rate, we expect a positive relationship between the amount of temporal clustering and the amount of 2-step adaptations per acquired/retained mutation. This hypothesis follows directly from our theoretical model, where it formally translates to the statement that, for a fixed average mutation rate μ, the amount of 2-step adaptations per acquired mutation is increasing in 𝑘.

      However, while both quantities from our theoretical model, the relative effect 𝑓<sub>𝑘</sub>/𝑓<sub>1</sub> and the absolute effect, relate to this hypothesis (both are increasing in 𝑘), neither of them maps directly onto the formulation of our empirical hypothesis.

      We have now rewritten the relevant passages of the manuscript to more clearly convey our motivation for constructing our TSG deactivation score in this form (P. 4-6).

      (A3) Routes to inactivation of both copies of a TSG that are not accelerated by punctuation will dilute any effects of punctuation. An example is a single somatic mutation followed by loss of heterozygosity. Such mechanisms are not included in the theoretical analysis nor assessed empirically. If, for example, 90% of double inactivations were the result of such mechanisms with a constant mutation rate, a factor of two effect of punctuated mutagenesis would increase the overall rate by only 10%. Consideration of the rate of apparent inactivation of just one TSG copy and of deletion of both copies would shed some light on the importance of this consideration.

      This is a very good point, thank you. In our empirical analyses, the main motivation was to investigate whether we would observe patterns that are qualitatively consistent with our theoretical predictions, i.e. whether we would find positive associations between valley-crossing and temporal clustering. Our aim in the empirical analyses was not to provide a quantitative estimate of how strongly temporally clustered mutation processes affect mutation accumulation in human cancers. We hence restricted attention to only one mutation process that is well characterized to be temporally clustered (APOBEC mutagenesis) and to only one category of (epi)genomic changes (SNPs, in which APOBEC signatures are well characterized). Of course, such an analysis ignores that other mutation processes (e.g. LOH, copy number changes, methylation in promoter regions, etc.) may interact with the mechanisms that we consider in deactivating tumor suppressor genes.

      We have now updated the text to include further discussion of this limitation and further elaboration to convey that our empirical analyses are not intended as a complete quantification of the effect of temporal clustering on mutagenesis in vivo (P. 10,11).
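
      The dilution arithmetic in comment (A3) can be written out as a short sketch. The function name `overall_rate_increase` and its parameters are hypothetical, and the 10% and factor-of-two values are simply the reviewer's illustrative numbers, not estimates from data.

```python
def overall_rate_increase(sensitive_fraction: float, fold_effect: float) -> float:
    """Relative increase in the overall double-inactivation rate when only a
    fraction of inactivation routes is accelerated by punctuated mutagenesis.

    sensitive_fraction: share of double inactivations that arise via routes
        accelerated by punctuation (hypothetical input).
    fold_effect: factor by which punctuation speeds up those routes.
    """
    if not 0.0 <= sensitive_fraction <= 1.0:
        raise ValueError("sensitive_fraction must lie between 0 and 1")
    # Unaffected routes keep their rate; affected routes are multiplied by
    # fold_effect, so the total changes by sensitive_fraction * (fold_effect - 1).
    return sensitive_fraction * (fold_effect - 1.0)


if __name__ == "__main__":
    # Reviewer's example: 90% of double inactivations are unaffected,
    # and the remaining 10% are accelerated twofold.
    print(f"{overall_rate_increase(0.10, 2.0):.0%}")
```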

      Several factors besides the effects of punctuated mutation might explain or contribute to the empirical observations:

      (B1) High APOBEC3 activity can select for inactivation of TSGs (references in Butler and Banday 2023, PMID 36978147). This selective force is another plausible explanation for the empirical observations.

      Thank you for making this point. We agree that increased APOBEC3 activity, or any other similar perturbation, can change the fitness effect that any further changes/perturbations to the cell would bring about. Our empirical analyses therefore rely on the assumption that there are no major confounding structural differences in selection pressures between tumors with different levels of APOBEC signature contributions. We have expanded our discussion section to elaborate on this potential limitation (P. 10-11).

      While the hypothesis that APOBEC3 activity selects for inactivation of TSGs has been suggested, there remain other explanations. Either way, the ways in which selective pressures have been suggested to change would not interfere relevantly with the effects we describe. The paper cited in the comment argues that “high APOBEC3 activity may generate a selective pressure favoring” TSG mutations as “APOBEC creates a high [mutation] burden, so cells with impaired DNA damage response (DDR) due to tumor suppressor mutations are more likely to avert apoptosis and continue proliferating”. To motivate this reasoning, in the same passage, the authors cite a high prevalence of TP53 mutations across several cancer types with “high burden of APOBEC3-induced mutations”, but also note that “this trend could arise from higher APOBEC3 expression in p53-mutated tumors since p53 may suppress APOBEC3B transcription via p21 and DREAM proteins”.

      Translated to our theoretical framework, this reasoning builds on the idea that APOBEC3 activity increases the selective advantage of mutants with inactivation of both copies of a TSG. In contrast, the mechanism we describe acts by altering the chances of mutants with only one TSG allele inactivated to inactivate the second allele before going extinct. If homozygous inactivation of TSGs generally conveys relatively strong fitness advantages, lineages with homozygous inactivation would already be unlikely to go extinct. Further increasing the fitness advantage of such lineages would thus manifest mostly in a quicker spread of these lineages, rather than in changes in the chance that these lineages survive. In turn, such a change would have limited effect on the “rate” at which such 2-step adaptations occur, but would mostly affect the speed at which they fixate. It would be interesting to investigate these effects empirically by quantifying the speed of proliferation and chance of going extinct for lineages that newly acquired inactivating mutations in TSGs.

      Beyond this explicit mention of selection pressures, the cited paper also discusses high occurrences of mutations in TSGs in relation to APOBEC. These enrichments, however, are not uniquely explained by an APOBEC-driven change in selection pressures. Indeed, our analyses would also predict such enrichments.

      (B2) Without punctuation, the rate of multistep adaptation is expected to rise more than linearly with mutation rate. Thus, if APOBEC signatures are correlated with a high mutation rate due to the action of APOBEC, this alone could explain the correlation with TSG inactivation.

      Thank you for making this point. Indeed, an identifying assumption that we make is that average mutation rates are balanced between samples with a higher vs lower APOBEC signature contribution. We cannot cleanly test this assumption, as we only observe aggregate mutation counts but not mutation rates. However, the fact that we observe an enrichment for APOBEC-associated mutations among the set of TSG-inactivating mutations (see Figure 4F) would be consistent with APOBEC-mutations driving the correlations in Fig 4D, rather than just average mutation rates. We have now added a paragraph to our manuscript to discuss these points (P. 10-11).

      (B3) The nature of mutations caused by APOBEC might explain the results. Notably, one of the two APOBEC mutation signatures, SBS13, is particularly likely to produce nonsense mutations. The authors count both nonsense and missense mutations, but nonsense mutations are more likely to inactivate the gene, and hence to be selected.

      Thank you for making this point.  We have included it in our discussion of potential confounders/limitations in the revised manuscript (P. 10-11).  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Specific questions/comments/suggestions:

      (1) For the theoretical investigation, the authors use the Wright-Fisher model with specific parameters for the decrease/increase in the fitness (0.5,1.5). This model is not so relevant to cancer, because it assumes a constant population size, while in cancer, the population is dynamic (increasing, if the tumor grows). Although I see they mention relevance to the branching process (in SI), I think the branching process should be bold in the main text and the Wright-Fisher in SI (or even dropped).

      Thank you for this comment. We agree that too little attention had been given to the branching process in the original version of our manuscript. While the Wright-Fisher process is computationally efficient to simulate and thus lends itself to clean simulations for illustrative examples, it did lead us to put undue emphasis on populations of constant size.

      The added Figures SF2 and SF3 now focus on branching processes, and we have substantially expanded our discussion of how dynamics differ as a function of the population-size trajectory (constant vs growing; SI2, P. 4,9,10). Generally, we do believe that it is appropriate to consider both regimes. If tumors evolve from being confined within their site of origin to progressively invading adjacent tissues and organ compartments, they traverse different regions of the birth-death ratio parameter space. Moreover, the timing of transitions between phases of more or less constrained growth is likely closely tied to adaptation dynamics, since breaching barriers to expansion requires adapting to novel environments and selection pressures.

      We hope that the revised version of the manuscript conveys these points more clearly, and thank you for alerting us to this imbalance in the original version of our manuscript.

      (2) The parameters 0.5 (decrease in fitness) and 1.5 (increase in fitness) seem exaggerated; the typical values for the selective advantage are usually much lower (by an order of magnitude). The same goes for the mutation rate. The authors chose values of the order 0.001, while in cancer (and generally) it is much lower than that (10<sup>-5</sup> to 10<sup>-6</sup>). I think that generally, the authors should present a more systematic analysis of the sensitivity of the results to these parameters.

      Thank you very much for this very important comment. We have made this a major focus in our revisions (see our reply to the editor’s comments). As suggested, we have now added further analyses to explore more biologically relevant parameter regimes. Reviewer 2 has made a similar remark, and to avoid redundancies, we refer to our reply to that comment (A1) for a more detailed response.

      (3) In Figure 3, the authors explore the sensitivity to mu (mutation rate) and k (temporal clustering) and find a non-monotonic behavior (Figure 3C). However, this behavior is not well explained. I think some more explanations are required here.

      Thank you for pointing this out. We had initially relegated the more detailed explanations to the SI2 (which in the revised manuscript became SI4), but are happy to provide more elaboration in the main text, and have done so now (P. 5).

      For μ, the non-monotonicity reflects the exploration-exploitation tradeoff that this section is dedicated to: very small μ values (little exploration) prevent the population from finding fitness peaks. In contrast, once a fitness peak is reached, excessively large μ values (little exploitation) scatter the population away from this peak to points of lower fitness.

      For 𝑘, the most relevant dynamic is that at high 𝑘, the population becomes unable to find close-by fitness improvements (1-step adaptations) if it is not in a burst. As 𝑘 increases, this delay in adaptation (until a burst occurs) eventually comes to outweigh the benefits of high 𝑘 (better ability to undergo multi-step adaptations). Additionally, if 𝑘 ∙ μ becomes very large, clonal interference eventually leads to diminishing exploration-returns when 𝑘 is increased further (Fig 5C), as the per-cell likelihood of finding a specific fitness peak eventually saturates and increasing 𝑘 only causes multiple cells to find the same peak, rather than one cell finding this peak and its lineage fixating in the population.

      (4) In Figure 5, where the authors show the accumulation of the first (red; deleterious mutation) and second (blue; advantageous mutation), it seems that the fraction of deleterious mutations is much lower than that of advantageous mutations. This is opposite to the case of cancer, where most of the mutations are 'passengers', (slightly) deleterious or neutral mutations. Can the author explain this discrepancy and generally the relation of their parametrization to deleterious vs. advantageous mutations?

      Thank you for this comment. In general, we have focused attention in our paper on sequences of mutations that bring about a fitness increase. We call those sequences ‘adaptations’ and categorize these as one-step or multi-step, depending on whether or not they contain intermediate states with a fitness disadvantage.

      In our modelling, we do not consider mutations that are simply deleterious and are not a necessary part of a multi-step adaptation sequence. The motivation for this abstraction is, firstly, to focus on adaptation dynamics, and secondly, that in certain limits (small mu and large constant population sizes), lineages with only deleterious mutations have a probability close to one of going extinct, so that any emerging deleterious mutant would likely be ‘washed out’ of the population before a new mutation emerges.

      However, whether the dynamics of how neutral or deleterious passenger mutations are acquired also vary relevantly with the extent of temporal clustering is a valid and interesting question that would warrant its own study. The types of theoretical arguments for such an investigation would be very similar to the ones we use in our paper.

      (5) The theoretical investigation assumes a multi/2-step adaptation scenario where the first mutation is deleterious and the second is advantageous. I think this should be generalized and further explored. For example, what happens when there are multiple mutations that are slightly deleterious (as probably is the case in cancer) and only much later mutations confer a selective advantage? How stable is the "valley crossing" if more deleterious mutations occur after the 2 steps?

      This is also an important point and relates in part to the previous comment (4).  For discussion of interactions with deleterious mutations, please see the reply to comment (4).  

      Regarding generalizations of this valley-crossing scenario, note that any sequence of mutations that increases fitness can be decomposed into sequences of either one-step or multi-step adaptations, as defined  in the paper. Therefore, if all intermediate states before the final selectively advantageous state have a selective disadvantage making the lineages of such cells likely to go extinct, then our derivations in S1 apply, and the relative effect of temporal clustering becomes where n is the number of intermediate states. If, conversely, any of the intermediate states already had a selective advantage, then our model would consider the subsequence until this first mutation with a selective advantage as its individual (one-step or multi-step) “adaptation”.
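
      The decomposition described in this reply can be sketched in a few lines. The function name `decompose_adaptations`, the fitness values, and the choice to treat fitness-neutral intermediates like disadvantageous ones are assumptions made for illustration, not definitions taken from the manuscript.

```python
from typing import List, Sequence, Tuple


def decompose_adaptations(fitness_path: Sequence[float]) -> List[Tuple[int, int, str]]:
    """Split a mutational path into one-step and multi-step adaptations.

    fitness_path[0] is the starting fitness; each later entry is the fitness
    after one more mutation. An adaptation ends at the first state whose
    fitness exceeds the fitness at the start of that adaptation.
    Returns (start_index, end_index, kind) tuples; trailing mutations that
    never reach a higher fitness are not reported.
    """
    adaptations = []
    start = 0
    for i in range(1, len(fitness_path)):
        if fitness_path[i] > fitness_path[start]:
            # No intermediate state means a one-step adaptation.
            kind = "one-step" if i - start == 1 else "multi-step"
            adaptations.append((start, i, kind))
            start = i
    return adaptations


if __name__ == "__main__":
    # Hypothetical path: a valley crossing, a direct gain, then another valley.
    path = [1.0, 0.5, 1.5, 1.8, 1.2, 2.0]
    for start, end, kind in decompose_adaptations(path):
        print(f"states {start}->{end}: {kind}")
```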

      The second question, “How stable is the "valley crossing" if more deleterious mutations occur after the 2 steps?”, touches on a different property of the population dynamics, namely on how the fate of a mutant lineage depends on how this lineage emerged. In our paper, we compare different levels of temporal clustering for a fixed average mutation rate. This choice implies that, if we assume that the mutant that emerges from a valley-crossing does not go extinct, then the number of deleterious mutations expected to occur in this lineage, once emerged, will not depend on the extent of temporal clustering. However, if in-burst mutation rates increased the expected burden of early acquired deleterious mutations sufficiently much to affect the probability that the lineage with a multi-step adaptation goes extinct before the burst ends, then there may indeed be an interaction between effects of deleterious passengers and temporal clustering. We would, however, expect effects on this probability of early extinction to be relatively minor, since such a lineage with a selective advantage would quickly grow to large cell-numbers implying that it would require a large number of co-occurring and sufficiently deleterious mutations across these cells for the lineage to go extinct.

      (6) For the empirical analysis of TCGA cohorts, the authors focus on the contribution of APOBEC mutations (via signature analysis) to temporal mutagenesis. They find only a few cancer types (Figure 4D) that follow their prediction (in Figure 4C) of a correlation between TSG deactivation and temporal mutations in bursts. I think two main points should be addressed:

      Thank you for this comment. We will respond in detail to the corresponding points below, but would like to note here that while we find this correlation “in only a few cancer types”, we also show that only a few cancer types have relevant proportions of mutations caused by APOBEC, and it is precisely in these cancer types that we find a correlation. We have clarified this aspect in the revised version of the manuscript (P.7).

      (i) APOBEC is not the only cause for temporal mutagenesis. For example, elevated ROS and hypoxia are also potential contributors - it might therefore be important to extend the signature analysis (to include more possible sources for temporal mutagenesis). Potentially, such an extension may show that more cancer types follow the author's prediction.

      Thank you for this interesting suggestion. We have now included analogous analyses for contributions of signature SBS18 which is associated with ROS mutagenesis, and for the joint contribution of signatures SBS17a, SBS17b, SBS18 and SBS36, which all have been shown (some in a more context-dependent manner) to be associated with ROS mutagenesis. When doing so, we do not find a clear trend. However, we also do not find these signatures to account for substantial proportions of the acquired mutations, meaning that ROS mutagenesis likely also does not account for much of the variation in how temporally clustered the mutation rate trajectories of different tumors are. We have incorporated these results and their discussion in the manuscript (SI5 and Fig S8).

      (ii) The TSG deactivation score used by the authors only counts the number of mutations and does not consider if the 2 mutations are biallelic, which is highly important in this case. There are ways to investigate the specific allele of mutations in TCGA data (for example, see Ciani et al. Cell Sys 2022 PMID: 34731645). Given the focus on TSG of this study, I think it is important to account for this in the analysis.

      Thank you for making this point. We did initially consider inferring allele-specific mutation status, but decided against it as this would have shrunk our dataset substantially, thus potentially introducing unwanted biases. Determining whether two mutations lie on the same or on different alleles requires either (1) observing sequencing reads that cover the loci of both mutations, or (2) tracing whether (sets of) other SNPs on the same gene co-occur exclusively with one of the two considered mutations. These requirements lead to a substantial filtering of the observed mutations. Moreover, this filtering would be especially strong for tumors with a small overall mutation burden, as these would have fewer co-occurring SNPs to leverage in this inference. We would hence have preferentially filtered out TSG-deactivating mutations in tumors with low mutation burden. We have modified the text to address this point (P.14).

      (7) To continue point 4. I wonder why some known cancer types with high APOBEC signatures (e.g., lung, mentioned in the introduction) do not appear in the results of Figure 4. Can the author explain why it is missed?

      We do provide complete results for all categories in Supplementary Figure 3. To not overwhelm the figure in the main text, we only show the four categories with the highest average APOBEC signature contribution; beyond those four, average APOBEC signature contributions quickly drop. Lung-related categories do not feature in these top four (lung squamous cell carcinoma is fifth and lung adenocarcinoma is eighth in this ordering).

      Minors:

      (1) It is worth mentioning the relevance to resistance to treatment (see https://www.nature.com/articles/s41588-025-02187-1).

      Thank you for this suggestion. We have included a mention of the relation to this paper in the discussion section (P. 11).

      (2) Some of the figures' resolution should be improved - specifically, Figures 4, S1, and S5, which are not clear/readable.

      Thank you for pointing this out. This was the result of conversion to a Word document. We will provide TIFF files in the revisions to ensure better resolution.

      (3) Regarding Figure 3e,f. How come that moving from K=1 to K=I doesn't show any changes in fitness - it looks as if in both cases the value fluctuates around comparable mean fitness? Is that the case?

      While fitness differences between simulations with different k manifest robustly over long time-horizons (see Fig 3C with results over  generations), there are various sources of substantial stochasticity that make the fitness values in these short-term plots (Fig3D-F) imperfect illustrations of how long-term average fitness behaves. For instance, fitness landscapes are drawn randomly which introduces variability in how high and how close-by different fitness peaks are. Similarly, there is substantial randomness since both the type (direction on the 2-D fitness landscape) and the timing of mutation are stochastic.

      The short-term plots in Fig3D-F are intended to showcase representative dynamics of transitions between points on the genotype space with different fitness values following a redrawing of the landscape – but not necessarily to provide a comparison between the height of the attained (local) fitness-maxima.  

      (4) Figures 4c,d - correlation should be Spearman, not Pearson (it's not a linear relationship).

      Thank you for this comment. As a robustness check, we have generated the same figures using Spearman and not Pearson correlations and find results that are qualitatively consistent with the initially shown results. Indeed, using Spearman correlations, all four cancer types from Fig 4D have significant correlations.
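
      The difference between the two coefficients can be illustrated with a minimal standard-library sketch, in which Spearman's coefficient is computed as the Pearson correlation of average ranks. The data below are hypothetical and are not taken from the TCGA analysis; they only show that a monotone but non-linear relationship yields a perfect rank correlation while the Pearson coefficient stays below one.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical signature contributions and a monotone, non-linear score.
    contribution = [0.0, 0.1, 0.2, 0.4, 0.8]
    score = [math.exp(5 * v) for v in contribution]
    print(f"Pearson:  {pearson(contribution, score):.3f}")
    print(f"Spearman: {spearman(contribution, score):.3f}")
```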

      (5) Typo for E) "...in samples of the cancer types in (C) were caused by APOBEC" - it should be D (not C) I guess.

      Thank you for catching this. We fixed the typo.

      (6) Figure 5 - the mutation rate is too high (0.001), sensitivity to that? Also the fitness change is exaggerated (0.5, 1.5), and the division of mutations to 100 and 100 (200 in total) loci is not clear.

      Thank you for making this point. In this simulation setting it is unfortunately computationally prohibitively expensive to perform simulations at biologically realistic mutation rates. Therefore, we have scaled up the mutation rate while scaling down the population size. Moreover, the choice of model here is not meant to resemble a biologically realistic dynamic, but rather to create a stylized setting to be able to consider the interplay between clonal interference and facilitated valley-crossing in isolation. The key result from this figure is the separation of time scales at which low or high temporal clustering maximizes adaptability.

      However, known parameter dependencies in these models allow us to reason about how tuning individual parameters of this stylized model would affect the relative importance of clonal interference. This relative importance is largest when mutants are likely to co-occur on different competing clones in a population. The likelihood of such co-occurrences decreases substantially when the mutation rate is decreased to biologically realistic values. However, this likelihood also depends sensitively on the time that it takes a clone with a one-step adaptation to spread through the population. Smaller fitness advantages, as well as larger population sizes, slow down this process of taking over the population, which increases the likelihood of clonal interference. We now discuss these points in our revised manuscript (P. 8).

      (7) In the results text (last section) "Performing simulations for 2-step adaptations, we found that fixation rates are non-monotone in k. While at low k increasing k leads to a steep increase in the fixation rate, this trend eventually levels off and becomes negative, with further increases in k leading to a decrease in the fixation rate". Where are the results of this? It should be bold and apparent.

      Thank you for alerting us that this is unclear. The relevant figure reference is indeed Fig 5C, as in the preceding passage in the manuscript. However, we noticed that due to the presence of the steadily decreasing black line for 1-step adaptations, it is not easy to see that the blue line is also downward sloping. We have added a further reference to Fig 5C, and have adapted the grid spacing in the background of that figure panel to make this trend more easily visible.

      (8) Although not inconceivable, conclusions regarding resistance in the discussion are overstated. If you want to make this statement, you need to show that in resistant tumors, the temporal mutagenesis is responsible for progression vs. non-resistant/sensitive cases (is that the case), otherwise this should be toned down.

      Thank you for pointing this out. We have tempered these conclusions in the revised version of the manuscript (P. 11).

      Reviewer #2 (Recommendations for the authors):

      (1) It might be useful to look specifically at X-linked TSGs. On the authors' interpretation, their relative inactivation rates should not be correlated with APOBEC signatures in males (but should be in females), though the size of the dataset may preclude any definite conclusions.

      Thank you for this suggestion. Indeed, the size of the dataset unfortunately makes such analyses infeasible. Moreover, it is not clear whether X-linked TSGs might have structurally different fitness dynamics than TSGs on other chromosomes. However, this is an interesting suggestion worth following up on as more synergistic pairs confined to the X-chromosome are identified.

      (2) Might there be value in distinguishing tumors that carry mutations expected to increase APOBEC expression from those that do not? Among several reasons, an APOBEC signature due to such a mutation and an APOBEC signature due to abortive viral infection may differ with respect to the degree of punctuation.

      This is also an interesting suggestion for future investigations, but we unfortunately do not have sufficient information to build a meaningful analysis. In particular, it is unclear to what extent the degree and manifestation of episodicity/punctuation vary between these different mechanisms. Burst duration and intensity, as well as out-of-burst baseline rates of APOBEC mutagenesis, likely differ in ways that are as yet insufficiently characterized, which would make the results of analyses like those in Fig 4 hard to interpret.

      (3) Also, in that paragraph, is "proportional to" used loosely to mean "an increasing function of"?

      Thank you for this comment. We are not quite sure which paragraph is meant, but we use the term “proportional” in a literal sense at every point it is mentioned in the paper.

      For the occurrences of the term on pages 3, 10 and 11, the word is used in reference to probabilities of reproduction (division in the branching process, or ‘being drawn to populate a spot in the next generation’ in the WF process) being “proportional” to fitness. These probabilities are constructed by dividing each individual cell’s fitness by the total fitness summed across all cells in the population. As the population acquires fitness-enhancing mutations, the resulting proportionality constant (1/total_fitness) changes, so that the mapping from ‘fitness’ to probability of reproduction in the next reproduction event changes over time. Nevertheless, this mapping always remains fitness-proportional.
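
      The construction described above can be illustrated with a minimal sketch. The function name and the example fitness values are hypothetical and are not taken from the manuscript; the sketch only assumes that each cell's reproduction probability is its fitness divided by the total fitness of the population.

```python
def reproduction_probabilities(fitnesses):
    """Return each cell's probability of reproducing next.

    Each probability is the cell's fitness divided by the total fitness
    summed across all cells, i.e. fitness-proportional selection.
    """
    total_fitness = sum(fitnesses)
    if total_fitness <= 0:
        raise ValueError("total fitness must be positive")
    return [f / total_fitness for f in fitnesses]


# Hypothetical population: two wild-type cells (fitness 1) and one mutant (fitness 2).
before = reproduction_probabilities([1.0, 1.0, 2.0])

# Later, a second cell has acquired the fitness-enhancing mutation.
after = reproduction_probabilities([1.0, 2.0, 2.0])

# The proportionality constant 1/total_fitness changes from 1/4 to 1/5,
# but within each generation probability stays proportional to fitness.
ratio_before = before[2] / before[0]
ratio_after = after[2] / after[0]
```

      In both populations the mutant is exactly twice as likely to reproduce as a wild-type cell, even though the absolute probabilities differ because the normalizing total has changed.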

      On page 4, the term is used as follows: “the absolute rates 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> are proportional to µ<sup>n+1</sup>”. Here, proportionality in the literal sense follows from the equations on page 20, when setting , so that the second factor becomes µ<sup>n+1</sup>. We have included a clarifying sentence to address this in the derivations (SI1).

      (4) It could be mentioned in the main text that the time between bursts (d) must not be too short in order for the effect to be substantial. I would think that the relevant timescale depends on how deleterious the initial mutation is.

      Thank you for making this interesting and very relevant point. We have included a section (SI3) and figure (Fig S4) in the supplement to investigate the dependence on d. In short, we find that effects are weaker for small inter-burst intervals. The sensitivity to the burst size is highest for inter-burst intervals that are sufficiently small that the lineage of the first mutant has a relevant probability of surviving long enough to experience multiple burst phases.

      (5) Why not report that relative rate for Figure 2E as for 2D, as the former would seem to be more relevant to TSGs? And why was it assumed that the first inactivation is deleterious in the simulations in Figure 4 if the goal is to model TSGs?

      Thank you for noting this. For how we revised the paper to better connect Figures 2 and 4, please see our comment (A1) above. In general, neither 2E nor 2D should serve as quantitative predictions for what effect size we should expect in real world data, but are rather curated illustrations of the general phenomenon that we describe: we chose high mutation rates and exaggerated fitness effects so that dynamics become visually tractable in small simulation examples.

      For Figure 4, assuming that the first inactivation is deleterious makes the branching process for the mutant lineage subcritical, which keeps the simulation example simple and illustrative. For a more comprehensive motivation of the approach in 4D, and especially the discussion of how fitness effects of different magnitudes may or may not be subject to the effects we describe depending on whether the population is in a phase of constant or growing population size, we refer the reader to our added section SI2 and the added discussion on pages 6 and 10.
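
      To make the term "subcritical" concrete, the following sketch uses a simple binary-splitting branching process as a stand-in. This is an assumption for illustration and not necessarily the model used in the manuscript: each cell either divides into two cells with probability `p_divide` or dies, and both function names are hypothetical.

```python
def mean_offspring(p_divide):
    """Expected number of daughter cells per cell (two with probability p_divide, else zero)."""
    if not 0.0 <= p_divide <= 1.0:
        raise ValueError("p_divide must lie in [0, 1]")
    return 2.0 * p_divide


def extinction_probability(p_divide):
    """Probability that a lineage founded by a single cell eventually dies out.

    This is the smallest non-negative root of q = (1 - p) + p * q**2,
    which equals 1 whenever the mean offspring number is at most 1.
    """
    if not 0.0 <= p_divide <= 1.0:
        raise ValueError("p_divide must lie in [0, 1]")
    if p_divide <= 0.5:
        return 1.0
    return (1.0 - p_divide) / p_divide


# A deleterious first hit (hypothetical value): mean offspring below 1, so the lineage is subcritical.
deleterious_mean = mean_offspring(0.4)
deleterious_extinction = extinction_probability(0.4)

# A hypothetical advantageous lineage for contrast: supercritical, may survive indefinitely.
advantageous_extinction = extinction_probability(0.6)
```

      Under these assumptions a subcritical mutant lineage is certain to go extinct eventually, so a second mutation can only rescue it if it arrives while the lineage still exists.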

      (6) Figure 2, D and E. I'm not sure why heatmaps with height one were provided rather than simple plots over time. It is difficult, for example, to determine from a heatmap whether the increase is linear or the relative rates with and without punctuation.

      Thank you for this comment. These are not heatmaps with height one; rather, within every column of pixels, different segments of that column correspond to different clones within that population. This approach is intended to convey the difference in dynamics between the results in Fig 2 and the analogous results for a branching process in Fig S1. In Fig 2, valley-crossings happen sequentially, with subsequent fixations of adapted mutants. In Fig S1, with a growing population size, multiple clones with different numbers of adaptations coexist. We have now adapted the caption of Fig 2 to clarify this point.

      (7) Page 3: "High mutation rates are known to limit the rate of 1-step adaptations due to clonal interference." This is a bit misleading, as it makes it sound like increasing the mutation rate decreases the rate of one-step adaptations.

      Thank you for alerting us to this poor phrasing. We have changed it in the revised version of the manuscript (P. 3).

      (8) Page 4: "proportional to \mu^{n+1}" Is "proportional" being used loosely for "an increasing function of"?

      It is meant in the literal mathematical sense (see response to comment (3)).

      (9) Page 5, near bottom: "at least two mutations across the population". In the same genome?

      We counted mutations irrespective of whether they emerged in the same genome, to remain analogous to the TCGA analyses for which we also do not have single cell-resolved information.

      (10) Page 6: "missense or nonsense mutation". What about indels? If these are not affected by APOBEC, omitting them will exaggerate the effect of punctuation.

      Thank you for pointing out that this focus on single nucleotide substitutions conveys an exaggerated image of the importance of this effect of APOBEC-driven mutagenesis. There are of course several other classes of (epi)genomic alterations (e.g. chromatin modifications, methylation changes, copy number changes) that we do not consider in this part of our analysis. APOBEC mutagenesis serves as an example of a temporally clustered mutation process, which we investigate in its domain of action.

      We have added further discussion (P. 10-11) to convey that our empirical results merely constitute an investigation of whether empirical patterns are consistent with our hypothesis, but that the narrow focus on only SNVs, only TSGs, and only APOBEC mutagenesis does not allow for a general quantitative statement about the in-vivo relevance of the phenomena we describe.

      (11) Page 6: "normalized by the total number of single nucleotide substitutions." It is difficult to know how to normalize correctly, but I might think that the square of the number of substitutions would be more appropriate. Perhaps the total numbers are close enough that it matters little.

      Thank you for noting this. In the revised manuscript we have now expanded this passage in the text to more clearly convey our motivations for why we normalize by the total number of single nucleotide substitutions. While the likelihood for crossing a fitness valley with 2 mutations is indeed proportional to the square of the mutation rate, we do not directly observe mutation rates from our data.  Rather, we observe the number of acquired single nucleotide substitutions for every tumor sample, but since tumors in our data differ in the time since initiation and therefore differ in the numbers of divisions their cells have undergone before being sequenced, we cannot directly infer mutation rates. One way to phrase our main result about valley-crossing is that temporally clustered mutation processes have an increased rate of successful valley-crossings per attempted valley crossing. Our TSG deactivation score is constructed to reflect this idea. The number of TSGs serves as a proxy for successful valley-crossings and the total mutation burden serves as a proxy for attempted valley-crossings.

      To convey these points more clearly, we have rewritten the first paragraph in the Section “Proxies for valley crossing and for temporal clustering found in patient data” (P. 6).
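
      The normalization argument above can be sketched as follows. The function name and the tumor counts are hypothetical illustrations, not values or code from the study; the sketch only assumes that the score is the number of inactivated TSGs (proxy for successful valley-crossings) divided by the total number of single nucleotide substitutions (proxy for attempted valley-crossings).

```python
import math


def tsg_deactivation_score(n_inactivated_tsgs, n_substitutions):
    """Successful valley-crossings per attempted valley-crossing (proxy).

    n_inactivated_tsgs: count of inactivated tumor suppressor genes in a sample.
    n_substitutions: total single nucleotide substitutions in the same sample.
    """
    if n_substitutions <= 0:
        raise ValueError("n_substitutions must be positive")
    return n_inactivated_tsgs / n_substitutions


# Two hypothetical tumors that differ only in age: the older one has had
# twice as many divisions, hence twice the burden and twice the TSG hits.
young_score = tsg_deactivation_score(2, 1000)
old_score = tsg_deactivation_score(4, 2000)

# A hypothetical tumor with the same burden as the young one but more TSG hits,
# as would be expected if its mutations were more temporally clustered.
clustered_score = tsg_deactivation_score(5, 1000)

same_after_normalizing = math.isclose(young_score, old_score)
```

      Dividing by the total burden removes the difference that is due purely to the number of divisions before sequencing, so that remaining differences in the score reflect the success rate per attempt rather than tumor age.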

      (12) Perhaps embed links to the COSMIC web pages for SBS2 and SBS13 in the text.

      Thank you for this suggestion. We have embedded the links at the first mention of SBS2 and SBS13 in the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In the Late Triassic and Early Jurassic (around 230 to 180 Ma ago), southern Wales and adjacent parts of England were a karst landscape. The caves and crevices accumulated remains of small vertebrates. These fossil-rich fissure fills are being exposed in limestone quarrying. In 2022 (reference 13 of the article), a partial articulated skeleton and numerous isolated bones from one fissure fill of end-Triassic age (just over 200 Ma) were named Cryptovaranoides microlanius and described as the oldest known squamate - the oldest known animal, by some 20 to 30 Ma, that is more closely related to snakes and some extant lizards than to other extant lizards. This would have considerable consequences for our understanding of the evolution of squamates and their closest relatives, especially for their speed and absolute timing, and was supported in the same paper by phylogenetic analyses based on different datasets.

      In 2023, the present authors published a rebuttal (reference 18) to the 2022 paper, challenging anatomical interpretations and the irreproducible referral of some of the isolated bones to Cryptovaranoides. Modifying the datasets accordingly, they found Cryptovaranoides outside Squamata and presented evidence that it is far outside. In 2024 (reference 19), the original authors defended most of their original interpretation and presented some new data, some of it from newly referred isolated bones. The present article discusses anatomical features and the referral of isolated bones in more detail, documents some clear misinterpretations, argues against the widespread but not justifiable practice of referring isolated bones to the same species as long as there is merely no known evidence to the contrary, further argues against comparing newly recognized fossils to lists of diagnostic characters from the literature as opposed to performing phylogenetic analyses and interpreting the results, and finds Cryptovaranoides outside Squamata again.

      Although a few of the character discussions and the discussion of at least one of the isolated bones can probably still be improved (and two characters are addressed twice), I see no sign that the discussion is going in circles or otherwise becoming unproductive. I can even imagine that the present contribution will end it.

      We appreciate the positive response from reviewer 1!

      Reviewer #2 (Public review):

      Congratulations on this thorough manuscript on the phylogenetic affinities of Cryptovaranoides.

      Thank you.

      Recent interpretations of this taxon, and perhaps some others, have greatly changed the field's understanding of reptile origins- for better and (likely) for worse.

      We agree, and note that while it is possible for challenges to be worse than the original interpretations, both the original and subsequent challenges are essential aspects of what make science, science.

      This manuscript offers a careful review of the features used to place Cryptovaranoides within Squamata and adequately demonstrates that this interpretation is misguided, and therefore reconciles morphological and molecular data, which is an important contribution to the field of paleontology. The presence of any crown squamate in the Permian or Triassic should be met with skepticism, the same sort of skepticism provided in this manuscript.

      We agree and add that every testable hypothesis requires skepticism and testing.

      I have outlined some comments addressing some weaknesses that I believe will further elevate the scientific quality of the work. A brief, fresh read‑through to refine a few phrases, particularly where the discussion references Whiteside et al. could also give the paper an even more collegial tone.

      We have followed Reviewer 2’s recommendations closely (see below) and have justified in our responses if we do not fully follow a particular recommendation.

      This manuscript can be largely improved by additional discussion and figures, where applicable. When I first read this manuscript, I was a bit surprised at how little discussion there was concerning both non-lepidosauromorph lepidosaurs as well as stem-reptiles more broadly. This paper makes it extremely clear that Cryptovaranoides is not a squamate, but would greatly benefit in explaining why many of the characters either suggested by former studies to be squamate in nature or were optimized as such in phylogenetic analyses are rather widespread plesiomorphies present in crownward sauropsids such as millerettids, younginids, or tangasaurids. I suggest citing this work where applicable and building some of the discussion for a greatly improved manuscript. In sum:

      (1) The discussion of stem-reptiles should be improved. Nearly all of the supposed squamate features in Cryptovaranoides are present in various stem-reptile groups. I've noted a few, but this would be a fairly quick addition to this work. If this manuscript incorporates this advice, I believe arguments regarding the affinities of Cryptovaranoides (at least within Squamata) will be finished, and this manuscript will be better off for it.

      (2) I was also surprised at how little discussion there was here of putative stem-squamates or lepidosauromorphs more broadly. A few targeted comparisons could really benefit the manuscript. It is currently unclear as to why Cryptovaranoides could not be a stem-lepidosaur, although I know that the lepidosaur total-group in these manuscripts lacks character sampling due to their scarcity.

      We are responding to (1) and (2) together. We agree with the Reviewer that a thorough comparison of Cryptovaranoides to non-lepidosaurian reptiles is critical. This is precisely what we did in our previous study, Brownstein et al. (2023); see the main text and supplementary information therein. As addressed therein, there is substantial convergence between early lepidosaurs and some groups of archosauromorphs (our inferred position for Cryptovaranoides). Many of those points are not addressed in detail here in order to avoid redundancy and are simply referenced back to Brownstein et al. (2023). Secondly, stem reptiles (i.e., non-lepidosauromorphs and non-archosauromorphs), such as those suggested above (millerettids, younginids, or tangasaurids), are substantially more distantly related to Cryptovaranoides (following any of the published hypotheses). As such, they share fewer traits (either symplesiomorphies or homoplasies), and so, in our opinion, we would risk losing the squamate focus of our study.

      We thus respectfully decline to engage the full scope of the problem in this contribution, but do note that this level of detailed work would make for an excellent student dissertation research program.

      (3) This manuscript can be improved by additional figures, such as the slice data of the humerus. The poor quality of the scan data for Cryptovaranoides is stated during this paper several times, yet the scan data is often used as evidence for the presence or absence of often minute features without discussion, leaving doubts as to what condition is true. Otherwise, several sections can be rephrased to acknowledge uncertainty, and probably change some character scorings to '?' in other studies.

      We strongly agree with the reviewer. Unfortunately, the original publication (Whiteside et al., 2022) did not make available the raw CT scan data to make this possible. As noted below in the Responses to Recommendations Section, we only have access to the mesh files for each segmented element. While one of us has observed the specimens personally, we have not had the opportunity to CT scan the specimens ourselves.

      Reviewer #3 (Public review):

      Summary:

      The study provides an interesting contribution to our understanding of Cryptovaranoides relationships, which is a matter of intensive debate among researchers. My main concerns are in regard to the wording of some statements, but generally, the discussion and data are well prepared. I would recommend moderate revisions.

      Strengths:

      (1) Detailed analysis of the discussed characters.

      (2) Illustrations of some comparative materials.

      Thank you for noting the strengths inherent to our study.

      Weaknesses:

      Some parts of the manuscript require clarification and rewording.

      One of the main points of criticism of Whiteside et al. is using characters for phylogenetic considerations that are not included in the phylogenetic analyses therein. The authors call it a "non-trivial substantive methodological flaw" (page 19, line 531). I would step down from such a statement for the reasons listed below:

      (1) Comparative anatomy is not about making phylogenetic analyses. Comparative anatomy is about comparing different taxa in search of characters that are unique and characters that are shared between taxa. This creates an opportunity to assess the level of similarity between the taxa and create preliminary hypotheses about homology. Therefore, comparative anatomy can provide some phylogenetic inferences.

      That does not mean that tests of congruence are not needed. Such comparisons are the first step that allows creating phylogenetic matrices for analysis, which is the next step of phylogenetic inference. That does not mean that all the papers with new morphological comparisons should end with a new or expanded phylogenetic matrix. Instead, such papers serve as a rationale for future papers that focus on building phylogenetic matrices.

      We agree completely. We would also add that not every study presenting comparative anatomical work need be concluded with a phylogenetic analysis.

      Our criticism of Whiteside et al. (2022) and (2024) is that these studies provided many unsubstantiated claims of having recovered synapomorphies between Cryptovaranoides and crown squamates without actually having done so through the standard empirical means (i.e., phylogenetic analysis and ancestral state reconstruction). Both Whiteside et al. (2022) and (2024) indicate characters presented as ‘shared with squamates’ along with 10 characters presented as synapomorphies. However, their actual phylogenetically recovered synapomorphies were few in number (only 3) and these were not discussed.

      Furthermore, the comparative anatomy in Whiteside et al. (2022) and (2024) was restricted to comparing †Cryptovaranoides to crown squamates, based on the assumption that †Cryptovaranoides was a crown squamate and thus only needed to be compared to crown squamates.

      In conclusion, we respectfully maintain that such efforts are “non-trivial substantive methodological flaw(s)”.

      (2) Phylogenetic matrices are never complete, both in terms of morphological disparity and taxonomic diversity. I don't know if it is even possible to have a complete one, but at least we can say that we are far from that. Criticising a work that did not include all the possibly relevant characters in the phylogenetic analysis is simply unfair. The authors should know that creating/expanding a phylogenetic matrix is a never-ending work, beyond the scope of any paper presenting a new fossil.

      Respectfully, we did not criticize previous studies for including an incomplete phylogeny. Instead, we criticized the methodology behind the homology statements made in Whiteside et al. (2022) and Whiteside et al. (2024).

      (3) Each additional taxon has the possibility of inducing a rethinking of characters. That includes new characters, new character states, character state reordering, etc. As I said above, it is usually beyond the scope of a paper with a new fossil to accommodate that into the phylogenetic matrix, as it requires not only scoring the newly described taxon but also many that are already scored. Since the digitalization of fossils is still rare, it requires a lot of collection visits that are costly in terms of time.

      We agree on all points, but we are unsure of what the Reviewer is asking us to do relative to this study.

      (4) If I were to search for a true flaw in the Whiteside et al. paper, I would check if there is a confirmation bias. The mentioned paper should not only search for characters that support Cryptovaranoides affinities with Anguimorpha but also characters that deny that. I am not sure if Whiteside et al. did such an exercise. Anyway, the test of congruence would not solve this issue because by adding only characters that support one hypothesis, we are biasing the results of such a test.

      We would refer the Reviewer to their section (1) on comparative anatomy. As we and the Reviewer have pointed out, Whiteside et al. did not make comparative anatomical statements outside of crown Squamata in their original study. More specifically, Whiteside et al. (2022, Fig. 8) presented a phylogeny where Cryptovaranoides formed a clade with Xenosaurus within the crown of Anguimorpha, or what they termed “Anguiformes”, and made comparisons to the anatomies of the legless anguids Pseudopus and Ophisaurus. Whiteside et al. (2024) abandoned “Anguiformes”, maintained comparisons to Pseudopus, and emphasized affinities with Anguimorpha (but almost all of their phylogenies as published do not recover a monophyletic Anguimorpha unless amphisbaenians and snakes are considered to be anguimorphans). Thus, we agree that confirmation bias was inherent in their studies.

      To sum up, there is nothing wrong with proposing some hypotheses about character homology between different taxa that can be tested in future papers that will include a test of congruence. Lack of such a test makes the whole argumentation weaker in Whiteside et al., but not unacceptable, as the manuscript might suggest. My advice is to step down from such strong statements like "methodological flaw" and "empirical problems" and replace them with "limitations", which I think better describes the situation.

      We agree with the first sentence in this paragraph – there is nothing wrong with proposing character homologies between different taxa based on comparative anatomical studies. However, that is not what Whiteside et al. (2022) and (2024) did. Instead, they claimed that an ad hoc comparison of Cryptovaranoides to crown Squamata confirmed that Cryptovaranoides is in fact a crown squamate and likely a member of Anguimorpha. Their study did not recognize limitations, but rather, concluded that their new taxon pushed the age of crown Squamata into the Triassic.

      As noted by Reviewer 2, such a claim, and the ‘data’ upon which it is based, should be treated with skepticism. We have elected to apply strong skepticism and stringent tests of falsification to our critique.

      Reviewer #1 (Recommendations for the authors):

      (1) Lines 596-598 promise the following: "we provide a long[-]form review of these and other features in Cryptovaranoides that compare favorably with non-squamate reptiles in Supplementary Material." You have kindly informed me that all this material has been moved into the main text; please amend this passage.

      This has been deleted.

      (2) Comments on science

      41: I would rather say "an additional role".

      This has been edited accordingly.

      43: Reconstructing the tree entirely from extant organisms and adding fossils later is how Hennig imagined it, because he was an entomologist, and fossil insects are, on average, extremely rare and usually very incomplete (showing a body outline and/or wing venation and little or nothing else). He was wrong, indeed wrong-headed. As a historical matter, phylogenetic hypotheses were routinely built on fossils by the mid-1860s, pretty much as soon as the paleontologists had finished reading On the Origin of Species, and this practice has never declined, let alone been interrupted. As a theoretical matter, including as many extinct taxa as possible in a phylogenetic analysis is desirable because it breaks up long branches (as most recently and dramatically shown by Mongiardino Koch & Parry 2020), and while some methods and some kinds of data are less susceptible to long-branch attraction and long-branch repulsion than others, none are immune; and while missing data (on average more common in fossils) can actively mislead parametric methods, this is not the case with parsimony, and even in Bayesian inference the problem is characters with missing data, not taxa with missing data. Some of you have, moreover, published tip-dated phylogenetic analyses. As a practical matter, molecular data are almost never available from fossils, so it is, of course, true that analyses which only use molecular data can almost never include fossils; but in the very rare exceptions, there is no reason to treat fossil evidence as an afterthought.

      We agree and have changed “have become” to “is.”

      49-50, 59: The ages of individual fissure fills can be determined by biostratigraphy; as far as I understand, all specimens ever referred to Cryptovaranoides [13, 19] come from a single fill that is "Rhaetian, probably late Rhaetian (equivalent of Cotham Member, Lilstock Formation)" [13: pp. 2, 15].

      We appreciate this comment; the recent literature, however, suggests that variable ages are implied by the biostratigraphy at the English Fissure Fills, so we have chosen to keep this as is. Also note that several isolated bones were not recovered with the holotype but were discussed by Whiteside et al. (2024). The provenance of these bones was not clearly discussed in that paper.

      59-60: Why "putative"? Just to express your disagreement? I would do that in a less misleading way, for example: "and found this taxon as a crown-group squamate (squamate hereafter) in their phylogenetic analyses." - plural because [19] presented four different analyses of two matrices just in the main paper.

      We have removed this word.

      121-124: The entepicondylar foramen is homologous all the way down the tree to Eusthenopteron and beyond. It has been lost a quite small number of times. The ectepicondylar foramen - i.e., the "supinator" (brachioradialis) process growing distally to meet the ectepicondyle, fusing with it and thereby enclosing the foramen - goes a bit beyond Neodiapsida and also occurs in a few other amniote clades (...as well as, funnily enough, Eusthenopteron in later ontogeny, but that's independent).

      We agree. However, the important point here is that the features on the humerus of Cryptovaranoides are not comparable (they differ in location and morphology) to the ent- and ectepicondylar foramina in other reptiles, as we discuss at length. As such, we have kept this sentence as is.

      153: Yes, but you [18] mistakenly wrote "strong anterior emargination of the maxillary nasal process, which is [...] a hallmark feature of archosauromorphs" in the main text (p. 14) - and you make the same mistake again here in lines 200-206! Also, the fact [19: Figure 2a-c] remains that Cryptovaranoides did not have an antorbital fenestra, let alone an antorbital fossa surrounding it (a fossa without a fenestra only occurs in some cases of secondary loss of the fenestra, e.g., in certain ornithischian dinosaurs). Unsurprisingly, therefore, Cryptovaranoides also does not have an orbital-as-opposed-to-nasal process on its maxilla [19: Figure 2a-c].

      Lines 243-249 (in the original manuscript) deal with the emargination of the maxillary nasal process (but this does not imply a full antorbital fenestra). We explicitly state that this feature alone "has limited utility" for supporting archosauromorph affinity.

      158-173: The problem here is not that the capitellum is not preserved; from amniotes and "microsaurs" to lissamphibians and temnospondyls, capitella ossify late, and larger capitella attach to proportionately larger concave surfaces, so there is nothing wrong with "the cavity in which it sat clearly indicates a substantial condyle in life". Instead, the problem is a lack of quantification (...as has also been the case in the use of the exact same character in the debate on the origin of lissamphibians); your following sentence (lines 173-175) stands. The rest of the paragraph should be drastically shortened.

      We appreciate this comment. We note that the ontogenetic variation of this feature is in part the issue with the interpretation provided by Whiteside et al. (2024). The issue is the lack of consistency on the morphology of the capitellum in that study. We are unclear on what the reviewer means by ‘quantification,’ as the character in question is binary. 

      250-252: It's not going to matter here, but in any different phylogenetic context, "sphenoid" would be confusing given the sphenethmoid, orbitosphenoid, pleurosphenoid, and laterosphenoid. I actually recommend "parabasisphenoid" as used in the literature on early amniotes (fusion of the dermal parasphenoid and the endochondral basisphenoid is standard for amniotes).

      We have added "(=parabasisphenoid)" on first use but retain use of sphenoid because in the squamate and archosauromorph literature, sphenoid (or basisphenoid) is used more frequently.

      314-315: Vomerine teeth are, of course, standard for sarcopterygians. Practically all extant amphibians have a vomerine toothrow, for example. A shagreen of denticles on the vomer is not as widespread but still reaches into the Devonian (Tulerpeton).

      We agree, but vomerine teeth are rare in lepidosaurs and archosaurs and occur only in very recent clades, e.g., anguids and one stem scincoid. Their presence in amphibians is not directly relevant to the phylogenetic placement of Cryptovaranoides among reptiles.

      372: Fusion was not scored as present in [13], but as unknown (as "partial" uncertainty between states 0 and 1 [19:8]), and seemingly all three options were explored in [19].

      We politely disagree with the reviewer; state 1 is scored in Whiteside et al. (2024).

      377-383: Together with the partially fused NHMUK PV R37378 [13: Figure 4B, C; 19: 8], this is actually an argument that Cryptovaranoides is outside but close to Unidentata. The components of the astragalus fuse so early in extant amniotes that there is just a single ossification center in the already fused cartilage, but there are Carboniferous and Permian examples of astragali with sutures in the expected places; all of the animals in question (Diadectes, Hylonomus, captorhinids) seem to be close to but outside Amniota. (And yet, the astragalus has come undone in chamaeleons, indicating the components have not been lost.) - Also, if NHMUK PV R37378 doesn't belong to a squamate close to Unidentata, what does it belong to? Except in toothless beaks, premaxillary fusion is really rare; only molgin newts come to mind (and age, tooth size, and tooth number of NHMUK PV R37378 are wholly incompatible with a salamandrid).

      The relevance of the astragalus to the current discussion is unclear, as we do not mention this element in our manuscript. We discuss the fusion of the premaxillae in our response to the previous comment.

      471-474: That thing is concave. (The photo is good enough that you can enlarge it to 800% before it becomes too pixelated.) It could be a foramen filled with matrix; it does not look like a grain sticking to the outside of the bone. Also, spell out that you're talking about "suc.fo" in Figure 3j.

      We are also a bit confused about this comment, as we state:

      “Finally, we note here that Whiteside et al. [19] appear to have labeled a small piece of matrix attached to a coracoid that they refer to †C. microlanius as the supracoroacoid [sic] foramen in their figure 3, although this labeling is inferred because only “suc, supracoroacoid [sic]” is present in their figure 3 caption.” (L. 519-522, P. 17). We cannot verify that this structure is concave, and so we keep this text as is.

      476-489: [19] conceded in their section 4.1 (pp. 11-12) that the atlas pleurocentrum, though fused to the dorsal surface of the axis intercentrum as usual for amniotes and diadectomorphs, was not fused to the axis pleurocentrum.

      This is correct, as we note in the MS. The issue is whether these elements are clearly identifiable.

      506-510: [19:12] did identify what they considered a possible ulnar patella, illustrated it (Figure 4d), scored it as unknown, and devoted the entire section 4.4 to it.

      512-523: What I find most striking is that Whiteside et al., having just discovered a new taxon, feel so certain that this is the last one and any further material from that fissure must be referable to one of the species now known from there.

      We agree with these points and believe we have devoted adequate text to addressing them. Note that the reviewer does not recommend any revisions to these sections.

      553: Not that it matters, but I'm surprised you didn't use TNT 1.6; it came out in 2023 and is free like all earlier versions.

      We have kept this as is, following the reviewer's comment, and because we were interested in replicating the analyses in the previous publications that have contributed to the debate about the identity of this taxon. For the present simple analyses, both versions should perform identically, as the search algorithms for discrete characters are identical across these versions.

      562: Is "01" a typo, or do you mean "0 or 1"? In that case, rather write "0/1" or "{01}".

      This has been corrected to {01}.

      (3) Comments on nomenclature and terminology

      55, 56: Delete both "...".

      This has been corrected.

      100: "ent- and ectepicondylar"

      For clarity, we have kept the full words.

      107-108: I understand that "high" is proximal and "low" is distal, but what is "the distal surface" if it is not the articular surface in the elbow joint?

      This has been corrected.

      120: "stem pan-lepidosaurs, and stem pan-squamates"; Lepidosauria and Squamata are crown groups that don't contain their stems

      This has been corrected.

      122, 123: Italics for Claudiosaurus and Delorhynchus.

      This has been corrected.

      130: Insert a space before "Tianyusaurus" (it's there in the original), and I recommend de-italicizing the two genus names to keep the contrast (as you did in line 162).

      This has been corrected.

      130, 131: Replace both "..." by "[...]", though you can just delete the second one.

      This has been corrected.

      174: Not a capitulum, but a grammatically even smaller (double diminutive) capitellum.

      This has been corrected.

      209, 224, Table 1: Both teams have consistently been doing this wrong. It's "recessus scalae tympani". The scala tympani ("ladder/staircase of the [ear]drum") isn't the recess, it's what the recess is for; therefore, the recess is named "recess of the scala tympani", and because there was no word for "of" in Classical Latin ("de" meant "off" and "about"), the genitive case was the only option. (For the same reason, the term contains "tympani", the genitive of "tympanum".)

      This has been corrected.

      415-425: This is a terminological nightmare. Ribs can have (and I'm not sure this is exhaustive): a) two separate processes (capitulum, tuberculum) that each bear an articulating facet, and a notch in between; b) the same, but with a non-articulating web of bone connecting the processes; c) a single uninterrupted elongate (even angled) articulating facet that articulates with the sutured or fused dia- and parapophysis; d) a single round articulating facet. Certainly, a) is bicapitate and d) is unicapitate, but for b) and c) all bets are off as to how any particular researcher is going to call them. This is a known source of chaos in phylogenetic analyses. I recommend writing a sentence or three on how the terms "unicapitate" & "bicapitate" lack fixed meanings and have caused confusion throughout tetrapod phylogenetics, and that the condition seen in Cryptovaranoides is nonetheless identical to that in archosauromorphs.

      This has been added: “This confusion in part stems from the lack of a fixed meaning for uni- and bicapitate rib heads; in any case, †C. microlanius possesses a condition identical to archosauromorphs as we have shown.”  (L.475-477, P.16).

      439-440: Other than in archosaurs, some squamates and Mesosaurus, in which sauropsids are dorsal intercentra absent?

      We are unclear about the relevance of the question to this section. The issue at hand is that some squamate lineages possess dorsal intercentra, so the absence of dorsal intercentra cannot be considered a squamate synapomorphy without the optimization of this feature along a phylogeny (which was not accomplished by Whiteside et al.).

      458: prezygapophyses.

      This has been corrected.

      516: "[...]".

      This has been corrected.

      566: synapomorphies.

      This has been corrected.

      587: Macrocnemus.

      This has been corrected.

      585: I strongly recommend either taking off and nuking the name Reptilia from orbit (like Pisces) or using it the way it is defined in Phylonyms, namely as the crown group (a subset of Neodiapsida). Either would mean replacing "neodiapsid reptiles" with "neodiapsids".

      This has been corrected to “neodiapsids.”

      625: Replace "inclusive clades" by "included clades", "component clades", "subclades", or "parts," for example.

      This has been kept as is because “inclusive clades” is common terminology and is used extensively in, for example, the PhyloCode. 

      659: Please update.

      References are updated.

      Fig. 8: Typo in Puercosuchus.

      This has been corrected.

      (4) Comments on style and spelling

      You inconsistently use the past and the present tense to describe [13, 19], sometimes both in the same sentence (e.g., lines 323 vs. 325). I recommend speaking of published papers in the past tense to avoid ascribing past views and acts to people in their present state.

      This has been corrected to be more consistent throughout the manuscript.

      48: Remove the second comma.

      This has been corrected.

      91: Replace "[13] and WEA24" by "[13, 19]".

      This has been corrected.

      100: Commas on both sides of "in fact" or on neither

      This has been corrected.

      117: I recommend "the interpretation in [19]". I have nothing against the abbreviation "WEA24", but you haven't defined it, and it seems like a remnant of incomplete editing. - That said, eLife does not impose a format on such things. If you prefer, you can just bring citation by author & year back; in that case, this kind of abbreviation would make perfect sense (though it should still be explicitly defined).

      129, 145: Likewise.

      We have modified this to [13] and [19] where necessary.

      192-198: Surely this should be made part of the paragraph in lines 158-175, which has the exact same headline?

      This has been corrected.

      200-206: Surely this should be made part of the paragraph in lines 148-156, which has the exact same headline?

      These sections deal with different issues pertaining to the analyses of Whiteside et al. (2024), and so we have kept the organization as is.

      214: Delete "that".

      This has been deleted.

      312: "Vomer" isn't an adjective; I'd write "main vomer body" or "vomer's main body" or "main body of the vomer".

      This has been corrected.

      350: "figured"

      This has been corrected.

      400: Rather, "rearticulated" or "worked to rearticulate"? - And why "several"? Just write "two". "Several" implies larger numbers.

      These issues have been corrected.

      448, 500: As which? As what kind of feature? I'm aware that "as such" is fairly widely used for "therefore", but it still confuses me every time, and I have to suspect I'm not the only one. I recommend "therefore" or "for this reason" if that is what you mean.

      “As such” has been deleted.

      452: Adobe Reader doesn't let me check, but I think you have two spaces after "of".

      This has been corrected.

      514, 539, 546, 552, 588, Fig. 3, 5, 6, Table 1: "WEA24" strikes again.

      This has been corrected.

      515: Remove the parentheses.

      This has been corrected.

      531: Insert a space after the period.

      This has been corrected.

      532: Remove both commas and the second "that".

      This has been corrected.

      538: Remove the comma.

      This has been kept as is because changing it would render the sentence grammatically incorrect.

      545: "[...]" or, better, nothing.

      This has been corrected.

      547: Spaces on both sides of the dash or on neither (as in line 553).

      This has been corrected.

      552: Rather, "conducted a parsimony analysis".

      This has been corrected.

      556: Space after "[19]".

      This has been corrected.

      560: Comma after "narrow".

      This has been corrected.

      600: Comma after "above" to match the one in the preceding line - there's an insertion in the sentence that must be flanked by commas on both sides.

      This has been corrected.

      603: Compound adjectives like "alpha-taxonomic" need a hyphen to avoid tripping readers up.

      This has been corrected.

      612: Similarly, "ancestral-state reconstruction" needs one to make immediately clear it isn't a state reconstruction that is ancestral but a reconstruction of ancestral states.

      This has been corrected.

      613: If you want to keep this comma, you need to match it with another after "Cryptovaranoides" in line 611.

      We have kept this as is, because removing this comma would render the sentence grammatically incorrect.

      615: Likewise, you need a comma after "and" because "except for a few features" is an insertion. The other comma is actually optional; it depends on how much emphasis you want to place on what comes after it.

      This has been added.

      622: Comma after "[48, 49]".

      This has been added.

      672: Missing italics and two missing spaces.

      This has been corrected.

      678, 680-681, 693, 700-701, 734, 742, 747, 788, 797, 799, 803, 808, 810-811, 814, 817, 820, 823, 828, 841, 843: Missing italics.

      This has been corrected.

      683, 689: These are book chapters. Cite them accordingly.

      This has been corrected.

      737: Missing DOI.

      No DOI is available.

      793: Missing Bolosaurus major; and I'd rather cite it as "2024" than "in press", and "online early" instead of "n/a".

      This has been corrected.

      835: Hoffstetter, RJ?

      This has been corrected.

      836: Is there something missing?

      This has been corrected.

      839: This is the same reference as number 20 (lines 683-684), and it is miscited in a different way...!

      This has been corrected.

      Reviewer #2 (Recommendations for the authors):

      (1) There is a brief mention of a phylogenetic analysis being re-run, but it is unclear if any modifications (changes in scoring) based on the very observations were made. Please state this explicitly.

      This is explained in lines 600-622, P.20-21, in the section “Apomorphic characters not empirically obtained.”  "In order to check the characters listed by Whiteside et al. [19] (p.19) as “two diagnostic characters” and “eight synapomorphies” in support of a squamate identity for †Cryptovaranoides, we conducted a parsimony analysis of the revised version of the dataset [32] provided by Whiteside et al. [19] in TNT v 1.5 [91]. We used Whiteside et al.’s [19] own data version"

      (2) Line 20: There is almost no discussion of non‑lepidosaur lepidosauromorphs. I suggest including this, as the archosauromorph‑like features reported in Cryptovaranoides appear rather plastic. Furthermore, diagnostic features of Archosauromorpha in other datasets (e.g., Ezcurra 2016 or the works of Spiekman) are notably absent (and unsampled) in Cryptovaranoides. Expanding this comparison would greatly strengthen the manuscript.

      Our discussion of non-lepidosaur lepidosauromorphs is brief (although not absent), largely because of the poor fossil record of this grade. Where necessary, we do discuss these taxa. Also see our previous study (Brownstein et al. 2023) for an extensive discussion of characters relevant to archosauromorphs.

      (3) Line 38: I suggest removing "Archosauromorpha" from the keywords. The authors make a compelling case that Cryptovaranoides is not a squamate, yet they do not fully test its placement within Archosauromorpha (as they acknowledge). Perhaps use "Reptilia" instead?

      We have removed this keyword.

      (4) Line 99: The authors' points here are well made and largely valid. The presence of the ent‑ and ectepicondylar foramina is indeed an amniote plesiomorphy and cannot confirm a squamate identity. Their absence, however, can be informative - although it is unclear whether the CT scans of the humerus are of sufficient resolution, and Figure 4 of Brownstein et al. looks hastily reconstructed (perhaps owing to limited resolution). Moreover, the foramina illustrated by Whiteside do resemble those of other reptiles, albeit possibly over‑prepared and exaggerated.

      The issue with the noted figure is indeed due to poor resolution from the scans. Although we agree with the reviewer, we hesitate to talk about absence in this taxon being phylogenetically informative given the confounding influence of ontogeny.

      (5) I encourage the authors to provide slice data to support the claim that the foramina are absent (which could certainly be correct!); otherwise, the assertion remains unsubstantiated.

      We only have access to the mesh files of segmented bones, not the raw (reconstructed slice) data.

      (6) PLEASE NOTE - because the specimen is juvenile, the apparent absence of the ectepicondylar foramen is equivocal: the supinator process develops through ontogeny and encloses this foramen (see Buffa et al. 2025 on Thadeosaurus, for example).

      See above.

      (7) Line 122: Italicize 'Delorhynchus'

      This has been corrected.

      (8) Lines 131‑132: I'd suggest deleting the final sentence; it feels a little condescending, and your argument is already persuasive.

      This has been corrected.

      (9) Line 129: Please note that owenettid "parareptiles" also lack this process, as do several other stem‑saurians. Its absence is therefore not diagnostic of Squamata.

      Also: Such plasticity is common outside the crown. Milleropsis and Younginidae develop this process during ontogeny, even though a lower temporal bar never fully forms.

      We appreciate this point. See discussion later in the manuscript.

      (11) Line 172: Consider adding ontogeny alongside taphonomy and preservation. A juvenile would likely have a poorly developed radial condyle, if any. Acknowledging this possibility will add some needed nuance.

      This sentence has been modified, but we have not added a discussion of ontogeny here because it is not immediately relevant to refuting the argument about inference of the presence of this feature when it is not preserved.

      (12) Line 177: The "septomaxilla" in Whiteside et al. (2024, Figure 1C) resembles the contralateral premaxilla in dorsal view, with the maxillary process on the left and the palatal (or vomerine) process on the right (the dorsal process appears eroded). The foramen looks like a prepalatal foramen, common to many stem and crown reptiles. Consequently, scoring the septomaxilla as absent may be premature; this bone often ossifies late. In my experience with stem‑reptile aggregations, only one of several articulated individuals may ossify this element.

      We agree that the presence of a late-ossifying septomaxilla cannot be ruled out, but our point remains (in agreement with the referee) that scoring the septomaxilla as present based on the amorphous fragments is premature.

      (13) Line 200: Tomography data should be shown before citing it. The posterior margin of the maxilla appears rather straight, and the maxilla itself is tall for an archosauromorph. It would be more convincing to score this feature as present only after illustrating the relevant slices - and, as you note, the trait is widespread among non‑archosauromorphs.

      See above and Brownstein et al. (2023).

      (14) Line 208: Well argued: how could Whiteside et al. confidently assign a disarticulated element? Their "vagus" foramen actually resembles a standard hypoglossal foramen - identical to that seen in many stem reptiles, which often have one large and one small opening.

      Thank you!

      (15) Line 248: Again, please illustrate this region. One cannot argue for absence without showing the slice data. Note that millerettids and procolophonians - contemporaneous with Cryptovaranoides - possess an enclosed vidian canal, so the feature is broadly distributed.

      See above.

      (16) Line 258: The choanal fossa is intriguing: originally created for squamate matrices, yet present (to varying degrees) in nearly every reptile I have examined. It is strongly developed in millerettids (see Jenkins et al. 2025 on Milleropsis and Milleretta) and younginids, much like in squamates - Tiago appropriately scores it as present. Thus, it may be more of a "Neodiapsida + millerettids" character. In any case, the feature likely forms an ordered cline rather than a simple binary state.

      We agree and look forward to future study of this feature.

      (17) Line 283: Bolosaurids are not diapsids and, per Simões, myself, and others, "Diapsida" is probably invalid, at least how it is used here. Better to say "neodiapsids" for choristoderes and "stem‑reptiles" or "sauropsids" for bolosaurids. Jenkins et al.'s placement is largely a function of misidentifying the bolosaurid stapes as the opisthotic.

      We are not entirely clear on this point since bolosaurids are not mentioned in this section.

      (18) Line 298: Here, you note that the CT scans are rather coarse, which makes some earlier statements about absence/presence less certain (e.g., humeral foramina). It may strengthen the paper to make fewer definitive claims where resolution limits interpretation.

      We appreciate this point. However, in the case of the humeral foramina, the coarseness of the scans is one reason why we question Whiteside et al.’s scoring of the presence of these features.

      (19) Line 314: Multiple rows of vomerine teeth are standard for amniotes; lepidosauromorphs such as Paliguana and Megachirella also exhibit them (though they may not have been segmented in the latter's description). Only a few groups (e.g., varanopids, some millerettids) have a single medial row.

      We appreciate this point and have added those citations in the following new sentence: “Multiple rows of vomerine teeth are common in reptiles outside of Squamata [76]; the presence of only one row is restricted to a handful of clades, including millerettids [77,78], †Tanystropheus [49], and some [79], but not all [71,80] choristoderes.” (L. 360-363, P. 12).

      (20) Line 317: This is likely a reptile plesiomorphy - present in all millerettids (e.g., Milleropsis and Milleretta per Jenkins et al.). Citing these examples would clarify that it is not uniquely squamate. Could it be secondarily lost in archosauromorphs?

      We appreciate this point and have cited Jenkins et al. here. The polarity of this feature relative to Archosauromorpha is outside the scope of this discussion.

      (21) Line 336: Unfortunately, a distinct quadratojugal facet is usually absent in Neodiapsids and millerettids; where present, the quadratojugal is reduced and simply overlaps the quadrate.

      We appreciate this point but feel that reviewing the distribution of this feature across all reptiles is not relevant to the text noted.

      (22) Line 357: Pterygoid‑quadrate overlap is likely a tetrapod plesiomorphy. Whiteside et al. do not define its functional or phylogenetic significance, and the overlap length is highly variable even among sister taxa.

      We agree, but in any case this feature is impossible to assess in Cryptovaranoides.

      (23) Line 365: Another well‑written section - clear and persuasive.

      Thank you!

      (24) Line 385: The cephalic condyle is widespread among neodiapsids, so it is not uniquely squamate.

      We agree.

      (25) Character 391: Note that the frontal underlapping the parietal is widespread, appearing in both millerettids and neodiapsids such as Youngina.

      We appreciate this point, but our argument here is that this feature is not observable in the holotype of Cryptovaranoides.

      (26) Line 415: The "anterior process" is actually common among crown reptiles, including sauropterygians, so it cannot by itself place Cryptovaranoides within Archosauromorpha.

      We agree but also note that we do not claim this feature unambiguously unites Cryptovaranoides with Archosauromorpha.

      (28) Line 460: Yes - Whiteside et al. appear to have relabeled the standard amniote coracoid foramen. Excellent discussion.

      Thank you!

      (29) Line 496: While mirroring Whiteside's structure, discussing this mandibular character earlier, before the postcrania, might aid readability.

      We have chosen to keep this structure as is.

      (30) Lines 486-588: This section oversimplifies the quadrate articulation.

      We are unclear how this is an oversimplification.

      (31) Both Prolacerta and Macrocnemus possess a cephalic condyle and some mobility (though less than many squamates). In Prolacerta (Miedema et al. 2020, Figure 4), the squamosal posteroventral process loosely overlaps the quadrate head.

      We assume this comment refers to the section "Peg-in-notch articulation of quadrate head"; we appreciate the clarification that this feature occurs to a variable extent outside squamates, but this does not affect our statement that the material of Cryptovaranoides is too poorly preserved to confirm its presence.

      (32) Where is this process in Cryptovaranoides? It is not evident in Whiteside's segmentation of the slender squamosal - please illustrate.

      We are unclear as to which section this comment refers.

      (33) Additionally, the quadrate "conch" of Cryptovaranoides is well developed, bearing lateral and medial tympanic crests; the lateral crest is absent in the cited archosauromorphs.

      We note that no vertebrate has a medial tympanic crest (it is always laterally placed for the tympanic membrane, when present). If this is what the reviewer refers to, this is a feature commonly found across all tetrapods bearing a tympanum attached to the quadrate (e.g., most reptiles), and so it is not very relevant phylogenetically. Regarding its presence in Cryptovaranoides, the lateral margin of the quadrate is broken (Brownstein et al., 2023), so it cannot be determined. This incomplete preservation also makes it very hard to determine whether a quadrate conch is present. But as currently preserved, there is no evidence whatsoever for this feature.

      (34) Line 591: The cervical vertebrae of Cryptovaranoides are not archosauromorph‑like. Archosauromorph cervicals are elongate, parallelogram‑shaped, and carry long cervical ribs-none of which apply here. As the manuscript lacks a phylogenetic analysis, including these features seems unnecessary. Should they be added to other datasets, I suspect Cryptovaranoides would align along the lepidosaur stem (though that remains to be tested).

      We politely disagree. The reviewer here mentions that the cervical vertebrae of archosauromorphs are generally shaped differently from those in Cryptovaranoides. The description provided (“elongate, parallelogram‑shaped, and carry long cervical ribs”) is basically limited to protorosaurians (e.g., tanystropheids, Macrocnemus) and early archosauriforms. We note that archosauromorph cervicals are notoriously variable in shape, especially in the crown, but also among early archosauromorphs. Further, the cervical ribs are notably similar among early archosauromorphs (including protorosaurians) and Cryptovaranoides, as discussed and illustrated in Brownstein et al., 2023 (Figs. 2 and 3), especially concerning the presence of the anterior process.

      Further, we do include a phylogenetic analysis of the matrix provided in Whiteside et al. (2024) as noted in our results section. In any case, we direct the reviewer to our previous study (Brownstein et al., 2023), in which we conduct phylogenetic analyses that included characters relevant to this note.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors should use specimen numbers all over the text because we are talking about multiple individuals, and the authors contest the previous affinity of some of them. For example, on page 16, line 447, they mention an isolated vertebra but without any number. The specimen can be identified in the referenced article, but it would be much easier for the reader if the number were also provided here.

      Agreed and added.

      (2) Abstract: "Our team questioned this identification and instead suggested Cryptovaranoides had unclear affinities to living reptiles."

      That is very imprecise. The team suggested that it could be an archosauromorph or an indeterminate neodiapsid. Please change accordingly.

      We politely disagree. We stated in our 2023 study that whereas our phylogenetic analyses place this taxon in Archosauromorpha, it remains unclear where it would belong within the latter. This is compatible with “unclear affinities to living reptiles”.

      (3) Page 7, line 172: "Taphonomy and poor preservation cannot be used to infer the presence of an anatomical feature that is absent." Unfortunate wording. Taphonomy always has to be used to infer the presence or absence of anatomical features. Sometimes the feature is not preserved, but it leaves imprints/chemical traces or other taphonomic indicators that it was present in the organism. Please remove or rewrite the sentence.

      We agree and have modified the sentence to read: “Taphonomy and poor preservation cannot be used alone to justify the inference that an anatomical feature was present when it is not preserved and there is no evidence of postmortem damage. In a situation when the absence of a feature is potentially ascribable to preservation, its presence should be considered ambiguous.” (L. 141-145, P.5).

      (4) Page 4, line 91, please explain "WEA24" here, though it is unclear why this abbreviation is used instead of citation in the manuscript.

      This has been corrected to Whiteside et al. [19].

      (5) Page 6, line 144: "Together, these observations suggest that the presence of a jugal posterior process was incorrectly scored in the datasets used by WEA24 (type (ii) error)." That sentence is unclear. Why did the authors use "suggest"? Does it mean that they did not have access to the original data matrix to check it? If so, it should be clearly stated at the beginning of the manuscript.

      See earlier; this has been modified and “suggest” has been removed.

      (6) Page 7, line 174: "Finally, even in the case of the isolated humerus with a preserved capitulum, the condyle illustrated by Whiteside et al. [19] is fairly small compared to even the earliest known pan-squamates, such as Megachirella wachtleri (Figure 4)." Figure 4 does not show any humeri. Please correct.

      The reference to figure 4 has been removed.

      (7) Page 8, line 195-198: "This is not the condition specified in either of the morphological character sets that they cite [18,38], the presence of a distinct condyle that is expanded and is by their own description not homologous to the condition in other squamates." This is a bit unclear. Could the authors explain it a little bit further? How is the condition that is specified in the referred papers different compared to the Whiteside et al. description?

      We appreciate this comment and have broken this sentence up into three sentences to clarify what we mean:

      “The projection of the radial condyle above the adjacent region of the distal anterior extremity is not the condition specified in either of the morphological character sets that Whiteside et al. [19] cite [18,32]. The condition specified in those studies is the presence of a distinct condyle that is expanded. The feature described in Whiteside et al. [19] does not correspond to the character scored in the phylogenetic datasets.” (L.220-225, P.8).

      (8) Page 16, line 446: "they observed in isolated vertebrae that they again refer to C. microlanius without justification". That is not true. The referred paper explains the attribution of these vertebrae to Cryptovaranoides (see section 5.3 therein). The authors do not have to agree with that justification, but they cannot claim that no justification was made. Please correct it here and throughout the text.

      We have modified this sentence but note that the justification in Whiteside et al. (2024) lacked rigor. Whiteside et al. (2024) state: “Brownstein et al. [5] contested the affinities of three vertebrae, cervical vertebra NHMUK PV R37276, dorsal vertebra NHMUK PV R37277 and sacral vertebra NHMUK PV R37275. While all three are amphicoelous and not notochordal, the first two can be directly compared to the holotype. Cervical vertebra NHMUK PV R37276 is of the same form as the holotype CV3 with matching neural spine, ventral keel (=crest) and the posterior lateral ridges or lamina (figure 3c,d) shown by Brownstein et al. [5, fig. 1a]. The difference is that NHMUK PV R37276 has a fused neural arch to the pleurocentrum and a synapophysis rather than separate diapophysis and parapophysis of the juvenile holotype (figure 3c). Neurocentral fusion of the neural arch and centrum can occur late in modern squamates, ‘up to 82% of the species maximum size’ [28].

      The dorsal surface of dorsal vertebra NHMUK PV R37277 (figure 3e) can be matched to the mid-dorsal vertebra in the †Cryptovaranoides holotype (figure 4d, dor.ve) and has the same morphology of wide, dorsally and outwardly directed, prezygapophyses, downwardly directed postzygapophyses and similar neural spine. It is also of similar proportions to the holotype when viewed dorsally (figures 3e and 4d), both being about 1.2 times longer anteroposteriorly than they are wide, measured across the posterior margin. The image in figure 4d demonstrates that the posterior vertebrae are part of the same spinal column as the truncated proximal region but the spinal column between the two parts is missing, probably lost in quarrying or fossil collection.”

      This justification is based on pointing out the presence of supposed shared features between these isolated vertebrae and those in the holotype of Cryptovaranoides, even though none of these features are diagnostic for that taxon. We have changed the sentence in our manuscript to read:

      “Whiteside et al. [19] concur with Brownstein et al. [18] that the diapophyses and parapophyses are unfused in the anterior dorsals of the holotype of †Cryptovaranoides microlanius, and restate that fusion of these structures is based on the condition they observed in isolated vertebrae that they refer to †C. microlanius based on general morphological similarity and without reference to diagnostic characters of †C. microlanius” (L. 502-507, P. 17).

      (9) Figure 2. The figure caption lacks some explanations. Please provide information about affinity (e.g., squamate/gekkotan), age, and locality of the taxa presented. Are these left or right palatines? The second one seems to be incomplete, and maybe it is worth replacing it with something else?

      The figure caption has been modified:

      “Figure 2. Comparison of palatine morphologies. Blue shading indicates choanal fossa. Top image of †Cryptovaranoides referred left palatine is from Whiteside et al. [19]. Middle is the left palatine of †Helioscopos dickersonae (Squamata: Pan-Gekkota) from the Late Jurassic Morrison Formation [62]. Bottom is the right palatine of †Eoscincus ornatus (Squamata: Pan-Scincoidea) from the Late Jurassic Morrison Formation [31].”

      (10) Figure 8. The abbreviations are not explained in the figure caption.

      These have been added.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Introduction & Theory

      (1) It is difficult to appreciate why the first trial of extinction in a standard protocol does NOT produce the retrieval-extinction effect. This applies to the present study as well as others that have purported to show a retrieval-extinction effect. The importance of this point comes through at several places in the paper. E.g., the two groups in Study 1 experienced a different interval between the first and second CS extinction trials; and the results varied with this interval: a longer interval (10 min) ultimately resulted in less reinstatement of fear than a shorter interval. Even if the different pattern of results in these two groups was shown/known to imply two different processes, there is nothing in the present study that addresses what those processes might be. That is, while the authors talk about mechanisms of memory updating, there is little in the present study that permits any clear statement about mechanisms of memory. The references to a "short-term memory update" process do not help the reader to understand what is happening in the protocol.

      We agree with the reviewer that whether and how the retrieval-extinction paradigm works is still under debate. Our results provide another line of evidence that such a paradigm is effective in producing long term fear amnesia. The focus of the current manuscript is to demonstrate that the retrieval-extinction paradigm can also facilitate a short-term fear memory deficit measured by SCR. Our TMS study provided some preliminary evidence in terms of the brain mechanisms involved in the causal relationship between the dorsolateral prefrontal cortex (dlPFC) activity and the short-term fear amnesia and showed that both the retrieval interval and the intact dlPFC activity were necessary for the short-term fear memory deficit and accordingly were referred to as the “mechanism” for memory update. We acknowledge that the term “mechanism” might have different connotations for different researchers. We now more explicitly clarify what we mean by “mechanisms” in the manuscript (line 99) as follows:

      “In theory, different cognitive mechanisms underlying specific fear memory deficits, therefore, can be inferred based on the difference between memory deficits.”

      In reply to this point, the authors cite evidence to suggest that "an isolated presentation of the CS+ seems to be important in preventing the return of fear expression." They then note the following: "It has also been suggested that only when the old memory and new experience (through extinction) can be inferred to have been generated from the same underlying latent cause, the old memory can be successfully modified (Gershman et al., 2017). On the other hand, if the new experiences are believed to be generated by a different latent cause, then the old memory is less likely to be subject to modification. Therefore, the way the 1st and 2nd CS are temporally organized (retrieval-extinction or standard extinction) might affect how the latent cause is inferred and lead to different levels of fear expression from a theoretical perspective." This merely begs the question: why might an isolated presentation of the CS+ result in the subsequent extinction experiences being allocated to the same memory state as the initial conditioning experiences? This is not yet addressed in any way.

      As stated in our previous response, this manuscript does not investigate the cognitive mechanism of why and how an isolated presentation of the CS+ would suppress fear expression in the long term. As the reviewer is aware, and as we have addressed in our previous response letters, both positive and negative evidence abounds as to whether the retrieval-extinction paradigm can successfully suppress long-term fear expression. Previous research depicted mechanisms instigated by the single CS+ retrieval at the molecular, cellular, and systems levels, as well as through cognitive processes in humans. In the current manuscript, we simply set out to test whether, in addition to long-term fear amnesia, the retrieval-extinction paradigm can also affect subjects’ short-term fear memory.

      (2) The discussion of memory suppression is potentially interesting but, in its present form, raises more questions than it answers. That is, memory suppression is invoked to explain a particular pattern of results but I, as the reader, have no sense of why a fear memory would be better suppressed shortly after the retrieval-extinction protocol compared to the standard extinction protocol; and why this suppression is NOT specific to the cue that had been subjected to the retrieval-extinction protocol.

      Memory suppression is the hypothesis we proposed to explain the results we obtained in the experiments. We discussed the possibility of memory suppression and listed the reasons why such a mechanism might be at work. As we mentioned in the manuscript, our findings are consistent with the memory suppression mechanism in at least two respects: 1) cue-independence and 2) thought-control ability dependence. We agree that the questions raised by the reviewer are interesting, but answering them would require a series of further experiments to disentangle the various variables and conceptual questions about the purpose of a phenomenon, which we are afraid is beyond the scope of the current manuscript. We refer the reviewer to the discussion section, where we discuss memory suppression as a potential mechanism for the short-term amnesia we observed (lines 562-569), as follows:

      “Previous studies indicate that a suppression mechanism can be characterized by three distinct features: first, the memory suppression effect tends to emerge early, usually 10-30 mins after memory suppression practice and can be transient (MacLeod and Macrae, 2001; Saunders and MacLeod, 2002); second, the memory suppression practice seems to directly act upon the unwanted memory itself (Levy and Anderson, 2002), such that the presentation of other cues originally associated with the unwanted memory also fails in memory recall (cue-independence); third, the magnitude of memory suppression effects is associated with individual difference in control abilities over intrusive thoughts (Küpper et al., 2014).”

      (3) Relatedly, how does the retrieval-induced forgetting (which is referred to at various points throughout the paper) relate to the retrieval-extinction effect? The appeal to retrieval-induced forgetting as an apparent justification for aspects of the present study reinforces points 2 and 3 above. It is not uninteresting but lacks clarification/elaboration and, therefore, its relevance appears superficial at best.

      We brought the topic of retrieval-induced forgetting (RIF) to stress the point that memory suppression can be unconscious. In a standard RIF paradigm, unlike the think/no-think paradigm, subjects are not explicitly told to suppress the non-target memories. However, to successfully retrieve the target memory, the cognitive system actively inhibits the non-target memories, effectively implementing a memory suppression mechanism (though unconsciously). Therefore, it is possible our results might be explained by the memory suppression framework. We elaborated this point in the discussion section (lines 578-584): 

      “In our experiments, subjects were not explicitly instructed to suppress their fear expression, yet the retrieval-extinction training significantly decreased short-term fear expression. These results are consistent with the short-term amnesia induced with the more explicit suppression intervention (Anderson et al., 1994; Kindt and Soeter, 2018; Speer et al., 2021; Wang et al., 2021; Wells and Davies, 1994). It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious.”

      (4) I am glad that the authors have acknowledged the papers by Chalkia, van Oudenhove & Beckers (2020) and Chalkia et al (2020), which failed to replicate the effects of retrieval-extinction reported by Schiller et al in Reference 6. The authors have inserted the following text in the revised manuscript: "It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literature, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause." Firstly, if it is beyond the scope of the present study to discuss the discrepancies between the present and past results, it is surely beyond the scope of the study to make any sort of reference to clinical implications!!!

      As we have clearly stated in our manuscript, this paper was not about discussing why some literature was or was not able to replicate the retrieval-extinction results originally reported by Schiller et al. 2010. Instead, we aimed to report a novel short-term fear amnesia through the retrieval-extinction paradigm, above and beyond the long-term amnesia reported before. Speculating about clinical implications of these findings is unrelated to the long-term amnesia debate in the reconsolidation world. We now refer the reader to several perspectives and reviews that have proposed ways to resolve these discrepancies (lines 642-673).

      Secondly, it is perfectly fine to state that "the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause..." This is not uninteresting, but it also isn't saying much. Minimally, I would expect some statement about factors that are likely to determine whether one is or isn't likely to see a retrieval-extinction effect, grounded in terms of this theory.

      Again, as we have responded many times, we simply do not know why some studies were able to suppress the fear expression using the retrieval-extinction paradigm and other studies weren’t. This is still an unresolved issue that the field is actively engaging with, and we now refer the reader to several papers dealing with this issue. However, this is NOT the focus of our manuscript. Having a healthy debate does not mean that every study using the retrieval-extinction paradigm must address the long-standing question of why the retrieval-extinction paradigm is effective (at least in some studies).

      Clarifications, Elaborations, Edits

      (5) Some parts of the paper are not easy to follow. Here are a few examples (though there are others):

      (a) In the abstract, the authors ask "whether memory retrieval facilitates update mechanisms other than memory reconsolidation"... but it is never made clear how memory retrieval could or should "facilitate" a memory update mechanism.

      We meant to state that the retrieval-extinction paradigm might have effects on fear memory, above and beyond the purported memory reconsolidation effect. Sentence modified (lines 25-26) as follows:

      “Memory reactivation renders consolidated memory fragile and thereby opens the window for memory updates, such as memory reconsolidation.”

      (b) The authors state the following: "Furthermore, memory reactivation also triggers fear memory reconsolidation and produces cue specific amnesia at a longer and separable timescale (Study 2, N = 79 adults)." Importantly, in study 2, the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction. This result is interesting but cannot be easily inferred from the statement that begins "Furthermore..." That is, the results should be described in terms of the combined effects of retrieval and extinction, not in terms of memory reactivation alone; and the statement about memory reconsolidation is unnecessary. One can simply state that the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction.

      The sentence the reviewer referred to was in our original manuscript submission but had since been modified based on the reviewer’s comments from last round of revision. Please see the abstract (lines 30-35) of our revised manuscript from last round of revision:

      “Furthermore, across different timescales, the memory retrieval-extinction paradigm triggers distinct types of fear amnesia in terms of cue-specificity and cognitive control dependence, suggesting that the short-term fear amnesia might be caused by different mechanisms from the cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults).”

      (c) The authors also state that: "The temporal scale and cue-specificity results of the short-term fear amnesia are clearly dissociable from the amnesia related to memory reconsolidation, and suggest that memory retrieval and extinction training trigger distinct underlying memory update mechanisms." ***The pattern of results when testing occurred just minutes after the retrieval-extinction protocol was different to that obtained when testing occurred 24 hours after the protocol. Describing this in terms of temporal scale is unnecessary; and suggesting that memory retrieval and extinction trigger different memory update mechanisms is not obviously warranted. The results of interest are due to the combined effects of retrieval+extinction and there is no sense in which different memory update mechanisms should be identified with the different pattern of results obtained when testing occurred either 30 min or 24 hours after the retrieval-extinction protocol (at least, not the specific pattern of results obtained here).

      Again, we are afraid that the reviewer referred to the abstract in the original manuscript submission, instead of the revised abstract we submitted in the last round. Please see lines 37-39 of the revised abstract where the sentence was already modified (or the abstract from last round of revision).

      The fact that the 30min, 6hr and 24hr test results differ in their cue-specificity and thought-control ability dependence is, to us, an important discovery in terms of delineating different cognitive processes at work following the retrieval-extinction paradigm. We want to emphasize that the fear memories after going through the retrieval-extinction paradigm showed interesting temporal dynamics in terms of their magnitudes, cue-specificity and thought-control ability dependence.

      (d) The authors state that: "We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory update mechanisms following extinction training, and these mechanisms can be further disentangled through the lens of temporal dynamics and cue-specificities." *** The first part of the sentence is confusing around usage of the term "facilitate"; and the second part of the sentence that references a "lens of temporal dynamics and cue-specificities" is mysterious. Indeed, as all rats received the same retrieval-extinction exposures in Study 2, it is not clear how or why any differences between the groups are attributed to "different memory update mechanisms following extinction"

      The term “facilitate” was used to highlight the fact that the short-term fear amnesia effect is also memory retrieval dependent, as study 1 demonstrated. The novelty of the short-term fear memory deficit can be distinguished from the long-term memory effect via cue-specificity and thought-control ability dependence. Sentence has been modified (lines 97-101) as follows:

      “We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory deficits following extinction training, and these deficits can be further disentangled through the lens of temporal dynamics and cue-specificities. In theory, different cognitive mechanisms underlying specific fear memory deficits, therefore, can be inferred based on the difference between memory deficits.”

      Data

      (6A) The eight participants who were discontinued after Day 1 in Study 1 were all from the no reminder group. The authors should clarify how participants were allocated to the two groups in this experiment so that the reader can better understand why the distribution of non-responders was non-random (as it appears to be).

      (6B) Similarly, in study 2, of the 37 participants that were discontinued after Day 2, 19 were from Group 30 min and 5 were from Group 6 hours. The authors should comment on how likely these numbers are to have been by chance alone. I presume that they reflect something about the way that participants were allocated to groups: e.g., the different groups of participants in studies 1 and 2 could have been run at quite different times (as opposed to concurrently). If this was done, why was it done? I can't see why the study should have been conducted in this fashion - this is for myriad reasons, including the authors' concerns re SCRs and their seasonal variations.

      As we responded in the previous response letters (as well as in the revised manuscript), subjects were excluded because their SCR did not reach the threshold of 0.02 S when electric shock was applied. Subjects were assigned to different treatments daily (e.g., Day 1 for the reminder group and Day 2 for the no-reminder group) to avoid potential confusion in switching protocols between subjects within the same day. We suspect that the non-responders might be related to the body thermal conditions caused by the lack of central heating on specific dates. Please note that the discontinued subjects (non-responders) were let go immediately after the failure to detect their SCR (< 0.02 S) on Day 1 and were never invited back on Day 2, so it’s possible that the discontinued subjects were all from certain dates on which the body thermal conditions were not ideal for SCR collection. Despite the number of excluded subjects, we verified the short-term fear amnesia effect in three separate studies, which to us should serve as strong evidence for the validity of the effect.

      (6C) In study 2, why is responding to the CS- so high on the first test trial in Group 30 min? Is the change in responding to the CS- from the last extinction trial to the first test trial different across the three groups in this study? Inspection of the figure suggests that it is higher in Group 30 min relative to Groups 6 hours and 24 hours. If this is confirmed by the analysis, it has implications for the fear recovery index which is partly based on responses to the CS-. If not for differences in the CS- responses, Groups 30 min and 6 hours are otherwise identical. That is, the claim of differential recovery to the CS1 and CS2 across time may simply an artefact of the way that the recovery index was calculated. This is unfortunate but also an important feature of the data given the way in which the fear recovery index was calculated.

      We provided a detailed analysis of this question in our previous response letter, and we are reposting that response here:

      Following the reviewer’s comments, we went back and calculated the mean SCR difference of CS- between the first test trial and the last extinction trial for all three studies (see Author response image 1 below). In study 1, there was no difference in the mean CS- SCR (between the first test trial and last extinction trial) between the reminder and no-reminder groups (Kruskal-Wallis test), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2,76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- related SCR was influenced by the test time (30min, 6h or 24h). We also tested the CS- related SCR for the 4 groups in study 3 (where the test was conducted 1 hour after the retrieval-extinction training) and found that across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder) the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1,71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1,71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results in study 3 (see panel c) to panel b and plotted the CS- SCR difference across 4 different test time points and found that CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for CS- related SCR and highlight the importance of having the CS- as a control condition against which the CS+ related SCR is compared.

      Author response image 1.

      (6D) The 6 hour group was clearly tested at a different time of day compared to the 30 min and 24 hour groups. This could have influenced the SCRs in this group and, thereby, contributed to the pattern of results obtained.

      Again, we answered this question in our previous response. Please see the following for our previous response:

      For the 30min and 24h groups, the test phase can be arranged in the morning, in the afternoon or at night. However, for the 6h group, the test phase was inevitably in the afternoon or at night since we wanted to exclude the potential influence of night sleep on the expression of fear memory (see Author response table 1 below). If we restricted the test time in the afternoon or at night for all three groups, then the timing of their extinction training was not matched.

      Author response table 1.

      Nevertheless, we also went back and examined the data for the subjects tested only in the afternoon or at night in the 30min and 24h groups, to match the 6h group, where all the subjects were tested either in the afternoon or at night. According to the table above, we have 17 subjects for the 30min group (9 + 8), 18 subjects for the 24h group (9 + 9) and 26 subjects for the 6h group (12 + 14). As Author response image 2 shows, the SCR patterns in the fear acquisition, extinction and test phases were similar to the results presented in the original figure.

      Author response image 2.

      (6E) The authors find different patterns of responses to CS1 and CS2 when they were tested 30 min after extinction versus 24 h after extinction. On this basis, they infer distinct memory update mechanisms. However, I still can't quite see why the different patterns of responses at these two time points after extinction need to be taken to infer different memory update mechanisms. That is, the different patterns of responses at the two time points could be indicative of the same "memory update mechanism" in the sense that the retrieval-extinction procedure induces a short-term memory suppression that serves as the basis for the longer-term memory suppression (i.e., the reconsolidation effect). My pushback on this point is based on the notion of what constitutes a memory update mechanism; and is motivated by what I take to be a rather loose use of language/terminology in the reconsolidation literature and this paper specifically (for examples, see the title of the paper and line 2 of the abstract).

      As we mentioned previously, the term “mechanism” might have different connotations for different researchers. We aim to report a novel memory deficit following the retrieval-extinction paradigm, which differed significantly from the purported reconsolidation-related long-term fear amnesia in terms of its timescale, cue-specificity and thought-control ability dependence. A further TMS study confirmed that intact dlPFC function is necessary for the short-term memory deficit. It is on the basis of these results that we proposed that the short-term fear amnesia might be related to a different cognitive “mechanism”. As mentioned above, we now clarify what we mean by “mechanism” in the abstract and introduction (lines 31-34, 97-101).

      Reviewer #2 (Public review):

      The fear acquisition data is converted to a differential fear SCR and this is what is analysed (early vs late). However, the figure shows the raw SCR values for CS+ and CS- and therefore it is unclear whether acquisition was successful (despite there being an "early" vs "late" effect - no descriptives are provided).

      (1) There are still no descriptive statistics to substantiate learning in Experiment 1.

      We answered this question in our previous response letter. We are sorry that the definition of “early” and “late” trials was scattered across the manuscript. For example, we wrote “the late phase of acquisition (last 5 trials)” (Line 375-376) in the results section. Since there were 10 trials in total for the acquisition stage, we define the first 5 trials and the last 5 trials as the “early” and “late” phases of the acquisition stage, and we have explicitly added this definition where the terms “early” and “late” first appear (lines 316-318).

      In the results section, we did test whether the acquisition was successful in our previous manuscript (Line 316-325):

      “To assess fear acquisition across groups (Figure 1B and C), we conducted a mixed two-way ANOVA of group (reminder vs. no-reminder) x time (early vs. late part of the acquisition; first 5 and last 5 trials, correspondingly) on the differential fear SCR. Our results showed a significant main effect of time (early vs. late; F<sub>1,55</sub> = 6.545, P = 0.013, η<sup>2</sup> = 0.106), suggesting successful fear acquisition in both groups. There was no main effect of group (reminder vs. no-reminder) or the group x time interaction (group: F<sub>1,55</sub> = 0.057, P = 0.813, η<sup>2</sup> = 0.001; interaction: F<sub>1,55</sub> = 0.066, P = 0.798, η<sup>2</sup> = 0.001), indicating similar levels of fear acquisition between two groups. Post-hoc t-tests confirmed that the fear responses to the CS+ were significantly higher than that of CS- during the late part of acquisition phase in both groups (reminder group: t<sub>29</sub> = 6.642, P < 0.001; no-reminder group: t<sub>26</sub> = 8.522, P < 0.001; Figure 1C). Importantly, the levels of acquisition were equivalent in both groups (early acquisition: t<sub>55</sub> = -0.063, P = 0.950; late acquisition: t<sub>55</sub> = -0.318, P = 0.751; Figure 1C).”
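      For readers who want to see how the “early” and “late” acquisition values described above are formed, the following is a minimal sketch, not the authors' analysis code. It assumes that each participant has 10 paired CS+ and CS- acquisition trials and that the differential fear SCR is the trial-wise CS+ minus CS- value; the function names and the example SCR values are hypothetical.

```python
from statistics import mean


def differential_scr(cs_plus, cs_minus):
    """Return the trial-wise differential SCR (CS+ minus CS-)."""
    if len(cs_plus) != len(cs_minus):
        raise ValueError("CS+ and CS- series must have the same number of trials")
    return [plus - minus for plus, minus in zip(cs_plus, cs_minus)]


def early_late_means(diff_scr, trials_per_phase=5):
    """Mean differential SCR in the early (first n) and late (last n) trials."""
    if len(diff_scr) < 2 * trials_per_phase:
        raise ValueError("not enough trials to form non-overlapping phases")
    early = mean(diff_scr[:trials_per_phase])
    late = mean(diff_scr[-trials_per_phase:])
    return early, late


# Hypothetical SCR values for one participant (10 acquisition trials).
cs_plus = [0.10, 0.12, 0.15, 0.20, 0.25, 0.40, 0.45, 0.50, 0.50, 0.55]
cs_minus = [0.10, 0.11, 0.10, 0.12, 0.10, 0.10, 0.09, 0.10, 0.08, 0.10]
early, late = early_late_means(differential_scr(cs_plus, cs_minus))
print(f"early = {early:.3f}, late = {late:.3f}")
```

      The per-participant early and late values produced this way would then be the inputs to a group x time analysis such as the one quoted above.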

      In Experiment 1 (Test results) it is unclear whether the main conclusion stems from a comparison of the test data relative to the last extinction trial ("we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS") or the difference relative to the CS- ("differential fear recovery index between CS+ and CS-"). It would help the reader assess the data if Fig 1e presents all the indexes (both CS+ and CS-). In addition, there is one sentence which I could not understand "there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (P=0.048)". The p value suggests that there is a difference, yet it is not clear what is being compared here. Critically, any index taken as a difference relative to the CS- can indicate recovery of fear to the CS+ or absence of discrimination relative to the CS-, so ideally the authors would want to directly compare responses to the CS+ in the reminder and no-reminder groups. In the absence of such comparison, little can be concluded, in particular if SCR CS- data is different between groups. The latter issue is particularly relevant in Experiment 2, in which the CS- seems to vary between groups during the test and this can obscure the interpretation of the result.

      (2) In the revised analyses, the authors now show that CS- changes in different groups (for example, Experiment 2) so this means that there is little to conclude from the differential scores because these depend on CS-. It is unclear whether the effects arise from CS+ performance or the differential which is subject to CS- variations.

      There was a typo in the “P = 0.048” sentence and we have corrected it in our last response letter. Also in the previous response letter, we specifically addressed how the fear recovery index was defined (also in the revised manuscript).

      In most fear conditioning studies, CS- trials are included as the baseline control. In turn, most of the analyses conducted also involve comparisons between different groups. Directly comparing CS+ trials across groups (or conditions) is rare. In our study 2, we showed that the CS- response decreased as a function of testing delay (30min, 1hr, 6hr and 24hr). Ideally, it would be nice to show that the CS- did not change across groups/conditions. However, even in those circumstances, comparisons are still based on the differential CS response (CS+ minus CS-), that is, the difference of differences. It is also important to note that the difference score matters because the CS+ response alone, or across conditions, is difficult to interpret, especially in humans, due to noise, signal fluctuations, and irrelevant stimulus features; a trial-wise reference is therefore essential to assess the CS+ in the context of a reference stimulus in each trial (after all, the baselines are different). We list a few influential papers in the field in which the CS- responses were not particularly equivalent across groups/conditions and argue that this is a routine procedure (Kindt & Soeter 2018 Figs. 2-3; Sevenster et al., 2013 Fig. 3; Liu et al., 2014 Fig. 1; Raio et al., 2017 Fig. 2).
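      The two indexes under discussion can be written out explicitly. The sketch below follows the definitions quoted in the review (the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a given CS, and the differential index as the CS+ index minus the CS- index). It is an illustration only: the function names and the numeric values are hypothetical and do not come from the study.

```python
def fear_recovery_index(last_extinction_scr, first_test_scr):
    """SCR on the first test trial minus SCR on the last extinction trial."""
    return first_test_scr - last_extinction_scr


def differential_fear_recovery_index(
    cs_plus_last_ext, cs_plus_first_test, cs_minus_last_ext, cs_minus_first_test
):
    """Recovery index for the CS+ minus the recovery index for the CS-."""
    plus_index = fear_recovery_index(cs_plus_last_ext, cs_plus_first_test)
    minus_index = fear_recovery_index(cs_minus_last_ext, cs_minus_first_test)
    return plus_index - minus_index


# Hypothetical single-participant SCR values (arbitrary units).
index = differential_fear_recovery_index(
    cs_plus_last_ext=0.2,
    cs_plus_first_test=0.6,
    cs_minus_last_ext=0.1,
    cs_minus_first_test=0.3,
)
print(f"differential fear recovery index = {index:.2f}")
```

      Written this way, it is clear that the differential index combines four measurements, so a change in either the CS+ or the CS- terms moves the final value.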

      In experiment 1, the findings suggest that there is a benefit of retrieval followed by extinction in a short-term reinstatement test. In Experiment 2, the same effect is observed to a cue which did not undergo retrieval before extinction (CS2+), a result that is interpreted as resulting from cue-independence, rather than a failure to replicate in a within-subjects design the observations of Experiment 1 (between-subjects). Although retrieval-induced forgetting is cue-independent (the effect on items that are suppressed [Rp-] can be observed with an independent probe), it is not clear that the current findings are similar, and thus the strong parallels made are not warranted. Here, both cues have been extinguished and therefore been equally exposed during the critical stage.

      (3) The notion that suppression is automatic is speculative at best

      We responded to the same question in our previous revision. Please note that our study 1 (the comparison between reminder and no-reminder groups) was not set up to test the cue-independence hypothesis for the short-term amnesia, as it included only one CS+. Results from both study 2 (30min condition) and study 3 confirmed the cue-independence hypothesis, and we therefore believe that interpreting the results of study 2 as “a failure to replicate in a within-subject design of the observations of Experiment 1” is not warranted.

      We agree that the proposal of automatic or unconscious memory suppression is speculative, and that is why we mentioned it in the discussion. The timescale, cue-specificity and thought-control ability dependence of the short-term fear amnesia identified in our studies were reminiscent of the memory suppression effects reported in the previous literature. However, memory suppression studies typically adopt a conscious “suppression” treatment (such as the think/no-think paradigm), which was absent in the current study. In contrast, retrieval-induced forgetting (RIF), which is also considered a memory suppression paradigm via inhibitory control, does not require conscious effort to suppress any particular thought. Based on these results and the extant literature, we raised the possibility of memory suppression as a potential mechanism. We make clear in the discussion that the suppression hypothesis and its connections with RIF will require further evidence (lines 615-616):

      “future research will be needed to investigate whether the short-term effect we observed is specifically related to associative memory or the spontaneous nature of suppression as in RIF (Figure 6C).”

      (4) I still struggle with the parallels between these findings and the "limbo" literature. Here you manipulated the retention interval, whereas in the cited studies the number of extinction (exposure) was varied. These are two completely different phenomena.

      We borrowed the “limbo” term to stress the transitioning from short-term to long-term memory deficits (the 6hr test group). Merlo et al. (2014) found that memory reconsolidation and extinction were dissociable processes depending on the extent of memory retrieval. They argued that there was a “limbo” transitional state, where neither the reconsolidation nor the extinction process was engaged. Our results suggest that at the test delay of 6hr, neither the short-term nor the long-term effect was present, signaling a “transitional” state after which the short-term memory deficit wanes and the long-term deficit starts to take over. We make this idea more explicit as follows (lines 622-626):

      “These works identified important “boundary conditions” of memory retrieval in affecting the retention of the maladaptive emotional memories. In our study, however, we showed that even within a boundary condition previously thought to elicit memory reconsolidation, mnemonic processes other than reconsolidation could also be at work, and these processes jointly shape the persistence of fear memory.”

      (5) My point about the data problematic for the reconsolidation (and consolidation) frameworks is that they observed memory in the absence of the brain substrates that are needed for memory to be observed. The answer did not address this. I do not understand how the latent cause model can explain this, if the only difference is the first ITI. Wouldn't participants fail to integrate extinction with acquisition with a longer ITI?

      We take the sentence “they observed memory in the absence of the brain substrates that are needed for memory to be observed” as referring to the long-term memory deficit in our study. As we responded before, the aim of this manuscript was not to investigate the brain substrates involved in memory reconsolidation (or consolidation). Using a memory retrieval-extinction paradigm, we discovered a novel short-term memory effect, which differed from the purported reconsolidation effect in terms of timescale, cue-specificity and thought-control ability dependence. We further showed that both memory retrieval and intact dlPFC function were necessary to observe the short-term memory deficit. Therefore, we conclude that the brain mechanism involved in such an effect should be different from the one related to the purported reconsolidation effect. We make this idea more explicit as follows (lines 546-547):

      “Therefore, findings of the short-term fear amnesia suggest that the reconsolidation framework falls short to accommodate this more immediate effect (Figure 6A and B).”

      Whilst I could access the data in the OSF site, I could not make sense of the Matlab files as there is no signposting indicating what data is being shown in the files. Thus, as it stands, there is no way of independently replicating the analyses reported.

      (6) The materials in the OSF site are the same as before, they haven't been updated.

      Last time we thought the main issue was that the OSF site was not publicly accessible, and we therefore made it open to all visitors. We have now added a descriptive file that explains the variables, to help visitors replicate our analyses.

      (7) Concerning supplementary materials, the robustness tests are intended to prove that you 1) can get the same results by varying the statistical models or 2) you can get the same results when you include all participants. Here authors have done both so this does not help. Also, in the rebuttal letter, they stated "Please note we did not include non-learners in these analyses " which contradicts what is stated in the figure captions "(learners + non learners)"

      In the supplementary materials, we varied the statistical models and included both learners and non-learners in separate analyses, rather than doing both at once. In fact, in the supplementary material Figs. 1 & 2, we included all the participants (learners + non-learners), performed an analysis similar to that in the main text, and found similar results. Also, in the text of the supplementary material, we applied a different statistical method to learners only (that is, the subjects reported in the main text) and obtained similar results. We believe this is exactly what the reviewer suggested. There also seems to be a misunderstanding about the "Please note we did not include non-learners in these analyses" sentence in the rebuttal letter. As the reviewer can see, the full sentence read “Please note we did not include non-learners in these analyses (the texts of the supplementary materials)”. We meant that the figures and text in the supplementary material reflect two approaches: 1) figures depicting a re-analysis with all the included subjects (learners + non-learners); 2) text describing a different analysis with learners only. We added clarifications to emphasize these approaches in the supplementary materials.

      (8) Finally, the literature suggesting that reconsolidation interference "eliminates" a memory is not substantiated by data nor in line with current theorising, so I invite a revision of these strong claims.

      We agree and have toned down the strong claims.

      Overall, I conclude that the revised manuscript did not address my main concerns.

      In both rounds of responses, we tried our best to address the reviewer’s concerns. We hope that the clarifications in this letter and revisions in the text address the remaining concerns. Thank you for your feedback.

      Reference:

      Kindt, M. and Soeter, M. 2018. Pharmacologically induced amnesia for learned fear is time and sleep dependent. Nat Commun, 9, 1316.

      Liu, J., Zhao, L., Xue, Y., Shi, J., Suo, L., Luo, Y., Chai, B., Yang, C., Fang, Q., Zhang, Y., Bao, Y., Pickens, C. L. and Lu, L. 2014. An unconditioned stimulus retrieval extinction procedure to prevent the return of fear memory. Biol Psychiatry, 76, 895-901.

      Raio, C. M., Hartley, C. A., Orederu, T. A., Li, J. and Phelps, E. A. 2017. Stress attenuates the flexible updating of aversive value. Proc Natl Acad Sci U S A, 114, 11241-11246.

      Sevenster, D., Beckers, T. and Kindt, M. 2013. Prediction error governs pharmacologically induced amnesia for learned fear. Science, 339, 830-833.

    1. Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).
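      As a rough illustration of how the judgement "should (mathematically)" depend on the prior probability of a shift and the strength of evidence, here is a minimal sketch of an ideal Bayesian observer for an urn task of this kind. It rests on assumptions that are not stated in this review: the regime starts as red, a shift to blue can occur before each draw with probability `q`, a shift is irreversible within the 10 draws, and the urns are symmetric (the red urn yields red with probability `d`, the blue urn yields blue with probability `d`). The function name is hypothetical.

```python
def posterior_regime_shift(signals, q, d):
    """Posterior probability that the regime has shifted, after each draw.

    signals: sequence of "red" / "blue" ball colours
    q: assumed per-draw probability of a shift (red -> blue, irreversible)
    d: assumed probability of the majority colour in each urn
    """
    p_shift = 0.0  # the regime is assumed to start as red
    posteriors = []
    for s in signals:
        # Prediction step: a shift may have happened just before this draw.
        prior = p_shift + (1.0 - p_shift) * q
        # Likelihood of the observed colour under each regime.
        if s == "blue":
            like_shifted, like_not_shifted = d, 1.0 - d
        else:
            like_shifted, like_not_shifted = 1.0 - d, d
        # Bayes' rule.
        numerator = prior * like_shifted
        p_shift = numerator / (numerator + (1.0 - prior) * like_not_shifted)
        posteriors.append(p_shift)
    return posteriors


example = posterior_regime_shift(["red", "red", "blue", "blue"], q=0.05, d=0.9)
print([round(p, 3) for p in example])
```

      Under these assumptions, a blue ball raises the posterior and a red ball lowers it, with the size of each step governed by `q` and `d`.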

      The key behavioural finding is that participants over-estimate the prior probability of regime change when it is low, and under-estimate it when it is high; and participants over-estimate the strength of evidence when it is low and under-estimate it when it is high. In other words participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual differences effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      Strengths

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Weaknesses

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020).

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, doesn't Pt always increase with sample number (as by the time of later samples, there have been more opportunities for a regime shift)? Unless this is completely linear, the effect won't be controlled by including trial number as a co-regressor (which was done).

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      Editors’ note: Reviewer #2 was unavailable to re-review the manuscript. Reviewer #3 was added for this round of review to ensure two reviewers and because of their expertise in the computational and modelling aspects of the work.

    2. Author response:

      The following is the authors’ response to the current reviews.

      eLife Assessment

      This study offers valuable insights into how humans detect and adapt to regime shifts, highlighting distinct contributions of the frontoparietal network and ventromedial prefrontal cortex to sensitivity to signal diagnosticity and transition probabilities. The combination of an innovative task design, behavioral modeling, and model-based fMRI analyses provides a solid foundation for the conclusions; however, the neuroimaging results have several limitations, particularly a potential confound between the posterior probability of a switch and the passage of time that may not be fully controlled by including trial number as a regressor. The control experiments intended to address this issue also appear conceptually inconsistent and, at the behavioral level, while informing participants of conditional probabilities rather than requiring learning is theoretically elegant, such information is difficult to apply accurately, as shown by well-documented challenges with conditional reasoning and base-rate neglect. Expressing these probabilities as natural frequencies rather than percentages may have improved comprehension. Overall, the study advances understanding of belief updating under uncertainty but would benefit from more intuitive probabilistic framing and stronger control of temporal confounds in future work.

      We thank the editors for the assessment. The editor added several limitations based on the new reviewer 3 in this round, which we address below.

      With regard to temporal confounds, we clarified in the main text and in our response to Reviewer 3 that we had already addressed the potential confound between the posterior probability of a switch and the passage of time in GLM-2 by including the intertemporal prior. After adding the intertemporal prior to the GLM, we still observed the same fMRI results on probability estimates. In addition, we did two other robustness checks, which we mentioned in the manuscript.
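      To illustrate why a regressor of this kind differs from trial number, the sketch below assumes that the intertemporal prior is the probability that at least one shift has occurred by period t given the transition probability q alone, that is, 1 - (1 - q)^t. This definition is an assumption made for illustration, since the letter does not spell it out, and the function name is hypothetical.

```python
def intertemporal_prior(q, t):
    """Probability that at least one shift has occurred by period t (1-indexed),
    assuming an independent shift probability q in every period."""
    return 1.0 - (1.0 - q) ** t


q = 0.10
priors = [intertemporal_prior(q, t) for t in range(1, 11)]
# Period-to-period increments: positive but shrinking, so the curve is concave.
increments = [b - a for a, b in zip(priors, priors[1:])]
print([round(p, 3) for p in priors])
print([round(step, 3) for step in increments])
```

      Under this assumption the quantity rises with t but not linearly, so a linear trial-number regressor cannot absorb it completely, whereas a regressor built from the quantity itself can.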

      With regard to response mode (probability estimation rather than choice or natural frequencies), we wish to point out that previous research by Massey and Wu (2005), on which the current study was based, addressed the concern that participants' system-neglect tendencies might arise from the mode of information delivery, namely indicating beliefs by reporting probability estimates rather than through choice or another response mode. Massey and Wu (2005, Study 3) found the same biases when participants performed a choice task that did not require them to indicate probability estimates.

      With regard to the control experiments, they were in fact not intended to address the confound between posterior probability and passage of time. Rather, they aimed to address whether the neural findings were unique to change detection (Experiment 2) and to address visual and motor confounds (Experiment 3). These aims and the results of the control experiments were mentioned on pages 18-19.

      Finally, we wish to highlight that we performed detailed model comparisons following reviewer 2’s suggestions. Although reviewer 2 was unable to re-review the manuscript, we believe this provides insight into the literature on change detection. See “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p.27-30). The model comparison showed that system-neglect models that incorporate signal dependency describe participants' probability estimates better than the original system-neglect model. This suggests that people respond to change-consistent and change-inconsistent signals differently when judging whether the regime had changed. This was not reported in previous behavioral studies and was largely inspired by the neural finding on signal dependency in the frontoparietal cortex. It indicates that neural findings can provide novel insights into computational modeling of behavior.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      - The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experience. The authors discuss these differences comprehensively.

      - The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well. The model is comprehensively validated.

      - The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      We thank the reviewer for the comments.

      Weaknesses:

      The authors have adequately addressed most of my prior concerns.

      We thank the reviewer for recognizing our effort in addressing their concerns.

      My only remaining comment concerns the z-test of the correlations. I agree with the non-parametric test based on bootstrapping at the subject level, providing evidence for significant differences in correlations within the left IFG and IPS.

      However, the parametric test seems inadequate to me. The equation presented is described as the Fisher z-test, but the numerator uses the raw correlation coefficients (r) rather than the Fisher-transformed values (z). To my understanding, the subtraction should involve the Fisher z-scores, not the raw correlations.

      More importantly, the Fisher z-test in its standard form assumes that the correlations come from independent samples, as reflected in the denominator (which uses the n of each independent sample). However, in my opinion, the two correlations are not independent but computed within-subject. In such cases, parametric tests should take into account the dependency. I believe one appropriate method for the current case (correlated correlation coefficients sharing a variable [behavioral slope]) is explained here:

      Meng, X.-l., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111(1), 172-175. https://doi.org/10.1037/0033-2909.111.1.172

      It should be implemented here:

      Diedenhofen B, Musch J (2015) cocor: A Comprehensive Solution for the Statistical Comparison of Correlations. PLoS ONE 10(4): e0121945. https://doi.org/10.1371/journal.pone.0121945

      My recommendation is to verify whether my assumptions hold, and if so, perform a test that takes correlated correlations into account. Or, to focus exclusively on the non-parametric test.

      In any case, I recommend a short discussion of these findings and how the authors interpret that some of the differences in correlations are not significant.

      Thank you for the careful check. This was indeed our mistake. We also agree that the two correlations are not independent. We therefore replaced the test with one that accounts for dependent correlations, following Meng et al. (1992) as suggested by the reviewer.

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as $r_{blue}$, and that at change-inconsistent (red) signals as $r_{red}$. To statistically compare these two correlations, we adopted the approach of Meng et al. (1992), which specifically tests differences between dependent correlations according to the following equation

      $$Z = \left(z_{r_1} - z_{r_2}\right)\sqrt{\frac{N - 3}{2\,(1 - r_x)\,h}}$$

      where $N$ is the number of subjects, $z_{r_i}$ is the Fisher z-transformed value of $r_i$, $r_1 = r_{blue}$ and $r_2 = r_{red}$. $r_x$ is the correlation between the neural sensitivity at change-consistent signals and change-inconsistent signals.

      $$h = \frac{1 - f\,\overline{r^2}}{1 - \overline{r^2}}, \qquad f = \frac{1 - r_x}{2\,(1 - \overline{r^2})}$$

      where $\overline{r^2}$ is the mean of the $r_i^2$, and $f$ should be set to 1 if $f > 1$.

      We found that in two of the five ROIs in the frontoparietal network, namely the left IFG and left IPS, the difference in correlation was significant (one-tailed z test; left IFG: 𝑧 = 1.8908, 𝑝 = 0.0293; left IPS: 𝑧 = 2.2584, 𝑝 = 0.0049). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.9522, 𝑝 = 0.1705; right IFG: 𝑧 = 0.9860, 𝑝 = 0.1621; right IPS: 𝑧 = 1.4833, 𝑝 = 0.0690). We chose a one-tailed test because we already knew that the correlation under the blue signals was significantly greater than 0. These updated results are consistent with the nonparametric tests we had already performed, and we will update them in the revised manuscript.
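      For readers who wish to reproduce a test of this kind, the following is a minimal sketch of the Meng et al. (1992) z-test for two dependent correlations that share one variable. It uses only the Python standard library; the function names and the input values are hypothetical and are not the values or code used in the study.

```python
import math


def upper_tail_p(z):
    """One-tailed p-value P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def meng_z_test(r1, r2, rx, n):
    """Meng, Rosenthal & Rubin (1992) test of r1 versus r2.

    r1, r2: correlations of two predictors with the same outcome
    rx: correlation between the two predictors
    n: number of subjects
    Returns (z, one-tailed p for the hypothesis r1 > r2).
    """
    r_sq_mean = (r1 ** 2 + r2 ** 2) / 2.0
    f = min((1.0 - rx) / (2.0 * (1.0 - r_sq_mean)), 1.0)  # f is capped at 1
    h = (1.0 - f * r_sq_mean) / (1.0 - r_sq_mean)
    # math.atanh is the Fisher z-transform.
    z = (math.atanh(r1) - math.atanh(r2)) * math.sqrt(
        (n - 3) / (2.0 * (1.0 - rx) * h)
    )
    return z, upper_tail_p(z)


# Hypothetical inputs, for illustration only.
z, p = meng_z_test(r1=0.6, r2=0.2, rx=0.3, n=30)
print(round(z, 3), round(p, 4))
```

      Note that the difference is taken between the Fisher-transformed values rather than the raw correlations, and that the dependence between the two correlations enters through `rx`.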

      Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).

      The key behavioural finding is that participants over-estimate the prior probability of regime change when it is low, and under-estimate it when it is high; and participants over-estimate the strength of evidence when it is low and under-estimate it when it is high. In other words participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual differences effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      We thank the reviewer for the overall descriptions of the manuscript.

      Strengths:

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Thank you for these assessments.

      Weaknesses:

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      We appreciate the reviewer’s concern on this issue. The concern was addressed in Massey and Wu (2005), where participants performed a choice task in which they were not asked to provide probability estimates (Study 3 in Massey and Wu, 2005). Instead, participants in Study 3 were asked to predict the color of the ball before seeing a signal. This was a more intuitive way of indicating their belief about a regime shift. The results from the choice task were identical to those found in the probability estimation task (Study 1 in Massey and Wu). We take this as evidence that the system-neglect behavior the participants showed was less likely to be due to the mode of information delivery.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).

      We thank the reviewer for this comment. It is true that the system-neglect model is not entirely inconsistent with regression to the mean, regardless of whether the implementation has a hyper-prior or not. In fact, our behavioral measure of sensitivity to transition probability and signal diagnosticity, which we termed the behavioral slope, is based on linear regression analysis. In general, the modeling approach in this paper is to start from a generative model that defines ideal performance and to consider modifying the generative model when systematic deviations of actual performance from the ideal are observed. In this approach, a generative model with a hyper-prior would be more complex to begin with, and a regression-to-the-mean idea by itself does not generate a priori predictions.
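      As a worked illustration of the point under discussion, the sketch below fits an ordinary least-squares line to the three ground-truth and subjective transition probabilities quoted in the reviewer's comment. A slope below 1 expresses reduced sensitivity, and the point where the fitted line crosses the identity line is the value towards which estimates appear to be attracted. Treating the mapping as linear is an assumption made for illustration; this is not the behavioral-slope analysis or code from the manuscript, and the function name is hypothetical.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Values quoted in the reviewer's comment.
true_q = [0.01, 0.05, 0.10]
subjective_q = [0.037, 0.052, 0.069]

slope, intercept = least_squares_line(true_q, subjective_q)
# Where the fitted line meets y = x, i.e. the apparent point of attraction.
fixed_point = intercept / (1.0 - slope)
print(round(slope, 3), round(fixed_point, 3))
```

      With these three points the fitted slope is well below 1 and the crossing point lies close to 0.05. A linear fit of this kind describes the compression but cannot by itself distinguish a hyper-prior from other accounts.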

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020)

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      Thank you for raising this point. The modeling principle we adopt is the following. We start from the normative model—the Bayesian model—that defined what normative behavior should look like. We compared participants’ behavior with the Bayesian model and found systematic deviations from it. To explain those systematic deviations, we considered modeling options within the confines of the same modeling framework. In other words, we considered a parameterized version of the Bayesian model, which is the system-neglect model and examined through model comparison the best modeling choice. This modeling approach is not uncommon, and many would agree this is the standard approach in economics and psychology. For example, Kahneman and Tversky adopted this approach when proposing prospect theory, a modification of expected utility theory where expected utility theory can be seen as one specific model for how utility of an option should be computed.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, doesn't Pt always increase with sample number (as by the time of later samples, there have been more opportunities for a regime shift)? Unless this is completely linear, the effect won't be controlled by including trial number as a co-regressor (which was done).

      Thank you for raising this concern. Yes, Pt always increases with sample number regardless of evidence (seeing change-consistent or change-inconsistent signals). This is captured by the ‘intertemporal prior’ in the Bayesian model, which we included as a regressor in our GLM analysis (GLM-2), in addition to Pt. In short, GLM-1 had Pt and sample number. GLM-2 had Pt, intertemporal prior, and sample number, among other regressors. And we found that, in both GLM-1 and GLM-2, both vmPFC and ventral striatum correlated with Pt.
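      The reviewer's point about nonlinearity can be illustrated with a minimal sketch. It assumes (this is not stated in the response) that a shift occurs with a constant per-period transition probability `q` and that the regime cannot shift back, so the prior probability that a shift has occurred by period `t` is 1 − (1 − q)^t. The function name `prior_shift_by_period` is hypothetical and is not taken from the authors' code.

```python
def prior_shift_by_period(q: float, t: int) -> float:
    """Prior probability that at least one regime shift occurred in periods 1..t.

    Assumes a constant per-period transition probability q and no shift back
    (an illustrative assumption, not necessarily the authors' exact definition).
    """
    return 1.0 - (1.0 - q) ** t


if __name__ == "__main__":
    # Transition probabilities quoted in the review: 0.01, 0.05, 0.10.
    for q in (0.01, 0.05, 0.10):
        curve = [prior_shift_by_period(q, t) for t in range(1, 11)]
        # Period-to-period increments shrink with t, so the curve is concave
        # rather than linear in sample number.
        increments = [b - a for a, b in zip(curve, curve[1:])]
        print(q, [round(p, 3) for p in curve], [round(d, 4) for d in increments])
```

      Under this assumption the curve is concave in `t`, which is why a linear sample-number regressor alone would not absorb it and a separate intertemporal-prior regressor is needed.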

      To make this clearer, we updated the main text to further clarify this on p.18:

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      We thank the reviewer for this comment. The purpose of Experiment 3 was to control for visual and motor confounds. In other words, if subjects saw a similar visual layout and were just instructed to press numbers, would we observe the vmPFC, ventral striatum, and the frontoparietal network as we did in the main experiment (Experiment 1)?

      The purpose of Experiment 2 was to establish whether what we found about Pt was unique to change detection. In Experiment 2, subjects estimated the probability that the current regime is the blue regime (just as they did in Experiment 1) except that there were no regime shifts involved. In other words, it is possible that the regions we identified were generally associated with probability estimation and not specifically with change detection. We used Experiment 2 to examine whether this was true.

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      Thank you. We received differing feedback from previous reviews on what to include in the Discussion. To address the reviewer’s concern, we will revise the Discussion to better highlight the key contributions of the current study at its beginning.

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      Many of the figures are too tiny - the writing is very small, as are the pictures of brains. I'd suggest adjusting these so they will be readable without enlarging.

      Thank you. We will enlarge the figures to make them more readable.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      (1) The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      Thank you for recognizing our contribution to the regime-change detection literature and our effort in discussing our findings in relation to the experience-based paradigms.

      (2) The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well.

      Thank you for recognizing the contribution of our Bayesian framework and system-neglect model.

      (3) The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      Thank you for recognizing our execution of model-based fMRI analyses and effort in using those analyses to link with behavioral biases.

      Weaknesses:

      My major concern is about the correlational analysis in the section "Under- and overreactions are associated with selectivity and sensitivity of neural responses to system parameters", shown in Figures 5c and d (and similarly in Figure 6). The authors argue that a frontoparietal network selectively represents sensitivity to signal diagnosticity, while the vmPFC selectively represents transition probabilities. This claim is based on separate correlational analyses for red and blue across different brain areas. The authors interpret the finding of a significant correlation in one case (blue) and an insignificant correlation (red) as evidence of a difference in correlations (between blue and red) but don't test this directly. This has been referred to as the "interaction fallacy" (Nieuwenhuis et al., 2011; Makin & Orban de Xivry, 2019). Not directly testing the difference in correlations (but only the differences to zero for each case) can lead to wrong conclusions. For example, in Figure 5c, the correlation for red is r = 0.32 (not significantly different from zero) and r = 0.48 (different from zero). However, the difference between the two is 0.1, and it is likely that this difference itself is not significant. From a statistical perspective, this corresponds to an interaction effect that has to be tested directly. It is my understanding that analyses in Figure 6 follow the same approach.

      Relevant literature on this point is:

      Nieuwenhuis, S, Forstmann, B & Wagenmakers, EJ (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci 14, 1105–1107. https://doi.org/10.1038/nn.2886

      Makin TR, Orban de Xivry, JJ (2019). Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife 8:e48175. https://doi.org/10.7554/eLife.48175

      There is also a blog post on simulation-based comparisons, which the authors could check out: https://garstats.wordpress.com/2017/03/01/comp2dcorr/

      I recommend that the authors carefully consider what approach works best for their purposes. It is sometimes recommended to directly compare correlations based on Monte-Carlo simulations (cf Makin & Orban). It might also be appropriate to run a regression with the dependent variable brain activity (Y) and predictors brain area (X) and the model-based term of interest (Z). In this case, they could include an interaction term in the model:

      Y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot Z + \beta_3 \cdot X \cdot Z

      The interaction term reflects if the relationship between the model term Z and brain activity Y is conditional on the brain area of interest X.

      Thank you for the suggestion. In response, we tested for the difference in correlation both parametrically and nonparametrically. The two tests led to the same conclusions. In the parametric test, we used the Fisher z transformation to transform the difference in correlation coefficients to the z statistic. That is, for two correlation coefficients, 𝑟<sub>1</sub> (with sample size 𝑛<sub>1</sub>) and 𝑟<sub>2</sub> (with sample size 𝑛<sub>2</sub>), the z statistic of the difference in correlation is given by

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. For the Fisher z transformation, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. Among the five ROIs in the frontoparietal network, the difference in correlation was significant in two, namely the left IFG and left IPS (one-tailed z test; left IFG: 𝑧 = 1.8355, 𝑝 = 0.0332; left IPS: 𝑧 = 2.3782, 𝑝 = 0.0087). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.7594, 𝑝 = 0.2238; right IFG: 𝑧 = 0.9068, 𝑝 = 0.1822; right IPS: 𝑧 = 1.3764, 𝑝 = 0.0843). We chose a one-tailed test because we already knew that the correlation under the blue signals was significantly greater than 0.
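      As an illustration of the parametric test described above, the following is a minimal sketch of the textbook Fisher z test for the difference between two correlation coefficients, z = (atanh 𝑟<sub>1</sub> − atanh 𝑟<sub>2</sub>) / sqrt(1/(𝑛<sub>1</sub> − 3) + 1/(𝑛<sub>2</sub> − 3)). That this is the exact form used in the response is an assumption. This textbook form also treats the two correlations as if they came from independent samples. The function names are hypothetical.

```python
import math


def upper_tail_p(z: float) -> float:
    """One-tailed p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int):
    """Textbook Fisher z test for r1 - r2 (independent-samples form).

    Returns the z statistic and the one-tailed p-value for H1: r1 > r2.
    This is a sketch under stated assumptions, not the authors' code.
    """
    z1 = math.atanh(r1)  # Fisher transformation of r1
    z2 = math.atanh(r2)  # Fisher transformation of r2
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, upper_tail_p(z)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the call signature.
    z, p = fisher_z_difference(r1=0.50, n1=30, r2=0.05, n2=30)
    print(f"z = {z:.4f}, one-tailed p = {p:.4f}")
```

      The one-tailed p-values that `upper_tail_p` gives for the z statistics reported above (1.8355 and 2.3782) round to the reported 0.0332 and 0.0087.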

      In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation (Efron & Tibshirani, 1994). We resampled the dataset with replacement (subject-wise) and used the resampled dataset to compute the difference in correlation. We repeated this 100,000 times to estimate the distribution of the difference in correlation coefficients, and tested for significance and estimated the p-value based on this distribution. Consistent with our parametric tests, here we also found that the difference in correlation was significant in left IFG and left IPS (left IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.46, 𝑝 = 0.0496; left IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.5306, 𝑝 = 0.0041), but was not significant in dmPFC, right IFG, and right IPS (dmPFC: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.1634, 𝑝 = 0.1919; right IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.2123, 𝑝 = 0.1681; right IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.3434, 𝑝 = 0.0631).
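      The subject-wise bootstrap described above can be sketched as follows. This is an illustrative sketch rather than the authors' code. The variable names are hypothetical, and the p-value definition used here (the fraction of bootstrap differences at or below zero, one-tailed) is an assumption, since the response does not specify how the p-value was read off the bootstrap distribution.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_corr_difference(behavior, neural_blue, neural_red,
                              n_boot=10_000, seed=0):
    """Bootstrap test of r_blue - r_red, resampling subjects with replacement.

    Each resample draws the same subject indices for all three variables,
    so the dependence between the two correlations is preserved.
    Returns (observed difference, one-tailed bootstrap p-value).
    """
    rng = random.Random(seed)
    n = len(behavior)
    observed = pearson_r(behavior, neural_blue) - pearson_r(behavior, neural_red)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b = [behavior[i] for i in idx]
        nb = [neural_blue[i] for i in idx]
        nr = [neural_red[i] for i in idx]
        try:
            diffs.append(pearson_r(b, nb) - pearson_r(b, nr))
        except ZeroDivisionError:
            # Degenerate resample with zero variance; skip it.
            continue
    # Assumed p-value definition: share of resampled differences <= 0.
    p_value = sum(d <= 0 for d in diffs) / len(diffs)
    return observed, p_value
```

      The response reports 100,000 resamples; the smaller default here only keeps the sketch quick to run.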

      In summary, we found that neural sensitivity to signal diagnosticity in the frontoparietal network measured at change-consistent signals significantly correlated with individual subjects’ behavioral sensitivity to signal diagnosticity (𝑟<sub>𝑏𝑙𝑢𝑒</sub>). By contrast, neural sensitivity to signal diagnosticity measured at change-inconsistent signals did not significantly correlate with behavioral sensitivity (𝑟<sub>𝑟𝑒𝑑</sub>). The difference in correlation, 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub>, however, was statistically significant in some (left IPS and left IFG) but not all brain regions within the frontoparietal network.

      To incorporate these updates, we added descriptions of the methods and results in the revised manuscript. In the Results section (p.26-27):

      “We further tested, for each brain region, whether the difference in correlation was significant using both parametric and nonparametric tests (see Parametric and nonparametric tests for difference in correlation coefficients in Methods). The two tests led to the same conclusions. In the parametric test, we used the Fisher 𝑧 transformation to transform the difference in correlation coefficients to the 𝑧 statistic. Among the five ROIs in the frontoparietal network, the difference in correlation was significant in two, namely the left IFG and left IPS (one-tailed z test; left IFG: 𝑧 = 1.8355, 𝑝 = 0.0332; left IPS: 𝑧 = 2.3782, 𝑝 = 0.0087). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.7594, 𝑝 = 0.2238; right IFG: 𝑧 = 0.9068, 𝑝 = 0.1822; right IPS: 𝑧 = 1.3764, 𝑝 = 0.0843). We chose a one-tailed test because we already knew that the correlation under change-consistent signals was significantly greater than 0. In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation. We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. Consistent with the parametric tests, we also found that the difference in correlation was significant in left IFG and left IPS (left IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.46, 𝑝 = 0.0496; left IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.5306, 𝑝 = 0.0041), but was not significant in dmPFC, right IFG, and right IPS (dmPFC: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.1634, 𝑝 = 0.1919; right IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.2123, 𝑝 = 0.1681; right IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.3434, 𝑝 = 0.0631).
      In summary, we found that neural sensitivity to signal diagnosticity measured at change-consistent signals significantly correlated with individual subjects’ behavioral sensitivity to signal diagnosticity. By contrast, neural sensitivity to signal diagnosticity measured at change-inconsistent signals did not significantly correlate with behavioral sensitivity. The difference in correlation, however, was statistically significant in some (left IPS and left IFG) but not all brain regions within the frontoparietal network.”

      In the Methods section, we added on p.53:

      “Parametric and nonparametric tests for difference in correlation coefficients. We implemented both parametric and nonparametric tests to examine whether the difference in Pearson correlation coefficients was significant. In the parametric test, we used the Fisher 𝑧 transformation to transform the difference in correlation coefficients to the 𝑧 statistic. That is, for two correlation coefficients, 𝑟<sub>1</sub> (with sample size 𝑛<sub>1</sub>) and 𝑟<sub>2</sub> (with sample size 𝑛<sub>2</sub>), the 𝑧 statistic of the difference in correlation is given by

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue balls) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red balls) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. For the Fisher 𝑧 transformation, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation (Efron & Tibshirani, 1994). That is, we resampled the dataset with replacement (subject-wise) and used the resampled dataset to compute the difference in correlation. We repeated this 100,000 times to estimate the distribution of the difference in correlation coefficients, and tested for significance and estimated the p-value based on this distribution.”

      Another potential concern is that some important details about the parameter estimation for the system-neglect model are missing. In the respective section in the methods, the authors mention a nonlinear regression using Matlab's "fitnlm" function, but it remains unclear how the model was parameterized exactly. In particular, what are the properties of this nonlinear function, and what are the assumptions about the subject's motor noise? I could imagine that by using the built-in function, the assumption was that residuals are Gaussian and homoscedastic, but it is possible that the assumption of homoscedasticity is violated, and residuals are systematically larger around p=0.5 compared to p=0 and p=1. Relatedly, in the parameter recovery analyses, the authors assume different levels of motor noise. Are these values representative of empirical values?

      We thank the reviewer for this excellent point. The reviewer touched on model parameterization, assumption of noise, and parameter recovery analysis. We answered these questions point-by-point below.

      On how our model was parameterized

      We parameterized the model according to the system-neglect model in Eq. (2) and estimated the alpha parameter separately for each level of transition probability and the beta parameter separately for each level of signal diagnosticity. As a result, we had a total of 6 parameters (3 alpha and 3 beta parameters) in the model. The system-neglect model is then called by fitnlm so that these parameters can be estimated. The term ‘nonlinear’ regression in fitnlm refers to the fact that you can specify any model (in our case the system-neglect model) and estimate its parameters when calling this function. In our use of fitnlm, we assume that the noise is Gaussian and homoscedastic (the default option).

      On the assumptions about subject’s motor noise

      We actually never called the noise ‘motor’ because it can be estimation noise as well. In the context of fitnlm, we assume that the noise is Gaussian and homoscedastic.

      On the possibility that homoscedasticity is violated

      We take the reviewer’s point. In response, we separately estimated the residual standard deviation at different probability intervals ([0.0–0.2), [0.2–0.4), [0.4–0.6), [0.6–0.8), and [0.8–1.0]). The result is shown in the figure below. The black data points are the average residual standard deviation (across subjects) and the error bars are the standard error of the mean. The residual standard deviation is indeed heteroscedastic: it is smallest at 0.1 probability, increases as probability increases, and asymptotes at 0.5 (Fig. S4).
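      A minimal sketch of this kind of binned residual analysis is given below. It assumes that residuals are grouped by the model-predicted probability (the response does not say whether bins were defined on predicted or reported probabilities), and the function and variable names are hypothetical.

```python
import bisect
import statistics

# Interior bin edges for [0.0-0.2), [0.2-0.4), [0.4-0.6), [0.6-0.8), [0.8-1.0].
INTERIOR_EDGES = [0.2, 0.4, 0.6, 0.8]


def residual_sd_by_bin(predicted, reported):
    """Sample standard deviation of residuals within each probability bin.

    Residual = reported probability estimate - model-predicted estimate.
    Bins are assigned from the predicted probability (an assumption).
    Returns a list of 5 values; None where a bin has fewer than 2 points.
    """
    bins = [[] for _ in range(len(INTERIOR_EDGES) + 1)]
    for p_hat, p_obs in zip(predicted, reported):
        # bisect_right puts a value equal to an edge into the upper bin,
        # matching the half-open intervals [a, b); 1.0 lands in the last bin.
        k = bisect.bisect_right(INTERIOR_EDGES, p_hat)
        bins[k].append(p_obs - p_hat)
    return [statistics.stdev(b) if len(b) >= 2 else None for b in bins]
```

      Computing this per subject and then averaging across subjects would give one data point per bin, with the standard error of the mean as the error bar.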

      To examine how this would affect model fitting (parameter estimation), we performed a parameter recovery analysis based on these empirically estimated, probability-dependent residual standard deviations. That is, we simulated subjects’ probability estimates using the system-neglect model, added heteroscedastic noise according to the empirical values, and then estimated the parameters of the system-neglect model. The recovered parameter estimates did not seem to be affected by the heteroscedasticity of the variance. The parameter recovery results were identical to those obtained when homoscedasticity was assumed. This suggested that although homoscedasticity was violated, it did not affect the accuracy of the parameter estimates (Fig. S4).

      We added a section ‘Impact of noise homoscedasticity on parameter estimation’ in Methods section (p.47-48) and a figure in the supplement (Fig. S4) to describe this:

      On whether the noise levels in parameter recovery analysis are representative of empirical values

      To address the reviewer’s question, we conducted a new analysis using maximum likelihood estimation to simultaneously estimate the system-neglect model and the noise level of each individual subject. To estimate each subject’s noise level, we incorporated a noise parameter into the system-neglect model. We assumed that probability estimates are noisy and modeled them with a Gaussian distribution where the noise parameter (𝜎<sub>𝑛𝑜𝑖𝑠𝑒</sub>) is the standard deviation. At each period, a probability estimate of regime shift was computed according to the system-neglect model where Θ is the set of parameters including parameters in the system-neglect model and the noise parameter. The likelihood function, 𝐿(Θ), is the probability of observing the subject’s actual probability estimate at period 𝑡, 𝑝<sub>𝑡</sub>, given Θ, 𝐿(Θ) = 𝑃(𝑝<sub>𝑡</sub>|Θ). Since we modeled the noisy probability estimates with a Gaussian distribution, we can therefore express 𝐿(Θ) as 𝐿(Θ) ~ 𝑁(𝑝<sub>𝑡</sub>; 𝑝<sub>𝑡</sub><sup>𝑆𝑁</sup>, 𝜎<sub>𝑛𝑜𝑖𝑠𝑒</sub>) where 𝑝<sub>𝑡</sub><sup>𝑆𝑁</sup> is the probability estimate predicted by the system-neglect (SN) model at period 𝑡. As a reminder, we referred to a ‘period’ as the time when a new signal appeared during a trial (for a given transition probability and signal diagnosticity). To find the maximum likelihood estimate Θ<sub>MLE</sub>, we summed the negative natural logarithm of the likelihood over all periods and used MATLAB’s fmincon function to find Θ<sub>MLE</sub>. Across subjects, we found that the mean noise estimate was 0.1735 and ranged from 0.1118 to 0.2704 (Supplementary Figure S3).”
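      The Gaussian likelihood described above can be sketched as follows. The sketch covers only the noise part of the objective. The system-neglect predictions are passed in as a placeholder list (`predicted`, a hypothetical name), because the system-neglect equation itself is not reproduced in this response. The authors minimized the summed negative log-likelihood over all parameters jointly with MATLAB's fmincon. For fixed predictions, the noise estimate has the closed form shown in `sigma_mle`.

```python
import math


def gaussian_nll(observed, predicted, sigma):
    """Summed negative log-likelihood of the observed probability estimates
    under N(predicted_t, sigma^2), taken over all periods t."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(sigma) + 0.5 * n * math.log(2.0 * math.pi) + sse / (2.0 * sigma ** 2)


def sigma_mle(observed, predicted):
    """Closed-form maximum-likelihood noise SD for fixed model predictions:
    the root mean squared residual."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sse / n)


if __name__ == "__main__":
    # Hypothetical numbers, used only to exercise the functions.
    observed = [0.20, 0.40, 0.60, 0.90]
    predicted = [0.30, 0.30, 0.70, 0.80]
    s = sigma_mle(observed, predicted)
    print(s, gaussian_nll(observed, predicted, s))
```

      In a joint fit, `predicted` would be recomputed from the candidate system-neglect parameters at each step of the optimizer, with `sigma` as one more free parameter.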

      Compared with our original parameter recovery analysis, where the maximum noise level was set at 0.1, our data indicated that some subjects’ noise was larger than this value. Therefore, we expanded our parameter recovery analysis to include noise levels beyond 0.1, up to 0.3. The results are now updated in Supplementary Fig. S3.

      We updated the parameter recovery section (p. 47) in Methods:

      The main study is based on N=30 subjects, as are the two control studies. Since this work is about individual differences (in particular with respect to neural representations of noise and transition probabilities in the frontoparietal network and the vmPFC), I'm wondering how robust the results are. Is it likely that the results would replicate with a larger number of subjects? Can the two control studies be leveraged to address this concern to some extent?

      We can address the issue of robustness through looking at the effect size. In particular, with respect to individual differences in neural sensitivity of transition probability and signal diagnosticity, since the significant correlation coefficients between neural and behavioral sensitivity were between 0.4 and 0.58 for signal diagnosticity in frontoparietal network (Fig. 5C), and -0.38 and -0.37 for transition probability in vmPFC (Fig. 5D), the effect size of these correlation coefficients was considered medium to large (Cohen, 1992).

      It would be challenging to use the control studies to address the robustness concern. The two control studies did not allow us to examine individual differences, in particular with respect to neural selectivity of noise and transition probability, and we therefore think they are unlikely to help here. Having said that, it is possible to look at neural selectivity of noise (signal diagnosticity) in the first control experiment, where subjects estimated the probability of the blue regime in a task where there was no regime change (transition probability was 0). However, the fact that there were no regime shifts changed the nature of the task. Instead of always starting at the Red regime as in the main experiment, in the first control experiment we randomly picked the regime to draw the signals from. It also changed the meaning and the dynamics of the signals (red and blue) that would appear. In the main experiment the blue signal is a signal consistent with change, but in the control experiment this is no longer the case. In the main experiment, the frequency of blue signals is contingent upon both noise and transition probability. In general, blue signals are less frequent than red signals because of small transition probabilities. But in the first control experiment, blue signals may not be less frequent because the regime was blue in half of the trials. Due to these differences, we do not see how analyzing the control experiments could help in establishing robustness because we do not have a good prediction as to whether and how the neural selectivity would be impacted by these differences.

      It seems that the authors have not counterbalanced the colors and that subjects always reported the probability of the blue regime. If so, I'm wondering why this was not counterbalanced.

      We are aware of the reviewer’s concern. The first reason we did not counterbalance these variables (color, and whether subjects reported the blue or red regime) was to avoid confusing the subjects in an already complicated task. Balancing these two variables also comes at the cost of sample size, which was the second reason we did not do it. Although we could have balanced them at the between-subject level so as not to increase task complexity, doing so could have introduced another confound, namely individual differences in how people respond to these variables. This is the third reason we were hesitant to counterbalance.

      Reviewer #2 (Public review):

      Summary:

      This paper focuses on understanding the behavioral and neural basis of regime shift detection, a common yet hard problem that people encounter in an uncertain world.

      Using a regime-shift task, the authors examined cognitive factors influencing belief updates by manipulating signal diagnosticity and environmental volatility. Behaviorally, they have found that people demonstrate both over and under-reaction to changes given different combinations of task parameters, which can be explained by a unified system-neglect account. Neurally, the authors have found that the vmPFC-striatum network represents current belief as well as belief revision unique to the regime detection task. Meanwhile, the frontoparietal network represents cognitive factors influencing regime detection i.e., the strength of the evidence in support of the regime shift and the intertemporal belief probability. The authors further link behavioral signatures of system neglect with neural signals and have found dissociable patterns, with the frontoparietal network representing sensitivity to signal diagnosticity when the observation is consistent with regime shift and vmPFC representing environmental volatility, respectively. Together, these results shed light on the neural basis of regime shift detection especially the neural correlates of bias in belief update that can be observed behaviorally.

      Strengths:

      (1) The regime-shift detection task offers a solid ground to examine regime-shift detection without the potential confounding impact of learning and reward. Relatedly, the system-neglect modeling framework provides a unified account for both over or under-reacting to environmental changes, allowing researchers to extract a single parameter reflecting people's sensitivity to changes in decision variables and making it desirable for neuroimaging analysis to locate corresponding neural signals.

      Thank you for recognizing our task design and our system-neglect computational framework in understanding change detection.

      (2) The analysis for locating brain regions related to belief revision is solid. Within the current task, the authors look for brain regions whose activation covary with both current belief and belief change. Furthermore, the authors have ruled out the possibility of representing mere current belief or motor signal by comparing the current study results with two other studies. This set of analyses is very convincing.

      Thank you for recognizing our control studies in ruling out potential motor confounds in our neural findings on belief revision.

      (3) The section on using neuroimaging findings (i.e., the frontoparietal network is sensitive to evidence that signals regime shift) to reveal nuances in behavioral data (i.e., belief revision is more sensitive to evidence consistent with change) is very intriguing. I like how the authors structure the flow of the results, offering this as an extra piece of behavioral findings instead of ad-hoc implanting that into the computational modeling.

      Thank you for appreciating how we showed that neural insights can lead to new behavioral findings.

      Weaknesses:

      (1) The authors have presented two sets of neuroimaging results, and it is unclear to me how to reason between these two sets of results, especially for the frontoparietal network. On one hand, the frontoparietal network represents belief revision but not variables influencing belief revision (i.e., signal diagnosticity and environmental volatility). On the other hand, when it comes to understanding individual differences in regime detection, the frontoparietal network is associated with sensitivity to change and consistent evidence strength. I understand that belief revision correlates with sensitivity to signals, but it can probably benefit from formally discussing and connecting these two sets of results in discussion. Relatedly, the whole section on behavioral vs. neural slope results was not sufficiently discussed and connected to the existing literature in the discussion section. For example, the authors could provide more context to reason through the finding that striatum (but not vmPFC) is not sensitive to volatility.

      We thank the reviewer for the valuable suggestions.

      With regard to the first comment, we wish to clarify that we did not find the frontoparietal network to represent belief revision. It was the vmPFC and ventral striatum that we found to represent belief revision (delta Pt in Fig. 3). For the frontoparietal network, we identified its involvement in our task through finding that its activity correlated with strength of change evidence (Fig. 4) and individual subjects’ sensitivity to signal diagnosticity (Fig. 5). Conceptually, these two findings reflect how individuals interpret the signals (signals consistent or inconsistent with change) in light of signal diagnosticity. This is because (1) strength of change evidence is defined as signals (+1 for signal consistent with change, and -1 for signal inconsistent with change) multiplied by signal diagnosticity and (2) sensitivity to signal diagnosticity reflects how individuals subjectively evaluate signal diagnosticity. At the theoretical level, these two findings can be interpreted through our computational framework in that both the strength of change evidence and sensitivity to signal diagnosticity contribute to estimating the likelihood of change (Eqs. 1 and 2). We added a paragraph in Discussion to talk about this.

      We added on p. 36:

      “For the frontoparietal network, we identified its involvement in our task through finding that its activity correlated with strength of change evidence (Fig. 4) and individual subjects’ sensitivity to signal diagnosticity (Fig. 5). Conceptually, these two findings reflect how individuals interpret the signals (signals consistent or inconsistent with change) in light of signal diagnosticity. This is because (1) strength of change evidence is defined as signals (+1 for signal consistent with change, and −1 for signal inconsistent with change) multiplied by signal diagnosticity and (2) sensitivity to signal diagnosticity reflects how individuals subjectively evaluate signal diagnosticity. At the theoretical level, these two findings can be interpreted through our computational framework in that both the strength of change evidence and sensitivity to signal diagnosticity contribute to estimating the likelihood of change (Equations 1 and 2 in Methods).”

      With regard to the second comment, we added a discussion on the behavioral and neural slope comparison. We pointed out previous papers conducting similar analysis (Vilares et al., 2011; Ting et al., 2015; Yang & Wu, 2020), their findings and how they relate to our results. Vilares et al. found that sensitivity to prior information (uncertainty in prior distribution) in the orbitofrontal cortex (OFC) and putamen correlated with behavioral measure of sensitivity to prior. In the current study, transition probability acts as prior in the system-neglect framework (Eq. 1) and we found that ventromedial prefrontal cortex represents subjects’ sensitivity to transition probability. Together, these results suggest that OFC (with vmPFC being part of OFC, see Wallis, 2011) is involved in the subjective evaluation of prior information in both static (Vilares et al., 2011) and dynamic environments (current study).

      We added on p. 37-38:

      “In the current study, our psychometric-neurometric analysis focused on comparing behavioral sensitivity with neural sensitivity to the system parameters (transition probability and signal diagnosticity). We measured sensitivity by estimating the slope of behavioral data (behavioral slope) and neural data (neural slope) in response to the system parameters. Previous studies had adopted a similar approach (Ting et al., 2015a; Vilares et al., 2012; Yang & Wu, 2020). For example, Vilares et al. (2012) found that sensitivity to prior information (uncertainty in prior distribution) in the orbitofrontal cortex (OFC) and putamen correlated with behavioral measure of sensitivity to the prior.

      In the current study, transition probability acts as prior in the system-neglect framework (Eq. 2 in Methods) and we found that ventromedial prefrontal cortex represents subjects’ sensitivity to transition probability. Together, these results suggest that OFC (with vmPFC being part of OFC, see Wallis, 2011) is involved in the subjective evaluation of prior information in both static (Vilares et al., 2012) and dynamic environments (current study). In addition, distinct from vmPFC in representing sensitivity to transition probability or prior, we found through the behavioral-neural slope comparison that the frontoparietal network represents how sensitive individual decision makers are to the diagnosticity of signals in revealing the true state (regime) of the environment.”
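
      As a rough illustration of the slope comparison described above, sensitivity can be summarized as the slope of a response measure regressed on a system parameter. The sketch below assumes an ordinary least-squares slope; the manuscript's exact estimation procedure is not given here, and the function name and all numbers are hypothetical.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x (assumed estimator)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x and cross-deviations of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical levels of a system parameter and made-up responses for one subject.
parameter_levels = [0.0, 1.0, 2.0]
behavioral_response = [1.0, 3.0, 5.0]   # e.g. mean probability estimates (made up)
neural_response = [0.2, 0.5, 0.8]       # e.g. regression coefficients (made up)

behavioral_slope = ols_slope(parameter_levels, behavioral_response)
neural_slope = ols_slope(parameter_levels, neural_response)
```

      Across subjects, the behavioral and neural slopes obtained this way could then be correlated with each other.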

      (2) More details are needed for behavioral modeling under the system-neglect framework, particularly results on model comparison. I understand that this model has been validated in previous publications, but it is unclear to me whether it provides a superior model fit in the current dataset compared to other models (e.g., a model without \alpha or \beta). Relatedly, I wonder whether the final result section can be incorporated into modeling as well - i.e., the authors could test a variant of the model with two \betas depending on whether the observation is consistent with a regime shift and conduct model comparison.

      Thank you for the great suggestion. We rewrote the final Results section to focus specifically on model comparison. Following the reviewer’s suggestion, we separately estimated beta parameters for change-consistent and change-inconsistent signals and found that these models were indeed better than the original system-neglect model.

      To incorporate these new findings, we rewrote the entire final Results section, “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p. 28-30).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Use line numbers for the next round of reviews.

      We added line numbers in the revised manuscript.

      (2) Figure 2b: Can the empirical results be reproduced by the system-neglect model? This would complement the analyses presented in Figure S4.

      Yes. We have added Figure S6, which is based on system-neglect model fits. For each subject, we first computed period-by-period probability estimates based on the parameter estimates of the system-neglect model. Second, we computed the index of overreaction (IO) for each combination of transition probability and signal diagnosticity. Third, we plotted the IO as we did for the empirical results in Fig. 2b. The empirical results in Fig. 2b are similar to the system-neglect model results shown in Figure S6, indicating that the empirical results can be reproduced by the model.

      (3) Page 14: Instead of referring to the "Methods" in general, you could be more specific about where the relevant information can be found.

      Fixed. We changed “See Methods” to “See System-neglect model in Methods”.

      (4) Page 18: Consider avoiding the term "more significantly". Consider effect sizes if interested in comparing effects to each other.

      Fixed. On page 19, we changed that to

      “In the second analysis, we found that for both vmPFC and ventral striatum, the regression coefficient of 𝑃<sub>𝑡</sub> was significantly different between Experiment 1 and Experiment 2 (Fig. 3C) and between Experiment 1 and Experiment 3 (Fig. 3D; also see Tables S5 and S6 in SI).”

      (5) Page 30: Cite key studies using reversal-learning paradigms. Currently, readers less familiar with the literature might have difficulties with this.

      We now cite key studies using reversal-learning paradigms on p.32:

      “Our work is closely related to the reversal-learning paradigm—the standard paradigm in neuroscience and psychology to study change detection (Fellows & Farah, 2003; Izquierdo et al., 2017; O'Doherty et al., 2001; Schoenbaum et al., 2000; Walton et al., 2010). In a typical reversal-learning task, human or animal subjects choose between two options that differ in the reward magnitude or probability of receiving a reward. Through reward feedback the participants gradually learn the reward contingencies associated with the options and have to update knowledge about reward contingencies when contingencies are switched in order to maximize rewards.”

      Reviewer #2 (Recommendations for the authors):

      (1) Some literature on change detection seems missing. For example, the author should also cite Muller, T. H., Mars, R. B., Behrens, T. E., & O'Reilly, J. X. (2019). Control of entropy in neural models of environmental state. elife, 8, e39404. This paper suggests that medial PFC is correlated with the entropy of the current state, which is closely related to regime change and environmental volatility.

      Thank you for pointing to this paper. We have now added it and other related papers in the Introduction and Discussion.

      In Introduction, we added on p.5-6:

      “Different behavioral paradigms, most notably reversal learning, and computational models were developed to investigate its neurocomputational substrates (Behrens et al., 2007; Izquierdo et al., 2017; Payzan-LeNestour et al., 2011, 2013; Nasser et al., 2010; McGuire et al., 2014; Muller et al., 2019). Key findings on the neural implementations for such learning include identifying brain areas and networks that track volatility in the environment (rate of change) (Behrens et al., 2007), the uncertainty or entropy of the current state of the environment (Muller et al., 2019), participants’ beliefs about change (Payzan-LeNestour et al., 2011; McGuire et al., 2014; Kao et al., 2020), and their uncertainty about whether a change had occurred (McGuire et al., 2014; Kao et al., 2020).”

      In Discussion (p.35), we added a new paragraph:

      “Related to OFC function in decision making and reinforcement learning, Wilson et al. (2014) proposed that OFC is involved in inferring the current state of the environment. For example, medial OFC had been shown to represent probability distribution on possible states of the environment (Chan et al., 2016), the current task state (Schuck et al., 2016) and uncertainty or entropy associated with the state of the environment (Muller et al., 2019). In the context of regime-shift detection, regimes can be regarded as states of the environment and therefore a change in regime indicates a change in the state of the environment. Muller et al. (2019) found that in dynamic environments where changes in the state of the environment happen regularly, medial OFC represented the level of uncertainty in the current state of the environment. Our finding that vmPFC represented individual participants’ probability estimates of regime shifts suggests that vmPFC and/or OFC are involved in inferring the current state of the environment through estimating whether the state has changed. Our finding that vmPFC represented individual participants’ sensitivity to transition probability further suggests that vmPFC and/or OFC contribute to individual participants’ biases in state inference (over- and underreactions to change) in how these brain areas respond to the volatility of the environment.”

      (2) The language used when describing the selective relationship between frontoparietal network activation and change-consistent signal can be clearer. When describing separating those two signals, the authors refer to them as when the 'blue' signal shows up and when the 'red' signal shows up, assuming that the current belief state is blue. This is a little confusing cuz it is hard to keep in mind what is the default color in this example. It would be more intuitive if the author used language such as the 'change consistent' signal.

      Thank you for the suggestion. We have changed the wording according to your suggestion. That is, we say ‘change-consistent (blue) signals’ and ‘change-inconsistent (red) signals’ throughout pages 22-28.

      (3) Figure 4B highlights dmPFC. However, in the associated text, it says p = .10 so it is not significant. To avoid misleading readers, I would recommend pointing this out explicitly beyond saying 'most brain regions in the frontoparietal network also correlated with the intertemporal prior'.

      Thank you for pointing this out. We now say on p.20

      “With independent (leave-one-subject-out, LOSO) ROI analysis, we examined whether brain regions in the frontoparietal network (shown to represent strength of change evidence) correlated with intertemporal prior and found that all brain regions, with the exception of dmPFC, in the frontoparietal network correlated with the intertemporal prior.”

      (4) There is a full paragraph in the discussion talking about the central opercular cortex, but this terminology has not shown up in the main body of the paper. If this is an important brain region to the authors, I would recommend mentioning it more often in the result section.

      Thank you for this suggestion. We have now added central opercular cortex in the Results section (p.18):

      “For 𝑃<sub>𝑡</sub>, we found that the ventromedial prefrontal cortex (vmPFC) and ventral striatum correlated with this behavioral measure of subjects’ belief about change. In addition, many other brain regions, including the motor cortex, central opercular cortex, insula, occipital cortex, and the cerebellum also significantly correlated with 𝑃<sub>𝑡</sub>.”

      (5) The authors have claimed that people make more extreme estimates under high diagnosticity (Supplementary Figure 1). This is an interesting point because it seems to be different from what is shown in the main graph where it seems that people are not extreme enough compared to an ideal Bayesian observer. I understand that these are effects being investigated under different circumstances. It would be helpful if for Supplementary Figure 1 the authors could overlay, or generate a different figure showing what an ideal Bayesian observer would do in this situation.

      We thank the reviewer for pointing this out. We wish to clarify that when we said “more extreme estimates under high diagnosticity” we meant compared with low diagnosticity and not with the ideal Bayesian observer. We clarified this point by rephrasing our sentence on p.11:

      “We also found that subjects tended to give more extreme Pt under high signal diagnosticity than low diagnosticity (Fig. S1 in Supplementary Information, SI).”

      When it comes to comparing subjects’ probability estimates with the normative Bayesian, subjects tended to “underreact” under high diagnosticity. This can be seen in Fig. 4B, which shows a trend of increasing underreaction (or decreasing overreaction) as diagnosticity increased (row-wise comparison for a given transition probability).

      We see the reviewer’s point in overlaying the Bayesian on Fig. S1 and update it by adding the normative Bayesian in orange.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Silbaugh, Koster, and Hansel investigated how the cerebellar climbing fiber (CF) signals influence neuronal activity and plasticity in mouse primary somatosensory (S1) cortex. They found that optogenetic activation of CFs in the cerebellum modulates responses of cortical neurons to whisker stimulation in a cell-type-specific manner and suppresses potentiation of layer 2/3 pyramidal neurons induced by repeated whisker stimulation. This suppression of plasticity by CF activation is mediated through modulation of VIP- and SST-positive interneurons. Using transsynaptic tracing and chemogenetic approaches, the authors identified a pathway from the cerebellum through the zona incerta and the thalamic posterior medial (POm) nucleus to the S1 cortex, which underlies this functional modulation.

      Strengths:

      This study employed a combination of modern neuroscientific techniques, including two-photon imaging, opto- and chemo-genetic approaches, and transsynaptic tracing. The experiments were thoroughly conducted, and the results were clearly and systematically described. The interplay between the cerebellum and other brain regions - and its functional implications - is one of the major topics in this field. This study provides solid evidence for an instructive role of the cerebellum in experience-dependent plasticity in the S1 cortex.

      Weaknesses:

      There may be some methodological limitations, and the physiological relevance of the CF-induced plasticity modulation in the S1 cortex remains unclear. In particular, it has not been elucidated how CF activity influences the firing patterns of downstream neurons along the pathway to the S1 cortex during stimulation.

      Our study addresses the important question of whether CF signaling can influence the activity and plasticity of neurons outside the olivocerebellar system, and further identifies the mechanism through which this indeed occurs. We provide a detailed description of the involvement of specific neuron subtypes and how they are modulated by climbing fiber activation to impact S1 plasticity. We also identify at least one critical pathway from the cerebellar output to the S1 circuit. It is indeed correct that we did not investigate how the specific firing patterns of all of these downstream neurons are affected, or the natural behaviors in which this mechanism is involved. Now that it is established that CF signaling can impact activity and plasticity outside the olivocerebellar system -- and even in the primary somatosensory cortex -- these questions will be important to further investigate in future studies.

      (1) Optogenetic stimulation may have activated a large population of CFs synchronously, potentially leading to strong suppression followed by massive activation in numerous cerebellar nuclear (CN) neurons. Given that there is no quantitative estimation of the stimulated area or number of activated CFs, observed effects are difficult to interpret directly. The authors should at least provide the basic stimulation parameters (coordinates of stim location, power density, spot size, estimated number of Purkinje cells included, etc.).

      As discussed in the paper, we indeed expect that synchronous CF activation is needed to allow for an effect on S1 circuits under natural or optogenetic activation conditions. The basic optogenetic stimulation parameters (also stated in the methods) are as follows: 470 nm LED; Ø200 µm core, 0.39 NA rotary joint patch cable; absolute power output of 2.5 mW; spot size at the surface of the cortex 0.6 mm; estimated power density 8 mW/mm². A serious estimate of the number of Purkinje cells that are activated is difficult to provide, in particular as ‘activation’ would refer to climbing fiber inputs, not Purkinje cells directly.
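
      For readers who want to relate the stated power and spot size to the stated power density, a back-of-the-envelope check is sketched below. It assumes that the 0.6 mm spot size is the diameter of a uniformly illuminated circular spot; that assumption and the function name are ours for illustration, not the authors' stated method.

```python
import math


def power_density_mw_per_mm2(power_mw: float, spot_diameter_mm: float) -> float:
    """Mean power density over a uniformly illuminated circular spot (assumed geometry)."""
    spot_area_mm2 = math.pi * (spot_diameter_mm / 2) ** 2
    return power_mw / spot_area_mm2


# Values quoted in the response: 2.5 mW absolute power, 0.6 mm spot size.
density = power_density_mw_per_mm2(power_mw=2.5, spot_diameter_mm=0.6)
print(f"{density:.1f} mW/mm^2")
```

      Under these assumptions the result is about 8.8 mW/mm², of the same order as the quoted estimate; the difference may reflect rounding or a different assumption about the illumination profile.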

      (2) There are CF collaterals directly innervating CN (PMID:10982464). Therefore, antidromic spikes induced by optogenetic stimulation may directly activate CN neurons. On the other hand, a previous study reported that CN neurons exhibit only weak responses to CF collateral inputs (PMID: 27047344). The authors should discuss these possibilities and the potential influence of CF collaterals on the interpretation of the results.

      A direct activation of CN neurons by antidromic spikes in CF collaterals cannot be ruled out. However, we believe that this effect will not be substantial. The activation of the multi-synaptic pathway that we describe in this study is more likely to require a strong nudge as resulting from synchronized Purkinje cell input and subsequent rebound activation in CN neurons (PMID: 22198670), rather than small-amplitude input provided by CF collaterals (PMID: 27047344). A requirement for CF/PC synchronization would also set a threshold for activation of this suppressive pathway.

      (3) The rationale behind the plasticity induction protocol for RWS+CF (50 ms light pulses at 1 Hz during 5 min of RWS, with a 45 ms delay relative to the onset of whisker stimulation) is unclear.

      a) The authors state that 1 Hz was chosen to match the spontaneous CF firing rate (line 107); however, they also introduced a delay to mimic the CF response to whisker stimulation (line 108). This is confusing, and requires further clarification, specifically, whether the protocol was designed to reproduce spontaneous or sensory-evoked CF activity.

      This protocol was designed to mimic sensory-evoked CF activity as reported in Bosman et al (J. Physiol. 588, 2010; PMID: 20724365).

      b) Was the timing of delivering light pulses constant or random? Given the stochastic nature of CF firing, randomly timed light pulses with an average rate of 1Hz would be more physiologically relevant. At the very least, the authors should provide a clear explanation of how the stimulation timing was implemented.

      Light pulses were delivered at a constant 1 Hz. Our goal was to isolate synchrony as the variable distinguishing sensory-evoked from spontaneous CF activity; additionally varying stochasticity, rate, or amplitude would have confounded this. Future studies could explore how these additional parameters shape S1 responses.

      (4) CF activation modulates inhibitory interneurons in the S1 cortex (Figure 2): responses of interneurons in S1 to whisker stimulation were enhanced upon CF coactivation (Figure 2C), and these neurons were predominantly SST- and PV-positive interneurons (Figure 2H, I). In contrast, VIP-positive neurons were suppressed only in the late time window of 650-850 ms (Figure 2G). If the authors' hypothesis, that the activity of VIP neurons regulates SST- and PV-neuron activity during RWS+CF, is correct, then the activity of SST- and PV-neurons should also be increased during this late time window. The authors should clarify whether such temporal dynamics were observed or could be inferred from their data.

      Yes, we see a significant activity increase in PV neurons in this late time window (see updates to Data S2). Activity was also increased in SST neurons, though this did not reach statistical significance (Data S2). One reason might be that, given the small effect size overall, such an effect would only be seen in paired recordings. Chemogenetic activity modulation in VIP neurons, which provides a cruder test, shows, however, that SST- and PV-positive interneurons are indeed regulated via inhibition from VIP-positive interneurons (Fig. 5).

      (5) Transsynaptic tracing from CN nicely identified zona incerta (ZI) neurons and their axon terminals in both POm and S1 (Figure 6 and Figure S7).

      a) Which part of the CN (medial, interposed, or lateral) is involved in this pathway is unclear.

      We used a dual-injection transsynaptic tracing approach to specifically label the outputs of ZI neurons that receive input from the deep cerebellar nuclei. The anterograde viral vector injected into the CN is unlabeled (no fluorophore) and therefore, it is not possible to reliably assess the extent of viral spread in those experiments as performed. However, we have previously performed similar injections into the deep cerebellar nuclei and post hoc histology suggests all three nuclei will have at least some viral expression (Koster and Sherman, 2024). Due to size and injection location, we will mostly have reached the lateral (dentate) nuclei, but cannot exclude partial transsynaptic tracing from the interposed and medial nuclei.

      b) Were the electrophysiological properties of these ZI neurons consistent with those of PV neurons?

      Although most recorded cells demonstrated electrophysiological properties consistent with PV+ interneurons in other brain regions (i.e. fast spiking, narrow spike width, non-adapting; see Tremblay et al., 2016), interneuron subtypes in the ZI have been incompletely characterized, with SST+ cells showing similar features to those typically associated with PV+ cells (if interested, compare Fig. 4 in DOI: 10.1126/sciadv.abf6709 vs. Fig. S10 in https://doi.org/10.1016/j.neuron.2020.04.027). Therefore, we did not attempt to delineate cell identity based on these characteristics.

      c) There appears to be a considerable number of axons of these ZI neurons projecting to the S1 cortex (Figure S7C). Would it be possible to estimate the relative density of axons projecting to the POm versus those projecting to S1? In addition, the authors should discuss the potential functional role of this direct pathway from the ZI to the S1 cortex.

      An absolute quantification is difficult to provide based on the images that we obtained. However, any crude estimate would indicate the relative density of projections to POm is higher than the density of projections to S1 (this is apparent from the images themselves). While the anatomical and functional connections from POm to S1 have been described in detail (Audette et al., 2018), this is not the case for the direct projections from ZI to S1. A direct ZI to S1 projection would potentially involve a different recruitment of neurons in the S1 circuit. Any discussion on the specific consequences of the activation of this direct pathway would be purely speculative.

      Reviewer #2 (Public review):

      Summary:

      The authors examined long-distance influence of climbing fiber (CF) signaling in the somatosensory cortex by manipulating whiskers through stimulation. Also, they examined CF signaling using two-photon imaging and mapped projections from the cerebellum to the somatosensory cortex using transsynaptic tracing. As a final manipulation, they used chemogenetics to perturb parvalbumin-positive neurons in the zona incerta and recorded from climbing fibers.

      Strengths:

      There are several strengths to this paper. The recordings were carefully performed, and AAVs used were selective and specific for the cell types and pathways being analyzed. In addition, the authors used multiple approaches that support climbing fiber pathways to distal regions of the brain. This work will impact the field and describes nice methods to target difficult-to-reach brain regions, such as the inferior olive.

      Weaknesses:

      There are some details in the methods that could be explained further. The discussion was very short and could connect the findings in a broader way.

      In the revised manuscript, we provide more methodological details, as requested. We kept the explanations in the discussion as simple as possible, so as not to bias further investigations into this novel phenomenon. In particular, we avoid an extended discussion of the gating effect of CF activity on S1 plasticity. While this is the effect on plasticity specifically observed here, we believe that the consequences of CF signaling on S1 activity may entirely depend on the contexts in which CF signals are naturally recruited, the ongoing activity of other brain regions, and behavioral state. Our key finding is that such modulation of neocortical plasticity can occur. How CF signaling controls plasticity of the neocortex in all contexts remains unknown, but needs to be thoughtfully tested in the future.

      Reviewer #3 (Public review):

      Summary:

      The authors developed an interesting novel paradigm to probe the effects of cerebellar climbing fiber activation on short-term adaptation of somatosensory neocortical activity during repetitive whisker stimulation. Normally, RWS potentiated whisker responses in pyramidal cells and weakly suppressed them in interneurons, lasting for at least 1 h. Crus II optogenetic climbing fiber activation during RWS reduced or inverted these adaptive changes. This effect was generally mimicked or blocked with chemogenetic SST or VIP activation/suppression as predicted based on their "sign" in the circuit.

      Strengths:

      The central finding about CF modulation of S1 response adaptation is interesting, important, and convincing, and provides a jumping-off point for the field to start to think carefully about cerebellar modulation of neocortical plasticity.

      Weaknesses:

      The SST and VIP results appeared slightly weaker statistically, but I do not personally think this detracts from the importance of the initial finding (if there are multiple underlying mechanisms, modulating one may reproduce only a fraction of the effect size). I found the suggestion that zona incerta may be responsible for the cerebellar effects on S1 to be a more speculative result (it is not so easy with existing technology to effectively modulate this type of polysynaptic pathway), but this may be an interesting topic for the authors to follow up on in more detail in the future.

      Our interpretation of the anatomical and physiological findings is that a pathway via the ZI is indeed critical for the observed effects. This pathway also represents perhaps the most direct pathway (i.e. the fewest synapses connecting the cerebellar nuclei to S1). However, several other direct and indirect pathways are plausible as well and we expect distinct activation requirements and consequences for neurons in the S1 circuit. These are indeed interesting topics for future investigation.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 77: "CF transients" is not a standard or widely recognized term. Please use a more precise expression, such as "CF-induced calcium transients."

      We now avoid the use of the term “CF transients” and replaced it with “CF-induced calcium transients.”

      (2) Titer of AAVs injected should be provided.

      AAV titers have been included in an additional data table (Data S9).

      (3) Several citations to the figures are incorrect (for example, "Supplementary Data 2a (Line 398)" does not exist).

      We apologize for the mistakes in this version of the article. Incorrect citations to the figures have been corrected.

      (4) Line 627-628: "The tip of the patch cable was centered over Crus II in all optogenetic stimulation experiments." The stereotaxic coordinate of the tip position should be provided.

      The stereotaxic coordinate of the tip position has been provided in the methods.

      (5) Line 629: "Blue light pulses were delivered with a 470 nm Fiber-Coupled LED (Thorlabs catalog: M470F3)." The size of the light stim and estimated power density (W/mm^2) at the surface of the cortex should be provided.

      The spot size and estimated power density at the surface of the cortex have been provided in the methods.

      (6) Line 702-706: References for DCZ should be cited.

      We now cite Nagai et al, Nat. Neurosci. 23 (2020) as the original reference.

      (7) Two-photon image processing (Line 807-809): The rationale for normalizing ∆F/F traces to a pre-stimulus baseline is unclear because ∆F/F is, by definition, already normalized to baseline fluorescence: (Ft-F0)/F0. The authors should clarify why this additional normalization step was necessary and how it affected the interpretation of the data.

      A single baseline fluorescence value (F₀) was computed for each neuron across the entire recording session, which lasted ~120 minutes. However, some S1 neurons exhibit fluctuations in baseline fluorescence over time (often related to locomotive activity or spontaneous network oscillations), which can obscure stimulus-evoked changes. To isolate fluorescence changes specifically attributable to whisker stimulation, we normalized each ∆F/F trace to the pre-stimulus baseline for that trial. This additional normalization allowed us to quantify potentiation or depression of sensory responses themselves, independently of spontaneous oscillations or locomotion-related changes in the ongoing neural activity.
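
      The two-step procedure described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that "normalizing to the pre-stimulus baseline" means subtracting the mean pre-stimulus ∆F/F of each trial (the response does not say whether subtraction or division was used), and all function names and numbers are hypothetical.

```python
def delta_f_over_f(raw_trace, f0):
    """Session-level normalization: (F_t - F0) / F0 with one F0 per neuron."""
    return [(f - f0) / f0 for f in raw_trace]


def baseline_correct_trial(dff_trial, n_prestim_samples):
    """Per-trial step (assumed): subtract the mean dF/F of the pre-stimulus window."""
    baseline = sum(dff_trial[:n_prestim_samples]) / n_prestim_samples
    return [value - baseline for value in dff_trial]


# Made-up raw fluorescence for one trial; the first two samples precede the stimulus.
session_f0 = 100.0
raw_trial = [110.0, 110.0, 132.0, 121.0]

dff_trial = delta_f_over_f(raw_trial, session_f0)   # elevated baseline of about 0.1
corrected = baseline_correct_trial(dff_trial, n_prestim_samples=2)
```

      In this toy trial the session-level ∆F/F sits at about 0.1 before the stimulus (for example because of ongoing locomotion), and the per-trial step removes that offset so that only the stimulus-evoked change remains.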

      Reviewer #2 (Recommendations for the authors):

      (1) Did the climbing fiber stimulation for Figure 1 result in any changes to motor activity? Can you make any additional comments on other behaviors that were observed during these manipulations?

      Acute CF stimulation did not cause any changes in locomotive or whisking activity. The CF stimulation also did not influence the overall level of locomotion or whisking during plasticity induction.

      (2) Figure 3B and F- it is very difficult to see the SST+ neurons. Can this be enhanced?

      We linearly adjusted the brightness and contrast for the bottom images in Figure 3B and F to improve visualization of SST+ neurons. Note the expression of both hM3D(Gq) and hM4D(Gi) in SST+ neurons is sparse, which was necessary to avoid off-target effects.

      (3) Can you be more specific about the subregions of cerebellar nuclei and cell types that are targeted in the tracing studies? Discussions of the cerebellar nuclei subregions are missing and would be interesting, as others have shown discrete pathways between cerebellar nuclei subregions and long-distance projections.

      See our response to comment 5a from Reviewer 1 (copied again here): we used a dual-injection transsynaptic tracing approach to specifically label the outputs of ZI neurons that receive input from the deep cerebellar nuclei. The anterograde viral vector injected into the CN is unlabeled (no fluorophore) and therefore, it is not possible to reliably assess the extent of viral spread in those experiments as performed. However, we have previously performed similar injections into the deep cerebellar nuclei and post hoc histology suggests all three nuclei will have at least some viral expression (Koster and Sherman, 2024). Due to size and injection location, we will mostly have reached the lateral (dentate) nuclei, but cannot exclude partial transsynaptic tracing from the interposed and medial nuclei.

      It would indeed be interesting to further investigate the effect of CFs residing in different cerebellar lobules, which preferentially target different cerebellar nuclei, on targets of these nuclei.

      (4) Did you see any connection to the ventral tegmental area? Can you comment on whether dopamine pathways are influenced by CF and in your manipulations?

      We did not specifically look at these pathways and thus are not able to comment on this.

      (5) These are intensive surgeries, do you think glia could have influenced any results?

      This was not tested and seems unlikely, but we cannot exclude this possibility.

      (6) It is unclear in the methods how long animals were recorded for in each experiment. Can you add more detail?

      Additional detail was added to the methods. Recordings for all experimental configurations did not last more than 120 minutes in total. All data were analyzed across identical time windows for each experiment.

      (7) In the methods it was mentioned that recording length can differ between animals. Can this influence the results, and if so, how was that controlled for?

      There was a variance in recording length within experimental groups, but no systematic difference between groups.

      (8) I do not see any mention of animal sex throughout this manuscript. If animals were mixed groups, were sex differences considered? Would it be expected that CF activity would be different in male and female mice?

      As mentioned in the Methods (Animals), mice of either sex were used. No sex-dependent differences were observed.

      (9) Transsynaptic tracing results of the zona incerta are very interesting. The zona incerta is highly understudied, but has been linked to feeding, locomotion, arousal, and novelty seeking. Do you think this pathway would explain some of the behavioral results found through other studies of cerebellar lobule perturbations? Some discussion of how this brain region would be important as a cerebellar connection in animal behavior would be interesting.

      Since the multi-synaptic pathway from the cerebellum to S1 involves several brain regions with their own inputs and modulatory influences, it seems plausible to assume that behaviors controlled by these regions or affecting signaling pathways that regulate them would show some level of interaction. Our study does not address these interactions, but this will be an interesting question to be addressed in future work.

      Reviewer #3 (Recommendations for the authors):

      General comments on the data presentation:

      I'm not a huge fan of taking areas under curves ('AUC' throughout the study) when the integral of the quantity has no physical meaning - 'normalizing' the AUC (1I,L etc) is even stranger, because of course if you instead normalize the AUC by the # of data points, you literally just get the mean (which is probably what should be used instead).

      Indeed, AUC is equal to the average response in the time window used, multiplied by the window duration (thus, AUC is directly proportional to the mean). We choose to report AUC, a descriptive statistic, rather than the mean within this window. In 1I and L, we normalize the AUC across animals, essentially removing the variability across animals in the ‘Pre’ condition for visualization. Note that the significance of these comparisons is consistent whether or not we normalize to the ‘Pre’ condition (non-normalized RWS data in I shows a significant increase in PN activity, p = 0.0068, signrank test; non-normalized RWS+CF data in I shows a significant decrease in PN activity, p = 0.0135, paired t-test; non-normalized RWS data in L shows a significant decrease in IN activity, p < 0.001, paired t-test; non-normalized RWS+CF data in L shows no significant change in IN activity, p = 0.7789, paired t-test).
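
      The proportionality between AUC and the mean can be checked with a small sketch. It assumes a uniformly sampled trace and a rectangle-rule (Riemann sum) AUC; the trace values, the sampling interval `dt`, and the function name `rectangle_auc` are hypothetical and are not taken from the manuscript.

```python
import math

def rectangle_auc(values, dt):
    """Rectangle-rule area under a uniformly sampled trace (assumed AUC definition)."""
    return sum(values) * dt

# Hypothetical response trace (arbitrary units) sampled every 0.1 s.
trace = [0.0, 0.2, 0.5, 0.4, 0.1]
dt = 0.1

auc = rectangle_auc(trace, dt)
mean_response = sum(trace) / len(trace)
duration = len(trace) * dt

# AUC equals the mean multiplied by the window duration.
print(math.isclose(auc, mean_response * duration))
```

      Because the window duration is the same in every condition, normalizing the AUC to a baseline condition gives the same ratio as normalizing the mean.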

      I think unadorned bar charts are generally excluded from most journals now. Consider replacing these with something that shows the raw datapoints if not too many, or the distribution across points.

      We have replaced bar charts with box plots and violin plots. We have avoided plotting individual data points due to the quantity of points.

      In various places, the statistics produce various questionable outcomes that will draw unwanted reader scrutiny. Many of the examples below involve tiny differences in means with overlapping error bars that are "significant" or a few cases of nonoverlapping error bars that are "not significant." I think replacing the bar charts may help to resolve things here if we can see the whole distribution or the raw data points. As importantly, I think a big problem is that the statistical tests all seem to be nonparametric (they are ambiguously described in Table S3 as "Wilcoxon," which should be clarified, since there is an unpaired Wilcoxon test [rank sum] and a paired Wilcoxon test [sign rank]), and thus based on differences in the *median* whereas the bar charts are based on the *mean* (and SEM rather than MAD or IQR or other median-appropriate measure of spread). This should be fixed (either change the test or change the plots), which will hopefully allay many of the items below.

      We thank the reviewer for this important point. As mentioned in the Statistics and quantification section, Wilcoxon signed rank tests were used for non-normal data. We have replaced the bar charts with box plots, which show the IQR and median, which indeed allays many of the items below.
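
      The difference between the two Wilcoxon tests raised by the reviewer can be illustrated with a minimal sketch that computes only the test statistics, not p-values. The data and the helper names (`ranks`, `signed_rank_statistic`, `rank_sum_statistic`) are hypothetical and do not come from the manuscript.

```python
def ranks(values):
    """1-based ranks of values, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out

def signed_rank_statistic(pre, post):
    """Paired Wilcoxon (signed rank): sum of ranks of |post - pre| over positive differences."""
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    r = ranks([abs(d) for d in diffs])
    return sum(rank for rank, d in zip(r, diffs) if d > 0)

def rank_sum_statistic(group_a, group_b):
    """Unpaired Wilcoxon (rank sum): sum of the pooled ranks that belong to group_a."""
    pooled = ranks(list(group_a) + list(group_b))
    return sum(pooled[:len(group_a)])

# Hypothetical paired measurements: every unit increases slightly.
pre = [10, 20, 30, 40, 50, 60]
post = [13, 21, 34, 42, 56, 65]

print(signed_rank_statistic(pre, post))  # 21.0, the maximum possible for n = 6
print(rank_sum_statistic(post, pre))     # 42.0, close to the null expectation of 39
```

      In this toy case the paired statistic is as extreme as it can be, because every pair moves in the same direction, while the unpaired statistic stays near its null expectation, because the spread between units is much larger than the shift. This is one way small mean differences with overlapping error bars can still be significant in a paired test.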

      Here are some specific points on the statistics presentation:

      (1) 1G, the test says that following RWS+CF, the decrease in PN response is not significant. In 1I, the same data, but now over time, shows a highly significant decrease. This probably means that either the first test should be reconsidered (was this a paired comparison, which would "build in" the normalization subsequently used automatically?) or the second test should be reconsidered. It's especially strange because the n value in G, if based on cells, would seem to be ~50-times higher than that in I if based on mice.

      In Figure 1G, the analysis tests whether individual pyramidal neurons significantly changed their responses before vs. after RWS+CF stimulation. This is a paired comparison at the single-cell level, and here indicates that the average per-neuron response did not reliably decrease after RWS+CF when comparing each cell’s pre- and post-values directly. In contrast, Figure 1I examines the same dataset analyzed across time bins using a two-way ANOVA, which tests for effects of time, group (RWS vs. RWS+CF), and their interaction. The analysis showed a significant group effect (p < 0.001), indicating that the overall level of activity across all time points differed between RWS and RWS+CF conditions. The difference in significance between these two analyses arises because the first test (Fig. 1G) assesses within-neuron changes (paired), whereas the second test (Fig. 1I) assesses overall population-level differences between groups over time (independent groups). Thus, the tests address related but distinct questions—one about per-cell response changes, the other about how activity differs across experimental conditions.

      (2) 1J RWS+CF then shows a much smaller difference with overlapping error bars than the ns difference with nonoverlapping errors in 1G, but J gets three asterisks (same n-values).

      Bar graphs have been replaced with box plots.

      (3) 1K, it is very unclear what under the asterisk could possibly be significant here, since the black and white dots overlap and trade places multiple times.

      See response to point 1. A significant group effect will exist if the aggregate difference across all time bins exceeds within-group variability. The asterisk therefore reflects a statistically significant main group effect (RWS versus RWS+CF) rather than differences at any single time point. Note, however, the very small effect size here.

      (4) 2B, 2G, 2H, 2I, 3G, 3H, 5C etc, again, significance with overlapping error bars, see suggestions above.

      Bar graphs have been replaced with box plots.

      (5) Time windows: e.g., L149-153 / 2B - this section reads weirdly. I think it would be less offputting to show a time-varying significance, if you want to make this point (there are various approaches to this floating around), or a decay rate, or something else.

      Here, we wanted to understand the overall direction of influence of CFs on VIP activity. We find that CFs exert a suppressive effect on VIP activity, which is statistically significant in this later time window. The specific effect of CF modulation on the activity of S1 neurons across multiple time points will be described in more detail in future investigations.

      (6) 4G, 6I, these asterisks again seem impossible (as currently presented).

      Bar graphs have been replaced with box plots.

      The writing is generally in ok shape, but needs tightening/clarifying:

      (1) L45 "mechanistic capacity" not clear.

      We have simplified this term to “capacity.” We use the term here to express that the central question we pose is whether CF signals are able to impact S1 circuits. We demonstrate CF signals indeed influence S1 circuits and further describe the mechanism through which this occurs, but we do not yet know all of the natural conditions in which this may occur. We feel that “capacity” describes the question we pose -- and our findings -- very well.

      (2) L48-58 there's a lot of material here, not clear how much is essential to the present study.

      We would like to give an overview of the literature on instructive CF signaling within the cerebellum. Here, we feel it is important to describe how CFs supervise learning in the cerebellum via coincident activation of parallel fiber inputs and CF inputs. Our results demonstrate CFs have the capacity to supervise learning in the neocortex in a similar manner, as coincident CF activation with sensory input modulates plasticity of S1 neurons.

      (3) L59 "has the capacity to" maybe just "can".

      This has been adopted. We agree that “can” is a more straightforward way of saying “has the capacity to” here. In this sentence, “can” and “has the capacity to” both mean a general ability to do something, without explicit knowledge about the conditions of use.

      (4) L61-62 some of this is circular "observation that CF regulates plasticity in S1..has consequences for plasticity in S1".

      We have now changed this to read “…consequences for input processing in S1.”

      (5) L91 "already existing whisker input" although I get it, strictly speaking, not clear what this means.

      This sentence has been reworded for clarity.

      (6) L94 "this form of plasticity" what form?

      Edited to read “sensory-evoked plasticity.”

      (7) L119 should say "to test the".

      This has been corrected.

      (8) L120 should say "well-suited to measure receptive fields".

      We agree; this wording has been adopted.

      (9) L130 should say "optical imaging demonstrated that receptive field".

      This has been adopted.

      (10) L138, the disclaimer is helpful, but wouldn't it be less confusing to just pick a different set of terms? Response potentiation etc.

      Perhaps, but we want to stress that components of LTP and LTD (traditionally tested using electrophysiological methods to specifically measure synaptic gain changes) can be optically measured as long as it is specified what is recorded.

      (11) L140, this whole section is not very clear. What was the experiment? What was done and how?

      The text in this section has been updated.

      (12) L154, 156, 158, 160, 960, what is a "basic response"? Is this supposed to contrast with RWS? If so, I would just say "we measured the response to whisker stimulation without first performing RWS, and compared this to the whisker stimulation with simultaneous CF activation."

      What we meant by “basic response” was the acute response of S1 neurons to a single 100 ms air puff. Here, we indeed measured the acute responses of S1 neurons to whisker stimulation (100 ms air puff) and compared them to whisker stimulation with simultaneous CF activation (100 ms air puff with a 50 ms light pulse; the light pulse was delayed 45 ms with respect to the air puff). This paragraph has been reworded for clarity.

      (13) L156 "comprised of a majority" unclear. You mean most of the nonspecific IN group is either PV or SST?

      Yes, that was meant here. This paragraph has been reworded for clarity.

      (14) L165 tense. "are activated" "we tested" prob should be "were activated."

      This sentence was reworded.

      (15) L173 Not requesting additional experiments, but demonstrating that the effect is mimicked by directly activating SST or suppressing VIP questions the specificity of CF activation per se, versus presumably many other pathways upstream of the same mechanisms, which might be worth acknowledging in the text.

      We indeed observe that directly activating SST or suppressing VIP neurons in S1 is sufficient to mediate the effect of CF activation on S1 pyramidal neurons, implicating SST and VIP neurons as the local effectors of CF signaling. In the text, we wrote “...the notion of sufficiency does not exclude potential effects of plasticity processes elsewhere that might well modulate effector activation in this context and others not yet tested.” Here, we mean that CFs are certainly not the only modulators of the inhibitory network in S1. One example we highlight in the discussion is that projections from M1 are known to modulate this disinhibitory VIP-to-SST-to-PN microcircuit in S1. We conclude from our chemogenetic manipulation experiments that CFs ultimately have the capacity to modulate S1 interneurons, which must occur indirectly (either through the thalamus or “upstream” regions as this reviewer points out). The fact that many other brain regions may also modulate the interneuron network in S1 -- or be modulated by CF activity themselves -- only expands the capacity of CFs to exert a variety of effects on S1 neurons in different contexts.

      (16) L247 "induced ChR2" awkward.

      We changed this to read “we expressed ChR2.”

      (17) 6C, what are the three colors supposed to represent?

      We apologize for the missing labels in this version of the manuscript. Figure 6C and the figure legend have been updated.

  3. social-media-ethics-automation.github.io
    1. 21.6. Bibliography# [u1] Plato. Phaedrus: Translated by Benjamin Jowett. January 2013. Page Version ID: 1189255462. [u2] Luddite. December 2023. Page Version ID: 1189255462. URL: https://en.wikipedia.org/w/index.php?title=Luddite&oldid=1189255462 (visited on 2023-12-10). [u3] Ted Chiang. Will A.I. Become the New McKinsey? The New Yorker, May 2023. URL: https://www.newyorker.com/science/annals-of-artificial-intelligence/will-ai-become-the-new-mckinsey (visited on 2023-12-10). [u4] xkcd comics. The Pace of Modern Life. June 2013. URL: https://xkcd.com/1227/ (visited on 2023-12-10). [u5] xkcd comics. 1227: The Pace of Modern Life - explain xkcd. June 2013. URL: https://www.explainxkcd.com/wiki/index.php/1227:_The_Pace_of_Modern_Life (visited on 2023-12-10). [u6] Steven Spielberg. Jurassic Park. June 1993. URL: https://www.imdb.com/title/tt0107290/. [u7] Alex Blechman [@AlexBlechman]. Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus. November 2021. URL: https://twitter.com/AlexBlechman/status/1457842724128833538 (visited on 2023-12-10). [u8] Silicon Valley. April 2014. URL: https://www.imdb.com/title/tt2575988/. [u9] Eli Whitney. December 2023. Page Version ID: 1189351897. URL: https://en.wikipedia.org/w/index.php?title=Eli_Whitney&oldid=1189351897 (visited on 2023-12-10). [u10] Alfred Nobel. December 2023. Page Version ID: 1189282550. URL: https://en.wikipedia.org/w/index.php?title=Alfred_Nobel&oldid=1189282550 (visited on 2023-12-10). [u11] Einstein and the Manhattan Project. URL: https://www.amnh.org/exhibitions/einstein/peace-and-war/the-manhattan-project (visited on 2023-12-10). [u12] Steve Krenzel [@stevekrenzel]. With Twitter's change in ownership last week, I'm probably in the clear to talk about the most unethical thing I was asked to build while working at Twitter. 🧵. November 2022. 
URL: https://twitter.com/stevekrenzel/status/1589700721121058817 (visited on 2023-12-10). [u13] Britney Nguyen. Ex-Twitter engineer says he quit years ago after refusing to help sell identifiable user data, worries Elon Musk will 'do far worse things with data'. November 2022. URL: https://www.businessinsider.com/former-twitter-engineer-worried-how-elon-musk-treat-user-data-2022-11 (visited on 2023-12-10). [u14] Alphabet Workers Union-Communications Workers of America Local 9009. Our People: Workers are coming together to build power across Alphabet. URL: https://www.alphabetworkersunion.org/our-people (visited on 2023-12-10). [u15] Jason Parham. A People’s History of Black Twitter, Part I. Wired, July 2021. URL: https://www.wired.com/story/black-twitter-oral-history-part-i-coming-together/ (visited on 2023-12-10). [u16] Jason Parham. There Is No Replacement for Black Twitter. Wired, November 2022. URL: https://www.wired.com/story/black-twitter-elon-musk/ (visited on 2023-12-10). [u17] Catherine Buni. Media, company, behemoth: What, exactly, is Facebook? November 2016. URL: https://www.theverge.com/2016/11/16/13655102/facebook-journalism-ethics-media-company-algorithm-tax (visited on 2023-12-10). [u18] Rafi Letzter. A teenager on TikTok disrupted thousands of scientific studies with a single video. September 2021. URL: https://www.theverge.com/2021/9/24/22688278/tiktok-science-study-survey-prolific (visited on 2023-12-10). [u19] Catherine D'Ignazio and Lauren F. Klein. Data Feminism. Strong Ideas. MIT Libraries Experimental Collections Fund, Cambridge, 1 edition, 2020. ISBN 978-0-262-04400-4. URL: https://direct.mit.edu/books/oa-monograph/4660/Data-Feminism, doi:10.7551/mitpress/11805.001.0001. [u20] Janet Abbate. Recoding Gender: Women's Changing Participation in Computing. MIT Press, Cambridge, UNITED STATES, 2012. ISBN 978-0-262-30546-4. URL: http://ebookcentral.proquest.com/lib/washington/detail.action?docID=3339524 (visited on 2023-12-10). [u21] Mar Hicks. 
Programmed Inequality: How Britain Discarded Women Technologists and Lost Its Edge in Computing. MIT Press, Cambridge, UNITED STATES, 2017. ISBN 978-0-262-34294-0. URL: http://ebookcentral.proquest.com/lib/washington/detail.action?docID=6246618 (visited on 2023-12-10). [u22] Charlton D. McIlwain. Black software: the internet and racial justice, from the AfroNet to Black Lives Matter. 2020. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162262159401452. [u23] Simone Browne. Dark Matters: On the Surveillance of Blackness. Duke University Press, September 2015. ISBN 978-0-8223-7530-2. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99161921055701452 (visited on 2023-12-10), doi:10.1215/9780822375302. [u24] Safiya Umoja Noble. Algorithms of Oppression: How Search Engines Reinforce Racism. New York University Press, New York, UNITED STATES, 2018. ISBN 978-1-4798-3364-1. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162068349301452 (visited on 2023-12-10). [u25] Shalini Kantayya. Coded Bias. November 2020. URL: https://www.netflix.com/title/81328723 (visited on 2023-12-10). [u26] Tarleton Gillespie. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, New Haven, UNITED STATES, 2018. ISBN 978-0-300-23502-9. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162362661601452 (visited on 2023-12-10). [u27] Sarah T. Roberts. Behind the screen: content moderation in the shadows of social media. 2019. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162217744201452. [u28] Jean Burgess, Alice Marwick, and Thomas Poell. The SAGE Handbook of Social Media. SAGE Publications, 55 City Road, London, 2018. 
URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162105658401452 (visited on 2023-12-10), doi:10.4135/9781473984066. [u29] Yuri Takhteyev. Coding Places: Software Practice in a South American City. The MIT Press, September 2012. ISBN 978-0-262-30559-4. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99161981926801452 (visited on 2023-12-10), doi:10.7551/mitpress/9109.001.0001. [u30] Virginia Eubanks. Automating inequality: how high-tech tools profile, police, and punish the poor. 2018. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162064355601452. [u31] Mary L. Gray and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt Publishing Company, Boston, United States, 2019. ISBN 978-1-328-56628-7. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162207131801452 (visited on 2023-12-10). [u32] Shoshana Zuboff. The age of surveillance capitalism: the fight for a human future at the new frontier of power. 2019. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162177355601452. [u33] Cathy O'Neil. Weapons of math destruction: how big data increases inequality and threatens democracy. 2016. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99161951137601452. [u34] Sasha Costanza-Chock. Design justice: community-led practices to build the worlds we need. Information policy series. The MIT Press, Cambridge, Massachesetts, 2020. ISBN 978-0-262-35686-2. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162363060401452. [u35] Thomas S. Mullaney, Benjamin Peters, Mar Hicks, and Kavita Philip. Your computer is on fire. The MIT Press, Cambridge, Massachusetts, 2021. 
ISBN 978-0-262-36077-7. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162423945901452, doi:10.7551/mitpress/10993.001.0001. [u36] Sara Wachter-Boettcher. Technically wrong: sexist apps, biased algorithms, and other threats of toxic tech. October 2018. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99329653362401451. [u37] Saunders, Joe and Carl Fox, editors. Media Ethics, Free Speech, and the Requirements of Democracy. Routledge, New York, December 2018. ISBN 978-0-203-70244-4. URL: https://www.taylorfrancis.com/books/edit/10.4324/9780203702444/media-ethics-free-speech-requirements-democracy-carl-fox-joe-saunders, doi:10.4324/9780203702444. [u38] Ruha Benjamin. Viral Justice: How We Grow the World We Want. Princeton University Press, October 2022. ISBN 978-0-691-22288-2. URL: https://press.princeton.edu/books/hardcover/9780691222882/viral-justice (visited on 2023-12-10). [u39] Meta for Developers. 2023. URL: https://developers.facebook.com/ (visited on 2023-12-10). [u40] API Reference — Facebook SDK for Python 4.0.0-pre documentation. 2015. URL: https://facebook-sdk.readthedocs.io/en/latest/api.html (visited on 2023-12-10). [u41] TikTok for Developers. 2023. URL: https://developers.tiktok.com/ (visited on 2023-12-10). [u42] Getting started with Official Account Developer Mode. January 2013. URL: https://developers.weixin.qq.com/doc/offiaccount/en/Getting_Started/Getting_Started_Guide.html (visited on 2023-12-10).

      After checking out Coded Bias, I was honestly surprised how much everyday technology relies on algorithms that were never tested on diverse groups of people. The documentary shows how facial recognition failed on darker-skinned women, which made me think about how “neutral” tech isn’t neutral at all. What really got me is how the developers didn’t seem to think about these consequences until people called them out. It connects perfectly to the chapter’s theme that innovation often ignores ethics until harm already happens. It also made me wonder how many other systems we use every day have hidden biases we just haven’t noticed yet.

    2. Ted Chiang. Will A.I. Become the New McKinsey? The New Yorker, May 2023. URL: https://www.newyorker.com/science/annals-of-artificial-intelligence/will-ai-become-the-new-mckinsey (visited on 2023-12-10).

      I appreciate how Chiang reframes the fear of AI “taking over” by comparing it to management-consulting logic rather than superintelligence. His argument that powerful institutions often use technology as a justification for harmful decisions — rather than technology making those decisions itself — really stuck with me. It made me think about how often companies claim, “The algorithm says we have to do this,” the same way executives once said, “McKinsey says we have to cut costs.”

    1. “I believe that this is an important test of the separation of church and state as we may see in our lifetime—as important a test—and it is critically important that we get it right” (Bloomberg). His argument that the government should not prohibit people from worshiping as they wish could have been made without these exigent circumstances, but their inclusion changes the tone from one of a defensive posture to a more vigorous one.

      I think that the separation of church and state is an important standard that our government should follow. I also think that the way the writer uses the text to show this really helps to prove that point.

    1. Author response:

      The following is the authors’ response to the original reviews

      We would like to thank all reviewers for their constructive and in-depth reviews. Thanks to your feedback, we realized that the main objective of the paper was not presented clearly enough, and that our use of the same “modality-agnostic” terminology for both decoders and representations caused confusion. We addressed these two major points as outlined in the following. 

      In the revised manuscript, we highlight that the main contribution of this paper is to introduce modality-agnostic decoders. Apart from introducing this new decoder type, we put forward their advantages in comparison to modality-specific decoders in terms of decoding performance and analyze the modality-invariant representations (cf. updated terminology in the following paragraph) that these decoders rely on. The dataset that these analyses are based on is released as part of this paper, in the spirit of open science (but this dataset is only a secondary contribution for our paper). 

      Regarding the terminology, we clearly define modality-agnostic decoders as decoders that are trained on brain imaging data from subjects exposed to stimuli in multiple modalities. The decoder is not given any information on which modality a stimulus was presented in, and is therefore trained to operate in a modality-agnostic way. In contrast, modality-specific decoders are trained only on data from a single stimulus modality. These terms are explained in Figure 2. While these terms describe different ways in which decoders can be trained, there are also different ways to evaluate them afterwards (see also Figure 3); but obviously, this test-time evaluation does not change the nature of the decoder, i.e., there is no contradiction in applying a modality-specific decoder to brain data from a different modality.

      Further, we identify representations that are relevant for modality-agnostic decoders using the searchlight analysis. We realized that our choice of using the same “modality-agnostic” term to describe these brain representations created unnecessary debate and confusion. In order to not conflate the terminology, in the updated manuscript we call these representations modality-invariant (and the opposite modality-dependent). Our methodology does not allow us to distinguish whether certain representations merely share representational structure to a certain degree, or are truly representations that abstract away from any modality-dependent information. However, in order to be useful for modality-agnostic decoding, a significant degree of shared representational structure is sufficient, and it is this property of brain representations that we now define as “modality-invariant”. 

      We updated the manuscript in line with this new terminology and focus: in particular, the first Related Work section on Modality-invariant brain representations, as well as the Introduction and Discussion.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors introduce a densely-sampled dataset where 6 participants viewed images and sentence descriptions derived from the MS Coco database over the course of 10 scanning sessions. The authors further showcase how image and sentence decoders can be used to predict which images or descriptions were seen, using pairwise decoding across a set of 120 test images. The authors find decodable information widely distributed across the brain, with a left-lateralized focus. The results further showed that modality-agnostic models generally outperformed modality-specific models, and that data based on captions was not explained better by caption-based models but by modality-agnostic models. Finally, the authors decoded imagined scenes.

      Strengths:

      (1) The dataset presents a potentially very valuable resource for investigating visual and semantic representations and their interplay.

      (2) The introduction and discussion are very well written in the context of trying to understand the nature of multimodal representations and present a comprehensive and very useful review of the current literature on the topic.

      Weaknesses:

      (1) The paper is framed as presenting a dataset, yet most of it revolves around the presentation of findings in relation to what the authors call modality-agnostic representations, and in part around mental imagery. This makes it very difficult to assess the manuscript, whether the authors have achieved their aims, and whether the results support the conclusions.

      Thanks for this insightful remark. The dataset release is only a secondary contribution of our study; this was not clear enough in the previous version. We updated the manuscript to make the main objective of the paper more clear, as outlined in our general response to the reviews (see above).

      (2) While the authors have presented a potential use case for such a dataset, there is currently far too little detail regarding data quality metrics expected from the introduction of similar datasets, including the absence of head-motion estimates, quality of intersession alignment, or noise ceilings of all individuals.

      As already mentioned in the general response, the main focus of the paper is to introduce modality-agnostic decoders. The dataset is released in addition, which is why we did not focus on reporting extensive quality metrics in the original manuscript. To respond to your request, we updated the appendix of the manuscript to include a range of data quality metrics.

      The updated appendix includes head motion estimates in the form of realignment parameters and framewise displacement, as well as a metric to assess the quality of intersession alignment. More detailed descriptions can be found in Appendix 1 of the updated manuscript.
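
      For readers unfamiliar with the metric, framewise displacement can be sketched from the six realignment parameters as follows. This uses one widely used definition (summed absolute frame-to-frame changes, with rotations converted to millimetres on a sphere); whether Appendix 1 uses this exact definition is an assumption, as are the 50 mm radius, the parameter ordering, and the example values.

```python
import math

def framewise_displacement(params, radius_mm=50.0):
    """Framewise displacement per volume.

    params: one [tx, ty, tz, rx, ry, rz] row per volume, translations in mm
    and rotations in radians (assumed ordering and units).
    radius_mm: sphere radius used to convert rotations to mm (assumed convention).
    """
    fd = [0.0]  # the first volume has no predecessor
    for prev, cur in zip(params, params[1:]):
        translation = sum(abs(c - p) for p, c in zip(prev[:3], cur[:3]))
        rotation = sum(abs(c - p) * radius_mm for p, c in zip(prev[3:], cur[3:]))
        fd.append(translation + rotation)
    return fd

# Hypothetical realignment parameters for three volumes.
realignment = [
    [0.0, 0.00, 0.0, 0.0, 0.0, 0.000],
    [0.1, 0.00, 0.0, 0.0, 0.0, 0.002],
    [0.1, 0.05, 0.0, 0.0, 0.0, 0.002],
]
fd = framewise_displacement(realignment)
print([round(v, 3) for v in fd])
```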

      Estimating noise ceilings based on repeated presentations of stimuli (as for example done in Allen et al. (2022)) requires multiple betas for each stimulus. All training stimuli were only presented once, so this could only be done for the test stimuli which were presented repeatedly. However, during our preprocessing procedure we directly calculated stimulus-specific betas based on data from all sessions using one single GLM, which means that we did not obtain separate betas for repeated presentations of the same stimulus. We will however share the raw data publicly, so that such noise ceilings can be calculated using an adapted preprocessing procedure if required.

      Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., Nau, M., Caron, B., Pestilli, F., Charest, I., Hutchinson, J. B., Naselaris, T., & Kay, K. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116–126. https://doi.org/10.1038/s41593-021-00962-x

      (3) The exact methods and statistical analyses used are still opaque, making it hard for a reader to understand how the authors achieved their results. More detail in the manuscript would be helpful, specifically regarding the exact statistical procedures, what tests were performed across, or how data were pooled across participants.

      In the updated manuscript, we improved the level of detail for the descriptions of statistical analyses wherever possible (see also our response to your “Recommendations for the authors”, Point 6).

      Regarding data pooling across participants: 

      Figure 8 shows averaged results across all subjects (as indicated in the caption).

      Regarding data pooling for the estimation of the significance threshold of the searchlight analysis for modality-invariant regions: We updated the manuscript to clarify that we performed a permutation test, combined with a bootstrapping procedure to estimate a group-level null distribution: “For each subject, we evaluated the decoders 100 times with shuffled labels to create per-subject chance-level results. Then, we randomly selected one of the 100 chance-level results for each of the 6 subjects and calculated group-level statistics (TFCE values) the exact same way as described in the preceding paragraph. We repeated this procedure 10,000 times resulting in 10,000 permuted group-level results.”

      Additionally, we indicated that the same permutation testing methods were applied to assess the significance threshold for the imagery decoding searchlight maps (Figure 10). 
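
      The construction of the group-level null distribution quoted above can be sketched as follows. This is only an illustration under stated assumptions: the per-subject chance-level maps are random placeholders, the number of vertices is arbitrary, and a simple mean across subjects stands in for the TFCE computation, which is not reproduced here. All names are hypothetical.

```python
import numpy as np

def sample_group_null(chance_maps, n_group_perms, rng):
    """Build group-level null maps from per-subject permutation results.

    chance_maps: array of shape (n_subjects, n_perms, n_vertices) with
    chance-level results obtained from shuffled labels.
    Returns the group-level null maps and the list of sampled combinations.
    """
    n_subjects, n_perms, _ = chance_maps.shape
    if n_group_perms > n_perms ** n_subjects:
        raise ValueError("not enough unique combinations available")
    seen, combos, group_null = set(), [], []
    while len(group_null) < n_group_perms:
        # Pick one of the chance-level results for each subject.
        picks = rng.integers(0, n_perms, size=n_subjects)
        key = tuple(picks.tolist())
        if key in seen:
            continue  # keep every group-level permutation unique
        seen.add(key)
        combos.append(key)
        selected = chance_maps[np.arange(n_subjects), picks]
        # Stand-in group statistic (assumption): mean across subjects
        # instead of TFCE values.
        group_null.append(selected.mean(axis=0))
    return np.stack(group_null), combos

rng = np.random.default_rng(0)
# 6 subjects, 100 shuffled-label evaluations each, 20 placeholder vertices.
chance_maps = rng.normal(0.5, 0.05, size=(6, 100, 20))
group_null, combos = sample_group_null(chance_maps, 10_000, rng)
print(group_null.shape)
```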

      (4) Many findings (e.g., Figure 6) are still qualitative but could be supported by quantitative measures.

      Figures 6 and 7 intentionally present qualitative results that support the quantitative decoding results presented in Figures 4 and 5 (see also Reviewer 2, Comment 2).

      Figures 4 and 5 show pairwise decoding accuracy as a quantitative measure for evaluation of the decoders. This metric is the main metric we used to compare different decoder types and features. Based on the finding that modality-agnostic decoders using ImageBind features achieve the best score on this metric, we performed the additional qualitative analysis presented in Figures 6 and 7. (Note that we expanded the candidate set for the qualitative analysis in order to have a larger and more diverse set of images.)
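
      For readers unfamiliar with the metric, one common formulation of pairwise accuracy is sketched below. The exact definition used in the manuscript is not restated in this response, so the choice of cosine similarity and the comparison rule are assumptions, and the function name and toy vectors are hypothetical.

```python
import numpy as np

def pairwise_accuracy(predicted, target):
    """Fraction of ordered pairs (i, j), i != j, for which prediction i is more
    similar to its own target i than to the target of stimulus j.

    predicted, target: arrays of shape (n_stimuli, n_features).
    """
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    sim = p @ t.T  # sim[i, j]: cosine similarity of prediction i and target j
    correct = sim.diagonal()[:, None] > sim
    off_diagonal = ~np.eye(sim.shape[0], dtype=bool)
    return float(correct[off_diagonal].mean())

# Toy example: predictions that are close to their own targets.
target = np.eye(3)
predicted = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.2, 1.0],
])
accuracy = pairwise_accuracy(predicted, target)
print(accuracy)
```

      Under this formulation, chance level is 0.5, which matches the interpretation of the chance-level values reported in this response.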

      (5) Results are significant in regions that typically lack responses to visual stimuli, indicating potential bias in the classifier. This is relevant for the interpretation of the findings. A classification approach less sensitive to outliers (e.g., 70-way classification) could avoid this issue. Given the extreme collinearity of the experimental design, regressors in close temporal proximity will be highly similar, which could lead to leakage effects.

      It is true that our searchlight analysis revealed significant activity in regions outside of the visual cortex. However, the processing of visual information is not assumed to stop at the border of the visual cortex: information such as the semantics of an image is progressively integrated in higher-level regions of the brain. Recent studies have shown that activity in large areas of the cortex (including many outside of the visual cortex) can be related to visual stimulation (Solomon et al. 2024; Raugel et al. 2025). Our work confirms this finding, and we therefore see no reason to believe that it is due to a bias in our decoders.

      Further, you suggest that we could replace our regression approach with a 70-way classification. However, this is difficult with our fMRI data, as we do not see a straightforward way to assign class labels to the training and testing stimuli (the two datasets consist of non-overlapping sets of naturalistic images).

      To address your concerns regarding the collinearity of the experimental design and possible leakage effects, we trained and evaluated a decoder for one subject after running a “null-hypothesis” adapted preprocessing. More specifically, for all sessions, we shifted the functional data of all runs by one run (moving the data of the last run to the very front), while leaving the design matrices in place. Thereby, we destroyed the relationship between stimuli and brain activity but kept the original data and design with its collinearity (and possible biases). We preprocessed this adapted data for subject 1, ran a whole-brain decoding using ImageBind features, and verified that the decoding performance was at chance level: Pairwise accuracy (captions): 0.43 | Pairwise accuracy (images): 0.47 | Pairwise accuracy (imagery): 0.50. This result provides evidence against the notion that potential collinearity or biases in our experimental design or evaluation procedure could have led to inflated results.
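
      The run-shifting control described above can be illustrated with a few lines of code. This is a schematic sketch only: the arrays and design labels are placeholders, the function name is hypothetical, and it assumes that all runs of a session have the same number of volumes so that each design matrix can still be applied to the shifted data.

```python
import numpy as np

def shift_runs(run_data):
    """Move the data of the last run to the front; all other runs move back by one."""
    return [run_data[-1]] + list(run_data[:-1])

# Three placeholder runs, each filled with its own index so the shift is visible.
runs = [np.full((4, 2), fill_value=i) for i in range(3)]
designs = [f"design_run_{i}" for i in range(3)]  # design matrices stay in place

shifted = shift_runs(runs)
# Each design matrix is now paired with functional data from a different run.
mismatched = list(zip(designs, shifted))
for design, data in mismatched:
    print(design, "-> data from run", int(data[0, 0]))
```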

      Raugel, J., Szafraniec, M., Vo, H.V., Couprie, C., Labatut, P., Bojanowski, P., Wyart, V. and King, J.R. (2025). Disentangling the Factors of Convergence between Brains and Computer Vision Models. arXiv preprint arXiv:2508.18226.

      Solomon, S. H., Kay, K., & Schapiro, A. C. (2024). Semantic plasticity across timescales in the human brain. bioRxiv, 2024-02.

      (6) The manuscript currently lacks a limitations section, specifically regarding the design of the experiment. This involves the use of the overly homogenous dataset Coco, which invites overfitting, the mixing of sentence descriptions and visual images, which invites imagery of previously seen content, and the use of a 1-back task, which can lead to carry-over effects to the subsequent trial.

      Regarding the dataset COCO: We agree that COCO is somewhat homogenous; it is, however, much more diverse and naturalistic than the smaller datasets used in previous fMRI experiments with multimodal stimuli. Additionally, COCO has been widely adopted as a benchmark dataset in the Machine Learning community, and features rich annotations for each image (e.g. object labels, segmentations, additional captions, people’s keypoints) facilitating many more future analyses based on our data.

      Regarding the mixing of sentence descriptions and images: Subjects were not asked to visualize sentences, and they might have used different techniques for the one-back task. Generally, we do not see it as problematic if subjects are performing visual imagery to some degree while reading sentences, and this might even be the case during normal reading as well. A more targeted experiment comparing reading with and without interleaved visual stimulation in the form of images and a one-back task would be required to assess this, but this was not the focus of our study. For now, it is true that we cannot be sure that our results generalize to cases in which subjects are just reading and are less incentivized to perform mental imagery.

      Regarding the use of a 1-back task: It was necessary to make some design choices in order to realize this large-scale data collection with approximately 10 hours of recording per subject. Specifically, the 1-back task was included in the experimental setup in order to ensure continuous engagement of the participant during the rather long sessions of 1 hour. The subjects did indeed need to remember the previous stimulus to succeed at the 1-back task, which means that some brain activity during the presentation of a stimulus is likely to be related to the previous stimulus. We aimed to account for this confound during the preprocessing stage when fitting the GLM, which was fit to capture only the response to the presented image/caption, not the preceding one. Still, it might have picked up on some of the activity from preceding stimuli, causing some decrease of the final decoding performance.

      We added a limitations section to the updated manuscript to discuss these important issues.

      (7) I would urge the authors to clarify whether the primary aim is the introduction of a dataset and showing the use of it, or whether it is the set of results presented. This includes the title of this manuscript. While the decoding approach is very interesting and potentially very valuable, I believe that the results in the current form are rather descriptive, and I'm wondering what specifically they add beyond what is known from other related work. This includes imagery-related results. This is completely fine! It just highlights that a stronger framing as a dataset is probably advantageous for improving the significance of this work.

      Thanks a lot for pointing this out. Based on this comment and feedback from the other reviewers, we restructured the abstract, introduction, and discussion sections of the paper to better reflect the primary aim (cf. general response above).

      You further mention that it is not clear what our results add beyond what is known from related work. We list the main contributions here:

      A single modality-agnostic decoder can decode the semantics of visual and linguistic stimuli irrespective of the presentation modality with a performance that is not lagging behind modality-specific decoders.

      Modality-agnostic decoders outperform modality-specific decoders for decoding captions and mental imagery.

      Modality-invariant representations are widespread across the cortex, whereas a range of previous work has suggested they were much more localized (Bright et al. 2004; Jung et al. 2018; Man et al. 2012; Simanova et al. 2014).

      Regions that are useful for imagery are largely overlapping with modality-invariant regions.

      Bright, P., Moss, H., & Tyler, L. K. (2004). Unitary vs multiple semantics: PET studies of word and picture processing. Brain and language, 89(3), 417-432.

      Jung, Y., Larsen, B., & Walther, D. B. (2018). Modality-Independent Coding of Scene Categories in Prefrontal Cortex. Journal of Neuroscience, 38(26), 5969–5981.

      Liuzzi, A. G., Bruffaerts, R., Peeters, R., Adamczuk, K., Keuleers, E., De Deyne, S., Storms, G., Dupont, P., & Vandenberghe, R. (2017). Cross-modal representation of spoken and written word meaning in left pars triangularis. NeuroImage, 150, 292–307. https://doi.org/10.1016/j.neuroimage.2017.02.032

      Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal Cortex. Journal of Neuroscience, 32(47), 16629–16636.

      Simanova, I., Hagoort, P., Oostenveld, R., & van Gerven, M. A. J. (2014). Modality-Independent Decoding of Semantic Information from the Human Brain. Cerebral Cortex, 24(2), 426–434.

      Reviewer #2 (Public review):

      Summary:

      This study introduces SemReps-8K, a large multimodal fMRI dataset collected while subjects viewed natural images and matched captions, and performed mental imagery based on textual cues. The authors aim to train modality-agnostic decoders (models that can predict neural representations independently of the input modality) and use these models to identify brain regions containing modality-agnostic information. They find that such decoders perform comparably or better than modality-specific decoders and generalize to imagery trials.

      Strengths:

      (1) The dataset is a substantial and well-controlled contribution, with >8,000 image-caption trials per subject and careful matching of stimuli across modalities - an essential resource for testing theories of abstract and amodal representation.

      (2) The authors systematically compare unimodal, multimodal, and cross-modal decoders using a wide range of deep learning models, demonstrating thoughtful experimental design and thorough benchmarking.

      (3) Their decoding pipeline is rigorous, with informative performance metrics and whole-brain searchlight analyses, offering valuable insights into the cortical distribution of shared representations.

      (4) Extension to mental imagery decoding is a strong addition, aligning with theoretical predictions about the overlap between perception and imagery.

      Weaknesses:

      While the decoding results are robust, several critical limitations prevent the current findings from conclusively demonstrating truly modality-agnostic representations:

      (1) Shared decoding ≠ abstraction: Successful decoding across modalities does not necessarily imply abstraction or modality-agnostic coding. Participants may engage in modality-specific processes (e.g., visual imagery when reading, inner speech when viewing images) that produce overlapping neural patterns. The analyses do not clearly disambiguate shared representational structure from genuinely modality-independent representations. Furthermore, in Figure 5, the modality-agnostic encoder did not perform better than the modality-specific decoder trained on images (in decoding images), but outperformed the modality-specific decoder trained on captions (in decoding captions). This asymmetry contradicts the premise of a truly "modality-agnostic" encoder. Additionally, given the similar performance between modality-agnostic decoders based on multimodal versus unimodal features, it remains unclear why neural representations did not preferentially align with multimodal features if they were truly modality-independent.

      We agree that successful modality-agnostic and cross-modal decoding does not necessarily imply that abstract patterns were decoded. In the updated manuscript, we therefore refer to these representations as modality-invariant (see also the updated terminology explained in the general response above).

      If participants are performing mental imagery when reading, and this is allowing us to perform cross-decoding, then this means that modality-invariant representations are formed during this mental imagery process, i.e. that the representations formed during this form of mental imagery are compatible with representations during visual perception (or, in your words, produce overlapping neural patterns). While we cannot know to what extent people were performing mental imagery while reading (or having inner speech while viewing images), our results demonstrate that their brain activity allows for decoding across modalities, which implies that modality-invariant representations are present.

      It is true that our current analyses cannot disambiguate modality-invariant representations (or, in your words, shared representational structure) from abstract representations (in your words, genuinely modality-independent representations). As the main goal of the paper was to build modality-agnostic decoders, and these only require what we call “modality-invariant” representations (see our updated terminology in the general reviewer response above), we leave this question open for future work. We do however discuss this important limitation in the Discussion section of the updated manuscript.

      Regarding the asymmetry of decoding results when comparing modality-agnostic decoders with the two respective modality-specific decoders for captions and images: We do not believe that this asymmetry contradicts the premise of a modality-agnostic decoder. Multiple explanations for this result are possible:

      (1) The modality-specific decoder for images might benefit from the more readily decodable lower-level modality-dependent neural activity patterns in response to images, which are less useful for the modality-agnostic decoder because they are not useful for decoding caption trials. The modality-specific decoders for captions might not be able to pick up on low-level modality-dependent neural activity patterns as these might be less easily decodable.

      (2) The signal-to-noise ratio for caption trials might be lower than for image trials (cf. generally lower caption decoding performance). Therefore, the addition of training data (even if it is from another modality) improves the decoding performance for captions, but not for images (which might be at ceiling already).

      Regarding the similar performance between modality-agnostic decoders based on multimodal versus unimodal features: Unimodal features are based on rather high-level features of the respective modality (e.g. last-layer features of a model trained for semantic image classification), which can be already modality-invariant to some degree. Additionally, as already mentioned before, in the updated manuscript we only require representations to be modality-invariant and not necessarily abstract.

      (2) The current analysis cannot definitively conclude that the decoder itself is modality-agnostic, making "Qualitative Decoding Results" difficult to interpret in this context. This section currently provides illustrative examples, but lacks systematic quantitative analyses.

      The qualitative decoding results in Figures 6 and 7 provide illustrative examples for the quantitative results presented in Figures 4 and 5 (see also Reviewer 1, Comment 4).

      Figures 4 and 5 show pairwise decoding accuracy as a quantitative measure for evaluation of the decoders. This metric is the main metric we used to compare different decoder types and features. Based on the finding that modality-agnostic decoders using ImageBind features achieve the best score on this metric, we performed the additional qualitative analysis presented in Figures 6 and 7. (Note that we expanded the candidate set for the qualitative analysis in order to have a larger and more diverse set of images.)

      (3) The use of mental imagery as evidence for modality-agnostic decoding is problematic.

      Imagery involves subjective, variable experiences and likely draws on semantic and perceptual networks in flexible ways. Strong decoding in imagery trials could reflect semantic overlap or task strategies rather than evidence of abstraction.

      It is true that mental imagery does not necessarily rely on modality-agnostic representations. In the updated manuscript we revised our terminology and refer to the analyzed representations as modality-invariant, which we define as “representations that significantly overlap between modalities”. 

      The manuscript presents a methodologically sophisticated and timely investigation into shared neural representations across modalities. However, the current evidence does not clearly distinguish between shared semantics, overlapping unimodal processes, and true modality-independent representations. A more cautious interpretation is warranted.

      Nonetheless, the dataset and methodological framework represent a valuable resource for the field.

      We fully agree with these observations, and updated our terminology as outlined in the general response.

      Reviewer #3 (Public review):

      Summary:

      The authors recorded brain responses while participants viewed images and captions. The images and captions were taken from the COCO dataset, so each image has a corresponding caption, and each caption has a corresponding image. This enabled the authors to extract features from either the presented stimulus or the corresponding stimulus in the other modality.

      The authors trained linear decoders to take brain responses and predict stimulus features.

      "Modality-specific" decoders were trained on brain responses to either images or captions, while "modality-agnostic" decoders were trained on brain responses to both stimulus modalities. The decoders were evaluated on brain responses while the participants viewed and imagined new stimuli, and prediction performance was quantified using pairwise accuracy. The authors reported the following results:

      (1) Decoders trained on brain responses to both images and captions can predict new brain responses to either modality.

      (2) Decoders trained on brain responses to both images and captions outperform decoders trained on brain responses to a single modality.

      (3) Many cortical regions represent the same concepts in vision and language.

      (4) Decoders trained on brain responses to both images and captions can decode brain responses to imagined scenes.

      Strengths:

      This is an interesting study that addresses important questions about modality-agnostic representations. Previous work has shown that decoders trained on brain responses to one modality can be used to decode brain responses to another modality. The authors build on these findings by collecting a new multimodal dataset and training decoders on brain responses to both modalities.

      To my knowledge, SemReps-8K is the first dataset of brain responses to vision and language where each stimulus item has a corresponding stimulus item in the other modality. This means that brain responses to a stimulus item can be modeled using visual features of the image, linguistic features of the caption, or multimodal features derived from both the image and the caption. The authors also employed a multimodal one-back matching task, which forces the participants to activate modality-agnostic representations. Overall, SemReps-8K is a valuable resource that will help researchers answer more questions about modality-agnostic representations.

      The analyses are also very comprehensive. The authors trained decoders on brain responses to images, captions, and both modalities, and they tested the decoders on brain responses to images, captions, and imagined scenes. They extracted stimulus features using a range of visual, linguistic, and multimodal models. The modeling framework appears rigorous, and the results offer new insights into the relationship between vision, language, and imagery. In particular, the authors found that decoders trained on brain responses to both images and captions were more effective at decoding brain responses to imagined scenes than decoders trained on brain responses to either modality in isolation. The authors also found that imagined scenes can be decoded from a broad network of cortical regions.

      Weaknesses:

      The characterization of "modality-agnostic" and "modality-specific" decoders seems a bit contradictory. There are three major choices when fitting a decoder: the modality of the training stimuli, the modality of the testing stimuli, and the model used to extract stimulus features. However, the authors characterize their decoders based on only the first choice: "modality-specific" decoders were trained on brain responses to either images or captions, while "modality-agnostic" decoders were trained on brain responses to both stimulus modalities. I think that this leads to some instances where the conclusions are inconsistent with the methods and results.

      In our analysis setup, a decoder is entirely determined by two factors: (1) the modality of the stimuli that the subject was exposed to, and (2) the machine learning model used to extract stimulus features.

      The modality of the testing stimuli defines whether we are evaluating the decoder in a within-modality or cross-modality setting, but is not an inherent characteristic of a trained decoder.
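
      To make this distinction concrete, the following sketch fits three linear decoders on placeholder data. It assumes ridge regression in closed form as the linear decoder (the reviewers describe the decoders as linear; the specific estimator, the regularization strength, the array sizes, and all variable names are assumptions for illustration), and it uses random numbers in place of real brain responses and stimulus features.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^(-1) X^T Y."""
    n_voxels = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_voxels), X.T @ Y)

rng = np.random.default_rng(0)
n_trials, n_voxels, n_features = 200, 50, 8

# Placeholder brain responses and stimulus features per training modality.
brain_images = rng.normal(size=(n_trials, n_voxels))
brain_captions = rng.normal(size=(n_trials, n_voxels))
feats_images = rng.normal(size=(n_trials, n_features))
feats_captions = rng.normal(size=(n_trials, n_features))

# Modality-specific decoders: trained on responses to a single modality.
W_images = fit_ridge(brain_images, feats_images)
W_captions = fit_ridge(brain_captions, feats_captions)

# Modality-agnostic decoder: trained jointly on responses to both modalities.
W_agnostic = fit_ridge(
    np.vstack([brain_images, brain_captions]),
    np.vstack([feats_images, feats_captions]),
)

# The test modality only determines the evaluation setting; the same trained
# weights are applied to whatever test responses are provided.
test_brain = rng.normal(size=(10, n_voxels))
predicted_features = test_brain @ W_agnostic
print(predicted_features.shape)
```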

      First, the authors suggest that "modality-specific decoders are not explicitly encouraged to pick up on modality-agnostic features during training" (line 137) while "modality-agnostic decoders may be more likely to leverage representations that are modality-agnostic" (line 140). However, whether a decoder is required to learn modality-agnostic representations depends on both the training responses and the stimulus features. Consider the case where the stimuli are represented using linguistic features of the captions. When you train a "modality-specific" decoder on image responses, the decoder is forced to rely on modality-agnostic information that is shared between the image responses and the caption features. On the other hand, when you train a "modality-agnostic" decoder on both image responses and caption responses, the decoder has access to the modality-specific information that is shared by the caption responses and the caption features, so it is not explicitly required to learn modality-agnostic features. As a result, while the authors show that "modality-agnostic" decoders outperform "modality-specific" decoders in most conditions, I am not convinced that this is because they are forced to learn more modality-agnostic features.

      It is true that, for example, a modality-specific decoder trained on fMRI data from images with stimulus features extracted from captions might also rely on modality-invariant features. We still call this decoder modality-specific, as it has been trained to decode brain activity recorded from a specific stimulus modality. In the updated manuscript we corrected the statement that “modality-specific decoders are not explicitly encouraged to pick up on modality-invariant features during training” to include the case of decoders trained on features from the other modality which might also rely on modality-invariant features.

      It is true that a modality-agnostic decoder can also have access to modality-dependent information for captions and images. However, as it is trained jointly with both modalities and the modality-dependent features are not compatible, it is encouraged to rely on modality-invariant features. The result that modality-agnostic decoders outperform modality-specific decoders trained on captions for decoding captions confirms this, because if the decoder were relying only on modality-dependent features, the addition of training data from another stimulus modality could not increase the performance. (Also, the lack of a performance drop compared to modality-specific decoders trained on images is only possible thanks to the reliance on modality-invariant features. If the decoder relied only on modality-dependent features, the addition of data from another modality would amount to adding noise to the training data, which would necessarily result in a performance drop at test time.) We cannot exclude the possibility that modality-agnostic decoders are also relying on modality-dependent features, but our results suggest that they are relying at least to some degree on modality-invariant features.

      Second, the authors claim that "modality-specific decoders can be applied only in the modality that they were trained on", while "modality-agnostic decoders can be applied to decode stimuli from multiple modalities, even without knowing a priori the modality the stimulus was presented in" (line 47). While "modality-agnostic" decoders do outperform "modality-specific" decoders in the cross-modality conditions, it is important to note that "modality-specific" decoders still perform better than expected by chance (figure 5). It is also important to note that knowing about the input modality still improves decoding performance even for "modality-agnostic" decoders, since it determines the optimal feature space: it is better to decode brain responses to images using decoders trained on image features, and it is better to decode brain responses to captions using decoders trained on caption features.

      Thanks for this important remark. We corrected this statement and now refer to “modality-specific decoders that are trained to be applied only in the modality that they were trained on”, highlighting that their training process optimizes them for decoding in a specific modality. They can indeed be applied to the other modality at test time, but this results in a substantial performance drop.

      It is true that knowing the input modality can improve performance even for modality-agnostic decoders. This can most likely be explained by the fact that in that case the decoder can leverage both modality-invariant and modality-dependent features. However, we will not focus further on this result, as the main motivation to build modality-agnostic decoders is to be able to decode stimuli without knowing the stimulus modality a priori.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I will list additional recommendations below in no specific order:

      (1) I find the term "modality agnostic" quite unusual, and I believe I haven't seen it used outside of the ML community. I would urge the authors to change the terminology to be more common, or at least very early explain why the term is much better suited than the range of existing terms. A modality agnostic representation implies that it is not committed to a specific modality, but it seems that a representation cannot be committed to something.

      In the updated manuscript we now refer to the identified brain patterns as modality-invariant, a term which has previously been used in the literature (Man et al. 2012; Devereux et al. 2013; Patterson & Lambon Ralph 2016; Deniz et al. 2019; Nakai et al. 2021) (see also the general response above and the Introduction and Related Work sections of the updated manuscript).

      We continue to refer to the decoders as modality-agnostic, as this is a new type of decoder, and describes the fact that they are trained in a way that abstracts away from the modality of the stimuli. We chose this term as we are not aware of any work in which brain decoders were trained jointly on multiple stimulus modalities and in order not to risk contradictions/confusions with other definitions.

      Deniz, F., Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). The Representation of Semantic Information Across Human Cerebral Cortex During Listening Versus Reading Is Invariant to Stimulus Modality. Journal of Neuroscience, 39(39), 7722–7736. https://doi.org/10.1523/JNEUROSCI.0675-19.2019

      Devereux, B. J., Clarke, A., Marouchos, A., & Tyler, L. K. (2013). Representational Similarity Analysis Reveals Commonalities and Differences in the Semantic Processing of Words and Objects. The Journal of Neuroscience, 33(48).

      Nakai, T., Yamaguchi, H. Q., & Nishimoto, S. (2021). Convergence of Modality Invariance and Attention Selectivity in the Cortical Semantic Circuit. Cerebral Cortex, 31(10), 4825–4839. https://doi.org/10.1093/cercor/bhab125

      Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal Cortex. Journal of Neuroscience, 32(47), 16629–16636.

      Patterson, K., & Lambon Ralph, M. A. (2016). The Hub-and-Spoke Hypothesis of Semantic Memory. In Neurobiology of Language (pp. 765–775). Elsevier. https://doi.org/10.1016/B978-0-12-407794-2.00061-4

      (2) The table in Figure 1B would benefit from also highlighting the number of stimuli that have overlapping captions and images.

      The number of overlapping stimuli is rather small (153-211 stimuli depending on the subject). We added this information to the table in Figure 1B.

      (3) The authors wrote that training stimuli were presented only once, yet they used a one-back task. Did the authors also exclude the first presentation of these stimuli?

      Thanks for pointing this out. It is indeed true that some training stimuli were presented more than once, but only for the case of one-back target trials. In these cases the second presentation of the stimulus was excluded, but not the first. As the subject cannot be aware of the fact that the upcoming presentation is going to be a one-back target, the first presentation cannot be affected by the presence of the subsequent repeated presentation. We updated the manuscript to clarify this issue.
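
      The exclusion rule can be sketched as a simple filter over the trial sequence. The sketch assumes that a one-back target is a trial whose underlying item matches the item of the immediately preceding trial, regardless of modality; the trial encoding, the item identifiers, and the function name are hypothetical.

```python
def drop_one_back_targets(trials):
    """Return indices of trials to keep, excluding the second presentation
    of an item that directly repeats the preceding trial's item.

    trials: list of (item_id, modality) tuples in presentation order.
    """
    kept = []
    previous_item = None
    for index, (item_id, _modality) in enumerate(trials):
        if item_id != previous_item:
            kept.append(index)  # first presentation: always retained
        previous_item = item_id
    return kept

# Toy sequence: items 7 and 33 are each followed by a one-back target trial.
sequence = [
    (12, "image"),
    (7, "caption"),
    (7, "image"),
    (33, "image"),
    (33, "caption"),
    (90, "caption"),
]
kept_trials = drop_one_back_targets(sequence)
print(kept_trials)
```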

      (4) Coco has roughly 80-90 categories, so many image captions will be extremely similar (e.g., "a giraffe walking", "a surfer on a wave", etc.). How can people keep these apart?

      It is true that some captions and images are highly similar even though they are not matching in the dataset. This might result in several false button presses because the subjects identified an image-caption pair as matching when it was not intended as a match. However, as no feedback was given on task performance, this issue should not have had a major influence on the brain activity of the participants.

      (5) Footnotes for statistics are quite unusual - could the authors integrate statistics into the text?

      Thanks for this remark. In the updated manuscript, all statistics are part of the main text.

      (6) It may be difficult to achieve the assumptions of a permutation test - exchangeability, which may bias statistical results. It is not uncommon for densely sampled datasets to use bootstrap sampling on the predictions of the test data to identify if a given percentile of that distribution crosses 0. The lowest p-value is given by the number of bootstrap samples (e.g., if all 10,000 bootstrap samples are above chance, then p < 0.0001). This may turn out to be more effective.

      Thanks for this comment. Our statistical procedure did in fact involve a bootstrapping procedure to generate a null distribution on the group level. We updated the manuscript to describe this method in more detail. Here is the updated paragraph: “To estimate the statistical significance of the resulting clusters we performed a permutation test, combined with a bootstrapping procedure to estimate a group-level null distribution (see also Stelzer et al., 2013). For each subject, we evaluated the decoders 100 times with shuffled labels to create per-subject chance-level results. Then, we randomly selected one of the 100 chance-level results for each of the 6 subjects and calculated group-level statistics (TFCE values) the exact same way as described in the preceding paragraph. We repeated this procedure 10,000 times resulting in 10,000 permuted group-level results. We ensured that every permutation was unique, i.e. no two permutations were based on the same combination of selected chance-level results. Based on this null distribution, we calculated p-values for each vertex by calculating the proportion of sampled permutations where the TFCE value was greater than the observed TFCE value. To control for multiple comparisons across space, we always considered the maximum TFCE score across vertices for each group-level permutation (Smith and Nichols, 2009).”
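
The procedure quoted above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_null_pvalues`, the array layout, and the use of a plain across-subject mean in place of TFCE are all assumptions made here to keep the example short. Only the overall logic (one randomly chosen chance-level result per subject, unique combinations, maximum statistic across vertices) follows the description.

```python
import numpy as np

def group_null_pvalues(observed, chance, n_perm=10_000, seed=0):
    """Hypothetical helper, not taken from the manuscript.

    observed: (n_subjects, n_vertices) true decoding scores.
    chance:   (n_subjects, n_shuffles, n_vertices) shuffled-label scores.
    Returns one p-value per vertex, corrected with the max statistic.
    """
    rng = np.random.default_rng(seed)
    n_subjects, n_shuffles, _ = chance.shape
    if n_perm > n_shuffles ** n_subjects:
        raise ValueError("not enough unique combinations of chance-level results")

    # Stand-in group statistic: mean across subjects (the manuscript uses TFCE).
    observed_stat = observed.mean(axis=0)

    subjects = np.arange(n_subjects)
    seen = set()                 # combinations already used, to keep them unique
    max_null = np.empty(n_perm)
    i = 0
    while i < n_perm:
        # Pick one of the shuffled-label results for every subject.
        picks = rng.integers(0, n_shuffles, size=n_subjects)
        key = tuple(picks.tolist())
        if key in seen:
            continue
        seen.add(key)
        group_stat = chance[subjects, picks].mean(axis=0)
        # Keep only the maximum across vertices (multiple-comparison control).
        max_null[i] = group_stat.max()
        i += 1

    # Proportion of null maxima that are greater than the observed value.
    return (max_null[:, None] > observed_stat[None, :]).mean(axis=0)
```

Because each vertex is compared against the distribution of maxima over all vertices, the resulting p-values are controlled across space in the sense described in the quoted paragraph.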

      (7) The authors present no statistical evidence for some of their claims (e.g., lines 335-337). It would be good if they could complement this in their description. Further, the visualization in Figure 4 is rather opaque. It would help if the authors could add a separate bar for the average modality-specific and modality-agnostic decoders or present results in a scatter plot, showing modality-specific on the x-axis and modality-agnostic on the y-axis and color-code the modality (i.e., making it two scatter colors, one for images, one for captions). All points will end up above the diagonal.

      We updated the manuscript and added statistical evidence for the claims made:

      We now report results for the claim that when considering the average decoding performance for images and captions, modality-agnostic decoders perform better than modality-specific decoders, irrespective of the features that the decoders were trained on.

      Additionally, we report the average modality-agnostic and modality-specific decoding accuracies corresponding to Figure 4. For modality-agnostic decoders the average value is 81.86%, for modality-specific decoders trained on images 78.15%, and for modality-specific decoders trained on captions 72.52%. We did not add a separate bar to Figure 4 as this would add additional information to a Figure which is already very dense in its information content (cf. Reviewer 2’s recommendations for the authors). We therefore believe it is more useful to report the average values in the text and provide results for a statistical test comparing the decoder types. A scatter plot would make it difficult to include detailed information on the features, which we believe is crucial.

      We further provide statistical evidence for the observation regarding the directionality of cross-modal decoding.

      Reviewer #2 (Recommendations for the authors):

      For achieving more evidence to support modality-agnostic representations in the brain, I suggest more thorough analyses, for example:

      (1) Traditional searchlight RSA using different deep learning models. Through this approach, it might identify different brain areas that are sensitive to different formats of information (visual, text, multimodal); subsequently, compare the decoding performance using these ROIs.

      (2) Build more dissociable decoders for information of different modality formats, if possible. While I do not have a concrete proposal, more targeted decoder designs might better dissociate representational formats (i.e., unimodal vs. modality-agnostic).

      (3) A more detailed exploration of the "qualitative decoding results"--for example, quantitatively examining error types produced by modality-agnostic versus modality-specific decoders--would be informative for clarifying what specific content the decoder captures, potentially providing stronger evidence for modality-agnostic representations.

      Thanks for these suggestions. As the main goal of the paper is to introduce modality-agnostic decoders (which should be clearer from the updated manuscript, see also the general response to reviews), we did not include alternative methods for identifying modality-invariant regions. Nonetheless, we agree that more in-depth insight into the nature of the recorded representations will require analyses with additional methods such as RSA, comparisons with decoder designs that are more targeted in terms of their target features, and more in-depth error type analyses. We leave these analyses as promising directions for future work.

      The writing could be further improved in the introduction and, accordingly, the discussion. The authors listed a series of theories about conceptual representations; however, they did not systematically explain the relationships and controversies between them, and it seems that they did not aim to address the issues raised by these theories anyway. Thus, the extraction of core ideas is suggested. The difference between "modality-agnostic" and terms like "modality-independent," "modality-invariant," "abstract," "amodal," or "supramodal," and the necessity for a novel term should be articulated.

      The updated manuscript includes an improved introduction and discussion section that highlight the main focus and contributions of the study.

      We believe that a systematic comparison of theories on conceptual representations involving their relationships and controversies would require a dedicated review paper. Here, we focused on the aspects that are relevant for the study at hand (modality-invariant representations), for which we find that none of the considered theories can be rejected based on our results.

      Regarding the terminology (modality-agnostic vs. modality-invariant, etc.), please refer to the general response.

      The figures also have room to improve. For example, Figures 4 and 5 present dense bar plots comparing multiple decoding settings (e.g., modality-specific vs. modality-agnostic decoders, feature space, within-modal vs. cross-modal, etc.); while comprehensive, they would benefit from clearer labels or separated subplots to aid interpretation. All figures are recommended to be optimized for greater clarity and directness in future revisions.

      Thanks for this remark. We agree that the figures are quite dense in information. However, splitting them up into subplots (e.g. separate subplots for different decoder types) would make it much less straightforward to compare the accuracy scores between conditions. As the main goal of these figures is to compare features and decoder types, we believe that it is useful to keep all information in the same plot. 

      You also suggest improving the clarity of the labels. It is true that the top left legend of Figures 4 and 5 was mixing information about decoder type and broad classes of features (vision/language/multimodal). To improve clarity, we updated the figures and clearly separated information on decoder type (the hue of different bars) and features (x-axis labels). The broad classes of features (vision/language/multimodal) are distinguished by alternating light gray background colors and additional labels at the very bottom of the plots.

      The new plots allow for easy performance comparison of the different decoder types and additionally provide information on confidence intervals for the performance of modality-specific decoders, which was not available in the previous figures.

      Reviewer #3 (Recommendations for the authors):

      (1) As discussed in the Public Review, I think the paper would greatly benefit from clearer terminology. Instead of describing the decoders as "modality-agnostic" and "modality-specific", perhaps the authors could describe the decoding conditions based on the train and test modalities (e.g., "image-to-image", "caption-to-image", "multimodal-to-image") or using the terminology from Figure 3 (e.g., "within-modality", "cross-modality", "modality-agnostic").

      We updated our terminology to be clearer and more accurate, as outlined in the general response. The terms modality-agnostic and modality-specific refer to the training conditions, and the test conditions are described in Figure 3 and are used throughout the paper.

      (2) Line 244: I think the multimodal one-back task is an important aspect of the dataset that is worth highlighting. It seems to be a relatively novel paradigm, and it might help ensure that the participants are activating modality-agnostic representations.

      It is true that the multimodal one-back task could play an important role for the activation of modality-invariant representations. Future work could investigate to what degree the presence of widespread modality-invariant representations is dependent on such a paradigm.

      (3) Line 253: Could the authors elaborate on why they chose a random set of training stimuli for each participant? Is it to make the searchlight analyses more robust?

      A random set of training stimuli was chosen in order to maximize the diversity of the training sets, i.e. to avoid bias based on a specific subsample of the COCO dataset. Between-subject comparisons can still be made based on the test set, which was shared for all subjects, with the limitation that performance differences due to individual differences or to the different training sets cannot be disentangled. However, the main goal of the data collection was not to make between-subject comparisons based on common training sets, but rather to make group-level analyses based on a large and maximally diverse dataset.

      (4) Figure 4: Could the authors comment more on the patterns of decoding performance in Figure 5? For instance, it is interesting that ResNet is a better target than ViT, and BERT-base is a better target than BERT-large.

      A multitude of factors influence the decoding performance, such as feature dimensionality, model architecture, training data, and training objective(s) (Conwell et al. 2023; Raugel et al. 2025). BERT-base might be better than BERT-large because the extracted features are of lower dimension. ResNet might be better than ViT because of its architecture (CNN vs. Transformer). Exploring these differences in more depth would require further controlled analyses, but this is not the focus of this paper. The main objective of the feature comparison was to provide a broad overview of visual/linguistic/multimodal feature spaces and to identify the most suitable features for modality-agnostic decoding.

      Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A., & Konkle, T. (2023). What can 1.8 billion regressions tell us about the pressures shaping high-level visual representation in brains and machines? (p. 2022.03.28.485868). bioRxiv. https://doi.org/10.1101/2022.03.28.485868

      Raugel, J., Szafraniec, M., Vo, H.V., Couprie, C., Labatut, P., Bojanowski, P., Wyart, V. and King, J.R. (2025). Disentangling the Factors of Convergence between Brains and Computer Vision Models. arXiv preprint arXiv:2508.18226.

      (5) Figure 7: It is interesting that the modality-agnostic decoder predictions mostly appear traffic-related. Is there a possibility that the model always produces traffic-related predictions, making it trivially correct for the presented stimuli that are actually traffic-related? It could be helpful to include some examples where the decoder produces other types of predictions to dispel this concern.

      The presented qualitative examples were randomly selected. To make sure that the decoder is not always predicting traffic-related content, we included 5 additional randomly selected examples in Figures 6 and 7 of the updated manuscript. In only one of the 5 new examples the decoder was predicting traffic-related content, and in this case the stimulus had actually been traffic-related (a bus).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study explores chromatin organization around trans-splicing acceptor sites (TASs) in the trypanosomatid parasites Trypanosoma cruzi, T. brucei and Leishmania major. By systematically re-analyzing MNase-seq and MNase-ChIP-seq datasets, the authors conclude that TASs are protected by an MNase-sensitive complex that is, at least in part, histone-based, and that single-copy and multi-copy genes display differential chromatin accessibility. Altogether, the data suggest a common chromatin landscape at TASs and imply that chromatin may modulate transcript maturation, adding a new regulatory layer to an unusual gene-expression system.

      I value integrative studies of this kind and appreciate the careful, consistent data analysis the authors implemented to extract novel insights. That said, several aspects require clarification or revision before the conclusions can be robustly supported. My main concerns are listed below, organized by topic/result section.

      TAS prediction

      * Why were TAS predictions derived only from insect-stage RNA-seq data? Restricting TAS calls to one life stage risks biasing predictions toward transcripts that are highly expressed in that stage and may reduce annotation accuracy for lowly expressed or stage-specific genes. Please justify this choice and, if possible, evaluate TAS robustness using additional transcriptomes or explicitly state the limitation.

      TAS predictions were derived only from insect-stage RNA-seq data because a previous study showed that there are no significant differences in 5' UTR processing between T. cruzi life stages (https://doi.org/10.3389/fgene.2020.00166). We did not test an additional transcriptome here because the robustness of the software was already demonstrated in the original article where UTRme was described (Radio S, 2018, doi:10.3389/fgene.2018.00671).

      Results - "There is a distinctive average nucleosome arrangement at the TASs in TriTryps":

      * You state that "In the case of L. major the samples are less digested." However, Supplementary Fig. S1 suggests that replicate 1 of L. major is less digested than the T. brucei samples, while replicate 2 of L. major looks similarly digested. Please clarify which replicates you reference and correct the statement if needed.

      The reviewer has a good point. We made our statement based on the value of the maximum peak of the sequenced DNA molecules, which in general is a good indicator of the extent of digestion achieved by the sample (Cole H, NAR, 2011).

      As the reviewer correctly points out, we should have also considered the length of the DNA molecules in each percentile. However, in this case both the T. brucei and the L. major samples were gel purified before sequencing, and it is hard to know exactly which fragments were left behind in each case. Therefore, it is better not to draw strong conclusions in that regard.

      We have now commented on this in the main manuscript, and we have clarified in the figure legends and in Table S1 which data set we used in each case.

      * It appears you plot one replicate in Fig. 1b and the other in Suppl. Fig. S2. Please indicate explicitly which replicate is in each plot. For T. brucei, the NDR upstream of the TAS is clearer in Suppl. Fig. S2 while the TAS protection is less prominent; based on your digestion argument, this should correspond to the more-digested replicate. Please confirm.

      The replicates used to construct each figure are explicitly indicated in Table S1. Although the table details the original publication, the project and the accession number for each data set, the reviewer is correct that it was still not completely clear which length distribution heatmap each sample was associated with. To avoid this confusion, we have now added the accession number for each data set to the figure legends and also clarified this in Table S1. Regarding the reviewer's comment on the correspondence between the observed TAS protection and the extent of sample digestion, the reviewer is correct that for a more digested sample we would expect a clearer NDR. In this case, the difference in the extent of digestion between these two samples is minor: the length of the main peak in the length distribution histogram of sequenced DNA molecules is the same. These two samples, GSM5363006 (represented in Fig. 1b) and GSM5363007 (represented in S2), belong to the same original paper (Maree et al. 2017), and both were gel purified before sequencing. Therefore, any difference between them could result not only from a minor difference in the digestion level achieved in each experiment but also from a bias in the fragments included or excluded during gel purification. We would therefore not draw strong conclusions about TAS protection from this comparison. We have now included a brief comment on this in the discussion of the figure.

      * The protected region around the TAS appears centered on the TAS in T. brucei but upstream in L. major. This is an interesting difference. If it is technical (different digestion or TAS prediction offset), explain why; if likely biological, discuss possible mechanisms and implications.

      We appreciate the reviewer's suggestion. We cannot be sure whether it is due to technical or biological reasons, but there is evidence that the L. major genome has a different dinucleotide content, which might have an impact on nucleosome assembly. We have now added a comment about this observation in the final discussion of the manuscript.

      Additionally, we analyzed recently published DRIP-seq data for L. major (doi: 10.1038/s41467-025-56785-y), and we observed that the R-loop footprint co-localized with the MNase-protected region upstream of the TAS (new S5 Fig), suggesting that the shift is not related to the MNase-seq technique.

      Results - "An MNase sensitive complex occupies the TASs in T. brucei":

      * The definition of "MNase activity" and the ordering of samples into Low/Intermediate/High digestion are unclear. Did you infer digestion levels from fragment distributions rather than from controlled experimental timepoints? In Suppl. Fig. S3a it is not obvious how "Low digestion" was defined; that sample's fragment distribution appears intermediate. Please provide objective metrics (e.g., median fragment length, fraction 120-180 bp) used to classify digestion levels.

      As the reviewer suggests, the ideal experiment would be to perform a time course of the MNase reaction with all the samples in parallel, or to work with a fixed time point while adding increasing amounts of MNase. However, even with controlled experimental timepoints, you need to check the length distribution histogram of sequenced DNA molecules to be sure which level of digestion you have achieved.

      In this particular case, we used publicly available data sets for this analysis. We made an arbitrary definition of low, intermediate and high levels of digestion, not as an absolute level of digestion but as a comparative output among the tested samples. We based our definition on the comparison of the main peak in the length distribution heatmaps, because this parameter is the best metric to estimate the level of digestion of a given sample. It represents the percentage of the total DNA sequenced that has the predominant length in the sample tested. Hence, we considered:

      low digestion: when the main peak is longer than the expected protection for a nucleosome (longer than 150 bp). We expect this sample to contain additional longer bands that correspond to less digested material.

      intermediate digestion: when the main peak is the length expected for nucleosome core protection (~146-150 bp).

      high digestion: when the main peak is shorter than that (shorter than 146 bp). This case is normally accompanied by a bigger dispersion in fragment sizes.
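
The three categories above can be written as a small classifier. This is a sketch under stated assumptions rather than code used in the analysis: the function names `main_peak` and `classify_digestion` are hypothetical, the input is assumed to be a plain list of sequenced fragment lengths in bp, and the main peak is taken to be the most frequent length. The thresholds (146 and 150 bp) are the ones given in the definitions.

```python
from collections import Counter

def main_peak(fragment_lengths):
    """Return (predominant length in bp, fraction of fragments with that length).

    Hypothetical helper: the main peak is taken as the most frequent length;
    ties are resolved in favor of the shorter length.
    """
    counts = Counter(fragment_lengths)
    length, n = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return length, n / len(fragment_lengths)

def classify_digestion(fragment_lengths, core_min=146, core_max=150):
    """Relative digestion label based on the main peak of the length distribution."""
    peak, _ = main_peak(fragment_lengths)
    if peak > core_max:
        return "low"           # main peak longer than nucleosome core protection
    if peak < core_min:
        return "high"          # main peak shorter than nucleosome core protection
    return "intermediate"      # main peak within ~146-150 bp

# Toy example: most fragments are 147 bp, with a few shorter ones.
label = classify_digestion([147] * 5 + [120] * 2)
```

As stated in the response, these labels are comparative among the tested samples and are not an absolute measure of digestion.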

      For this analysis, we chose samples that show different MNase protection of the TAS when all sequenced DNA molecules are plotted relative to this point, and we used this protection as a predictor of the extent of sample digestion (Figure 2). To corroborate our hypothesis that the degree of TAS protection was indeed related to the extent of MNase digestion of a given sample, we looked at the length distribution histogram of the sequenced DNA molecules in each case. This is the best measurement of the extent of digestion achieved, especially when sequencing the whole sample without any gel purification and representing all the reads in the analysis, as we did. The only caveat concerns the sample called "intermediate digestion 1", which belongs to the original work of Maree 2017, since only this data set was gel purified. To avoid this problem, we decided to remove these data from Figures 2 and S3. In summary, the 3 remaining samples come from the same lab and belong to the same publication (Maree 2022). These samples are the inputs of native MNase-ChIP-seq, obtained in the same way, and are fully comparable with each other.

      * Several fragment distributions show a sharp cutoff at ~100-125 bp. Was this due to gel purification or bioinformatic filtering? State this clearly in Methods. If gel purification occurred, that can explain why some datasets preserve the MNase-sensitive region.

      The sharp cutoff is due neither to gel purification nor to bioinformatic filtering; it simply reflects the length of the paired-end reads used in each case. In earlier works it was most common to sequence only 50 bp; as technologies improved, this went up to 75, 100 or 125 bp. We have now clarified in Table S1 the length of the paired-end reads used in each case, when possible.

      * Please reconcile cases where samples labeled as more-digested contain a larger proportion of >200 bp fragments than supposedly less-digested samples; this ordering affects the inference that digestion level determines the loss/preservation of TAS protection. Based on the distributions I see, "Intermediate digestion 1" appears most consistent with an expected MNase curve - please confirm and correct the manuscript accordingly.

      As explained above, it is a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme, which favors AT-rich sequences.

      The rationale is as follows. When chromatin is digested with MNase to map nucleosomes genome-wide, the ideal situation would be to recover all of the material in the mononucleosome band. Because MNase digests protected DNA less efficiently but, if the reaction proceeds further, always ends up destroying part of it, the result is always far from perfect. The best we can achieve is a sample where ~80% of the material is contained in the mononucleosome band. The main point is that even in the best scenario there are always some additional longer bands, such as those for di- or tri-nucleosomes. With further digestion, less than 80% remains in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start being digested as well, producing a bigger dispersion in fragment sizes. How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are those that were originally most highly protected by proteins or higher-order structure, or that have a low AT content, making their linker DNA extremely resistant to initial cleavage. Once the majority of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion in the final material. The bottom line is that it is not good practice to work with under- or over-digested samples. Our main point is that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.

      Results - "The MNase sensitive complexes protecting the TASs in T. brucei and T. cruzi are at least partly composed of histones":

      * The evidence that histones are part of the MNase-sensitive complex relies on H3 MNase-ChIP signal in subnucleosomal fragment bins. This seems to conflict with the observation (Fig. 1) that fragments protecting TASs are often nucleosome-sized. Please reconcile these points: are H3 signals confined to subnucleosomal fragments flanking the TAS while the TAS itself is depleted of H3? Provide plots that compare MNase-seq and H3 ChIP signals stratified by consistent fragment-size bins to clarify this.

      What we have learned from other, deeply studied eukaryotic organisms such as yeast is that NDRs are normally generated at regulatory points in the genome. In this sense, yeast tRNA genes have a complex with a bootprint smaller than a nucleosome, formed by TFIIIC-TFIIIB (Nagarajavel, doi: 10.1093/nar/gkt611). On the other hand, many promoter regions have an MNase-sensitive complex with a nucleosome-size footprint that does not contain histones (Chereji et al. 2017, doi:10.1016/j.molcel.2016.12.009). The reviewer is right that Figures 1 and S2 show that the footprint of whatever occupies the TAS region, especially in T. brucei, is nucleosome-sized. However, this only shows the size and does not prove the nature of its components. Those are only MNase-seq data sets; since they do not include a precipitation with specific antibodies, we cannot confirm that the protecting complex is made up of histones. In parallel, a complementary study by Wedel 2017, from Siegel's lab, shows that when a properly digested sample is immunoprecipitated with an anti-H3 antibody, the TAS is not protected by nucleosomes, at least not when analyzing nucleosome-size DNA molecules. In addition, Briggs et al. 2018 (doi: 10.1093/nar/gky928) showed that, at least at intergenic regions, H3 occupancy goes down while R-loop accumulation increases. We have now added a new Figure 4 replotting R-loops and MNase-ChIP-seq for H3 relative to our predicted TAS, showing this anti-correlation and how it partly correlates with MNase protection as well. As a control, we show that the Rpb9 trend resembles that of H3, as Siegel's lab has shown in Wedel 2018. Moreover, we analyzed data from a recently published paper (doi: 10.1038/s41467-025-56785-y) and added a new supplemental figure showing that a similar correlation between MNase protection and R-loop footprint occurs in L. major (S5 Fig).

      * Please indicate which datasets are used for each panel in Suppl. Fig. S4 (e.g., Wedel et al., Maree et al.), and avoid calling data from different labs "replicates" unless they are true replicates.

      In most of our analyses we used true replicate experiments. Such is the case for the MNase-seq data used in Figure 1, with the corresponding replicate experiments used in Figure S2, and for the T. cruzi MNase-ChIP-seq data used in Figures 3b and 4a, with the respective replicate used in Figures S4 and S5 (now S6 in the revised manuscript). The only case in which we used experiments from two different laboratories is MNase-ChIP-seq for H3 from T. brucei. Unfortunately, there are only two public data sets, each from a different laboratory. The samples used in Fig. 3 come from Siegel's lab, whereas the H3 IP represented in S4 and S5 (S6 in the updated version) comes from another lab (Patterton's). To be more rigorous, we now call them data 1 and 2 when comparing these particular cases.

      The reviewer is right that in this particular case one is native chromatin (Patterton's) while the other is crosslinked (Siegel's). We have now clarified in the main text that unfortunately we do not have a replicate, but the result remains the same under both conditions. This is compatible with our own experience, where crosslinking does not affect global nucleosome patterns (compare nucleosome organization from crosslinked chromatin MNase-seq inputs in Chereji, Mol Cell, 2017, doi: 10.1016/j.molcel.2016.12.009, and native MNase-seq from Ocampo, NAR, 2016, doi: 10.1093/nar/gkw068).

      * Several datasets show a sharp lower bound on fragment size in the subnucleosomal range (e.g., ~80-100 bp). Is this a filtering artifact or a gel-size selection? Clarify in Methods and, if this is an artifact, consider replotting after removing the cutoff.

      We only filtered out adapter dimers or overrepresented sequences when needed. In Figures 2 and S3 we represented all the sequenced reads. In other figures, when we sort fragment sizes in silico (such as nucleosome, dinucleosome or subnucleosome size ranges), we make a note in the figure legends. What the reviewer points out is related to the length of the sequenced DNA fragments in each experiment. As explained above, the older data sets were generated with 50 bp paired-end reads, whereas the newer ones use 75, 100 or 125 bp. This information is now clarified in Table S1.

      Results - "The TASs of single and multi-copy genes are differentially protected by nucleosomes":

      * Please include T. brucei RNA-seq data in Suppl. Fig. S5b as you did for T. cruzi.

      We showed chromatin organization for T. brucei in the previous S5b to illustrate that there is a similar trend. Unfortunately, we did not obtain a robust list of multi-copy genes for T. brucei as we did for T. cruzi; therefore, we do not want to overinterpret by showing the RNA-seq for these subsets of genes. The limitation is related to the fact that UTRme restricts the search and is extremely strict when calling sites in repetitive regions. Additionally, in response to the request of one reviewer, we have now changed the UTR predictions for T. brucei using a different RNA-seq data set from Lister 427 (details in the Methods section). Given that with the new predictions it was even harder to obtain the list of multi-copy genes for T. brucei, we decided to remove that figure from the updated version of the manuscript.

      * Discuss how low or absent expression of multigene families affects TAS annotation (which relies on RNA-seq) and whether annotation inaccuracies could bias the observed chromatin differences.

      Mapping and annotating sites that belong to repetitive regions is highly complex. UTRme is specifically designed to avoid overcalling those sites. In other words, we may be underestimating the number of predicted TASs at multi-copy genes. Regarding the impact on the chromatin analysis, we cannot rule out an effect, but the observation favors our conclusion: even though some TASs at multi-copy genes may remain elusive, we observe higher nucleosome density at those sites.

      * The statement that multi-copy genes show an "oscillation" between AT and GC dinucleotides is not clearly supported: the multi-copy average appears noisier and is based on fewer loci. Please tone down this claim or provide statistical support that the pattern is periodic rather than noisy.

      We have now fixed this in the preliminary revised version.

      * How were multi-copy genes defined in T. brucei? Include the classification method in Methods.

      This classification was done in the same way as explained for T. cruzi. However, we decided to remove the supplemental figure that included this sorting.

      Genomes and annotations: * If transcriptomic data for the Y strain was used for T. cruzi, please explain why a Y strain genome was not used (e.g., Wang et al. 2021 GCA_015033655.1), or justify the choice. For T. brucei, consider the more recent Lister 427 assembly (Tb427_2018) from TriTrypDB. Use strain-matched genomes and transcriptomes when possible, or discuss limitations.

      The most appropriate way to analyze high-throughput data is to align it to the genome of the strain in which the experiments were conducted. This was clearly illustrated in a previous publication from our group, where we explained how data from the hybrid CL Brener strain should be analyzed. A common practice in the past was to use only the Esmeraldo-like genome for simplicity, but this resulted in output artifacts. Therefore, we aligned the data to the CL Brener genome and then focused the main analysis on the Esmeraldo haplotype (Beati, PLOS ONE, 2023). Ideally, we would have had transcriptomic data for the same strain (CL Brener or Esmeraldo). Since this was not the case at that moment, we used data from the Y strain, which belongs to the same DTU as Esmeraldo.

      In the case of T. brucei, when we started our analysis and the software code for UTRme was written, only the previous version of the genome was available. When the 2018 version was released, we checked the chromatin parameters and observed that the main observations did not change. Therefore, we continued working with our previous setup.

      Reproducibility and broader integration: * Please share the full analysis pipeline (ideally on GitHub/Zenodo) so the results are reproducible from raw reads to plots.

      We are preparing a full pipeline on GitHub. We will make it available before the full revision of the manuscript.

      * As an optional but helpful expansion, consider including additional datasets (other life stages, BSF MNase-seq, ATAC-seq, DRIP-seq) where available to strengthen comparative claims.

      We are now including a new figure 4 and a supplemental figure 5 including DRIP-seq and Rp9 ChIP-seq for T. brucei (revised Fig 4) and DRIP-seq for L. major (S5 Fig). Additionally, we added FAIRE-seq data to previous Fig 4 now Fig 5 (revised Fig 5C).

      We are analyzing ATAC-seq data for T. brucei.

      Regarding BSF MNase-seq, the original article by Mareé 2017 claims that there is no significant difference in average chromatin organization between the two life forms; therefore, it is not worth including that analysis.

      Optional analyses that would strengthen the study: * Stratify single-copy genes by expression (high / medium / low) and examine average nucleosome occupancy at TASs for each group; a correlation between expression and NDR depth would strengthen the functional link to maturation.

      We have now included a panel in supplemental figure 5 (now revised S6), showing the concordance of chromatin organization relative to the TAS for genes stratified by RNA-seq levels.
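
      The stratification step can be sketched as follows. This is only an illustration under stated assumptions: genes are split into three equally sized rank-based groups (terciles), expression values are supplied as a plain dictionary, and the function name is hypothetical. The cutoffs actually used for the revised S6 figure are not specified here.

```python
def stratify_by_expression(expression):
    """Assign each gene to 'low', 'medium' or 'high' by expression rank.

    expression: dict mapping gene ID to an expression value (e.g. TPM).
    Rank-based terciles are an assumption made for this illustration.
    """
    # Sort by expression value, using the gene ID to break ties deterministically
    ranked = sorted(expression, key=lambda gene: (expression[gene], gene))
    n = len(ranked)
    groups = {}
    for rank, gene in enumerate(ranked):
        if rank * 3 < n:
            groups[gene] = "low"
        elif rank * 3 < 2 * n:
            groups[gene] = "medium"
        else:
            groups[gene] = "high"
    return groups
```

      Average nucleosome occupancy around the TAS would then be computed separately for the genes in each group.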

      __Minor / editorial comments:__ * In the Introduction, the sentence "transcription is initiated from dispersed promoters and in general they coincide with divergent strand switch regions" should be qualified: such initiation sites also include single transcription start regions.

      We have clarified this in the preliminary revised version.

      * Define the dotted line in length distribution plots (if it is not the median, please clarify) and consider placing it at 147 bp across plots to ease comparison.

      The dotted line is just to indicate where the maximum peak is located. It is now clarified in figure legends.

      * In Suppl. Fig. 4b "Replicate2" the x-axis ticks are misaligned with labels - please fix.

      We have now fixed the figure. Thanks for noticing this mistake.

      * Typo in the Introduction: "remodellingremodeling" → "remodeling"

      Thanks for noticing this mistake. It is fixed in the current version of the manuscript.

      **Referee cross-commenting** Comment 1: I think Reviewer #2 and Reviewer #3 missed that the authors of this manuscript do cite and consider the results from Wedel et al. 2017. They even re-analysed their data (e.g. Figure 3a). I second Reviewer #2's comment indicating that the inclusion of a schematic figure to help readers visualize and better understand the findings would be an important addition.

      Comment 2: I agree with Reviewer #3 that the use of different MNase digestion procedures in the different datasets has to be considered. On the other hand, I don't think there is a problem with figure 1 showing an MNase-protected TAS for T. brucei as it is based on MNase-seq data and reproduces the reported results (Maree et al. 2017). What the Siegel lab did in Wedel et al. 2017 was MNase-ChIP-seq of H3 showing nucleosome depletion at TAS, but both results are not necessarily contradictory: There could still be something else (which does not contain H3) sitting on the TAS protecting it from MNase digestion.

      Reviewer #1 (Significance (Required)):

      This study provides a systematic comparative analysis of chromatin landscapes at trans-splicing acceptor sites (TASs) in trypanosomatids, an area that has been relatively underexplored. By re-analyzing and harmonizing existing MNase-seq and MNase-ChIP-seq datasets, the authors highlight conserved and divergent features of nucleosome occupancy around TASs and propose that chromatin contributes to the fidelity of transcript maturation. The significance lies in three aspects: 1. Conceptual advance: It broadens our understanding of gene regulation in organisms where transcription initiation is unusual and largely constitutive, suggesting that chromatin can still modulate post-transcriptional processes such as trans-splicing. 2. Integrative perspective: Bringing together data from T. cruzi, T. brucei and L. major provides a comparative framework that may inspire further mechanistic studies across kinetoplastids. 3. Hypothesis generation: The findings open testable avenues about the role of chromatin in coordinating transcript maturation, the contribution of DNA sequence composition, and potential interactions with R-loops or RNA-binding proteins. Researchers in parasitology, chromatin biology, and RNA processing will find it a useful resource and a stimulus for targeted experimental follow-up.

      My expertise is in gene regulation in eukaryotic parasites, with a focus on bioinformatic analysis of high-throughput sequencing data

      __Reviewer #2 (Evidence, reproducibility and clarity (Required)):__

      Siri et al. perform a comparative analysis using publicly available MNase-seq data from three trypanosomatids (T. brucei, T. cruzi, and Leishmania), showing that a similar chromatin profile is observed at TAS (trans-splicing acceptor site) regions. The original studies had already demonstrated that the nucleosome profile at TAS differs from the rest of the genome; however, this work fills an important gap in the literature by providing the most reliable cross-species comparison of nucleosome profiles among the tritryps. To achieve this, the authors applied the same computational analysis pipeline and carefully evaluated MNase digestion levels, which are known to influence nucleosome profiling outcomes.

      In my view, the main conclusion is that the profiles are indeed similar-even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. The manuscript could be improved with some clarifications and adjustments:

      1. The authors state from the beginning that available MNase data indicate altered nucleosome occupancy around the TAS. However, they could also emphasize that the conclusions across the different trypanosomatids are inconsistent and even contradictory: NDR in T. cruzi versus protection-in different locations-in T. brucei and Leishmania.

      We start our manuscript by referring to the first MNase-seq data sets publicly available for each TriTryp, and we point out that one of the main observations in each of them is a change in nucleosome density or occupancy at intergenic regions. In T. cruzi, in a previous publication from our group, we established that this intergenic drop in nucleosome density occurs near the trans-splicing acceptor site. In this work, we extend our study to the other members of the TriTryps: T. brucei and L. major.

      In T. brucei, the papers from Patterton's lab and Siegel's lab came out almost simultaneously in 2017. Hence, they do not comment on each other's work. The first one claims the presence of a well-positioned nucleosome at the TAS by using MNase-seq, while the second one shows an NDR at the TAS by using MNase-ChIP-seq. However, we do not think they are contradictory or inconsistent. We brought them together throughout the manuscript because we think these works provide complementary information.

      On one hand, we infer that the data from Patterton's lab is slightly less digested than the sample from Siegel's lab. Therefore, we discuss that this moderate digestion must be the reason why they managed to detect an MNase-protecting complex sitting at the TAS (Figure 1). On the other hand, Siegel's lab includes an additional step by performing MNase-ChIP-seq, showing that when analyzing nucleosome-size fragments, histones are not detected at the TAS. Here, we go further in this analysis in Figure 3, showing that only when looking at subnucleosome-size fragments can we detect histone H3. This is also true for T. cruzi.

      By integrating every analysis in this work and the previous ones, we propose that TASs are protected by an MNase-sensitive complex (proved in Figure 2). This complex is most likely only partly formed by histones, since only when analyzing subnucleosome-size DNA molecules can we detect histone H3 (Figure 3). To be sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs, 2018 doi: 10.1093/nar/gky928) and that R-loops have plenty of interacting proteins (Girasol, 2023 10.1093/nar/gkad836). Therefore, most likely, this MNase-sensitive complex has a hybrid nature, made up of H3 and some other regulatory molecules, possibly involved in trans-splicing. We have now added a new figure 4 showing R-loop co-localization with the NDR.

      Regarding the comparison between different organisms, after explaining the sensitivity to MNase of the TAS-protecting complex, we discuss that when comparing equally digested samples T. cruzi and T. brucei display a similar chromatin landscape with a mild NDR at the TAS (see T. cruzi represented in Figure 1 compared to T. brucei represented in Intermediate digestion 2 in Figure 2, intermediate digestion in the revised manuscript). Unfortunately, we cannot make a good comparison with L. major, since we do not have a sample with a similar level of digestion. However, by analyzing a recently published DRIP-seq data set for L. major, we show that the R-loop signal co-localizes with MNase protection in a similar way (new S5 Fig).

      Another point that requires clarification concerns what the authors mean in the introduction and discussion when they write that trypanosomes have "...poorly organized chromatin with nucleosomes that are not strikingly positioned or phased." On the other hand, they also cite evidence of organization: "...well-positioned nucleosome at the spliced-out region.. in Leishmania (ref 34)"; "...a well-positioned nucleosome at the TASs for internal genes (ref37)"; "...a nucleosome depletion was observed upstream of every gene (ref 35)." Aren't these examples of organized chromatin with at least a few phased nucleosomes? In addition, in ref 37, figure 4 shows at least two (possibly three to four) nucleosomes that appear phased. In my opinion, the authors should first define more precisely what they mean by "poorly organized chromatin" and clarify that this interpretation does not contradict the findings highlighted in the cited literature.

      For a better understanding of nucleosome positioning and phasing, I recommend the review by Clark 2010, doi:10.1080/073911010010524945, Figure 4. Briefly, in a cell population there are different alternative positions that a given nucleosome can adopt. However, some are more favorable. When talking about favorable positions, we refer to the coordinates in the genome that are most likely covered by a nucleosome and are predominant in the cell population. Additionally, nucleosomes can be phased or not. This refers not only to the position in the genome, but also to the distance relative to a given point. In yeast, or in highly transcribed genes of more complex eukaryotes, nucleosomes are regularly spaced and phased relative to the transcription start site (TSS) or to the +1 nucleosome (Ocampo, NAR, 2016, doi:10.1093/nar/gkw068). In trypanosomes, nucleosomes show some regular distribution on browser inspection but, given that they are not properly phased with respect to any point, it is almost impossible to estimate spacing from paired-end data. This is also consistent with a chromatin that is transcribed in an almost constitutive manner.

      As the reviewer mentions, we do cite evidence of organization. We think the original observations are correct, but we do not fully agree with some of the original statements. In this manuscript our aim is to take the best of what we learned from the original works and to make a constructive contribution that adds to the original discussions. In this regard, in trypanosomes there are some conserved patterns in the chromatin landscape, but their nucleosomes are far from being well-positioned or phased. For a better understanding, compare the variations observed on the y axis when representing average nucleosome occupancy in yeast with those observed in trypanosomes: the troughs and peaks are much more prominent in yeast than in any TriTryp member.

      Following the reviewer's suggestion we have now clarified this in the main text.

      The paper would also benefit from the inclusion of a schematic figure to help readers visualize and better understand the findings. What is the biological impact of having nucleosomes, di-nucleosomes, or sub-nucleosomes at TAS? This is not obvious to readers outside the chromatin field. For example, the following statement is not intuitive: "We observed that, when analyzing nucleosome-size (120-180 bp) DNA molecules or longer fragments (180-300 bp), the TASs of either T. cruzi or T. brucei are mostly nucleosome-depleted. However, when representing fragments smaller than a nucleosome-size (50-120 bp) some histone protection is unmasked (Fig. 3 and Fig. S4). This observation suggests that the MNase sensitive complex sitting at the TASs is at least partly composed of histones." Please clarify.

      We appreciate the reviewer's suggestion to make a schematic figure. We have now added a new Figure 6.

      Regarding the biological impact of having mono-, di- or subnucleosome fragments, it is important to determine the fragment size of the protected DNA to infer the nature of the protecting complex. In the case of tRNA genes in yeast, footprints smaller than a nucleosome were found at pol III promoters, and these turned out to be TFIIIB-TFIIIC (Nagarajavel, doi: 10.1093/nar/gkt611). Therefore, detecting something smaller than a nucleosome might suggest the binding of trans-acting factors other than histones, or involving histones in a mixed complex. Such mixed complexes have also been observed, as in the case of the centromeric nucleosome, which has a very peculiar composition (Ocampo and Clark, Cell Reports, 2015). On the other hand, if we instead detect bigger fragments, it could indicate the presence of bigger protecting molecules, or that those regions are part of a higher-order chromatin organization that is still inaccessible to MNase linker digestion.

      Here we show in 2D plots that the complex or components protecting the TAS are nucleosome-sized, but we cannot be sure they are entirely made up of histones, since only when looking at subnucleosome-size fragments are we able to detect histone H3. We have now added part of this explanation to the discussion.

      By integrating every analysis in this work and the previous ones, we propose that the TAS is protected by an MNase-sensitive complex (Figure 2). This complex is most likely only partly formed by histones, since only when analyzing subnucleosome-size DNA molecules can we detect histone H3 (Figure 3). As explained above, to be sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs 2018) and that R-loops have plenty of interacting proteins (Girasol, 2023). Therefore, most likely, this MNase-sensitive complex has a hybrid nature, made up of H3 and some other regulatory molecules. We have now added a new figure 4 showing partial R-loop co-localization with MNase protection.

      Some references are missing or incorrect:

      We will make a thorough revision.

      "In trypanosomes, there are no canonical promoter regions." - please check Cordon-Obras et al. (Navarro's group).

      Thank you for the appropriate suggestion. We have now added this reference.

      Please, cite the study by Wedel et al. (Siegel's group), which also performed MNase-seq analysis in T. brucei.

      We understand that Reviewer #2 missed that we cited this reference and that we did use the raw data from the manuscript of Wedel et al. 2017 from Siegel's group. We used the MNase-ChIP-seq data set of histone H3 in our analysis for Figures 3, S4 and S6 (in the revised version), as also detailed in Table S1. To be even more explicit, we have now included the accession number of each data set in the figure legends.

      Figure-specific comments: Fig. S3: Why does the number of larger fragments increase with greater MNase digestion? Shouldn't the opposite be expected?

      This is a good observation. As we also explained to Reviewer #1:

      It's a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme.

      The rationale is as follows: when you digest chromatin with MNase with the objective of mapping nucleosomes genome-wide, the ideal situation would be to recover all the material in the mononucleosome band. MNase is less efficient at digesting protected DNA but, if the reaction proceeds further, it always ends up destroying part of it, so the result is always far from perfect. The best we can achieve is to obtain samples where ~80% of the material is contained in the mononucleosome band. __And here comes the main point:__ even in the best scenario, you always have some additional longer bands, such as those for di- or tri-nucleosomes. If you keep digesting, you will get less than 80% in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a greater dispersion in fragment sizes. How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, making their linker DNA extremely resistant to initial cleavage. Once most of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion. Hence, you end up with a greater dispersion in the length distribution of the final material. The bottom line is that it is not good practice to work with under- or overdigested samples. Our main point is to emphasize that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.

      Minor points:

      There are several typos throughout the manuscript.

      Thanks for the observation. We will check carefully.

      Methods: "Dinucelotide frecuency calculation."

      We will add the code to GitHub.

      Reviewer #2 (Significance (Required)):

      In my view, the main conclusion is that the profiles are indeed similar-even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. Audience: basic science and specialized readers.

      Expertise: epigenetics and gene expression in trypanosomatids.

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)):__

      The authors analysed publicly accessible MNase-seq data in TriTryps parasites, focusing on the chromatin structure around trans-splicing acceptor sites (TASs), which are vital for processing gene transcripts. They describe a mild nucleosome depletion at the TAS of T. cruzi and L. major, whereas a histone-containing complex protects the TASs of T. brucei. In the subsequent analysis of T. brucei, they suggest that a Mnase-sensitive complex is localised at the TASs. For single-copy versus multi-copy genes, the authors show different di-nucleotide patterns and chromatin structures. Accordingly, they propose this difference could be a novel mechanism to ensure the accuracy of trans-splicing in these parasites.

      Before providing an in-depth review of the manuscript, I note that some missing information would have helped in assessing the study more thoroughly; however, in the light of the available information, I provide the following comments for consideration.

      The numbering of the figures, including the figure legends, is missing in the PDF file. This is essential for assessing the provided information.

      We apologize for not including the figure numbers in the main text, although the figures are in the right place when called in the text. The omission was made unintentionally when the figure legends were moved to the bottom of the main text. This is now fixed in the updated version of the manuscript.

      The publicly available Mnase-seq data are manyfold, with multiple datasets available for T. cruzi, for example. It is unclear from the manuscript which dataset was used for which figure. This must be clarified.

      This was detailed in Table S1. We have now replaced the table by an improved version, and we have also included the accession number of each data set used in the figure legends.

      Why do the authors start in figure 1 with the description of an MNase-protected TAS for T. brucei, given that it has been clearly shown by the Siegel lab that there is a nucleosome depletion similar to other parasites?

      We did not want to ignore the paper from Patterton's lab because it was the first one to map nucleosomes genome-wide in T. brucei, and its main finding claimed the existence of a well-positioned nucleosome at intergenic regions, which we thought was a point worth discussing. Patterton's work uses MNase-seq from gel-purified samples and provides replicated experiments sequenced at really good depth, whereas Siegel's lab uses MNase-ChIP-seq of histone H3 but performs only one experiment, and its input was not sequenced. So, each work has its own caveats and provides different information, and together they contribute to a more comprehensive study. We think that bringing both data sets into the discussion, as we have done in Figures 1 and 3, helps us and the community working in the field to enrich the discussion.

      If the authors re-analyse the data, they should compare their pipeline to those used in the other studies, highlighting differences and potential improvements.

      We are working on this point. We will provide a more detailed description in the final revision.

      Since many figures resemble those in already published studies, there seems little reason to repeat and compare without a detailed comparison of the pipelines and their differences.

      Following the reviewer's advice, we are now working on highlighting the main differences that justify analyzing the data the way we did; these will be added to the Methods section of the final revision.

      At first glance, some of the figures might look similar to those in the original manuscripts. However, a careful and detailed reading of our manuscript shows that we have added several analyses that unveil information that was not disclosed before.

      First, we perform a systematic comparison, analyzing every data set the same way from beginning to end; the main difference from previous studies is the thorough and precise prediction of TASs for the three organisms. Second, we represent the average chromatin organization relative to those predicted TASs for TriTryps and discuss their global patterns. Third, by representing the average chromatin organization as heatmaps, we show for the very first time that those average nucleosome landscapes are not just an average: a similar organization is maintained across most of the genome. This was not done in any of the previous manuscripts except for our own (Beati, PLOS ONE, 2023). Additionally, we introduce the discussion of how the extent of the MNase reaction can affect the output of these experiments, and we show 2D plots and length distribution heatmaps to discuss this point (a point completely ignored in all the chromatin literature for trypanosomes). Furthermore, we made a far-reaching analysis by considering the contributions of each published work, even when addressed by different techniques. Finally, we discuss our findings in the context of a topic of current interest in the field, TriTryp genome compartmentalization.

      Several previous Mnase-seq analysis studies addressing chromatin accessibility emphasized the importance of using varying degrees of chromatin digestion, from low to high digestion (30496478, 38959309, 27151365).

      The reviewer is correct, and this point is exactly what we intended to illustrate in Figure 2. We appreciate the suggested references, which we now cite in the final discussion. Just to clarify, using varying degrees of chromatin digestion is useful for drawing conclusions about a given organism, but when comparing samples, strains, histone marks, etc., it is extremely important to select similarly digested samples.

      No information on the extent of DNA hydrolysis is provided in the original Mnase-seq studies. This key information can not be inferred from the length distribution of the sequenced reads.

      The reviewer is correct that "No information on the extent of DNA hydrolysis is provided in the original Mnase-seq studies", and this is another reason why it is so important for our analysis to be published and discussed by the scientific community working on trypanosomes. We disagree with the reviewer on the second statement, since the level of digestion of a sequenced sample is actually tested by representing the length distribution of the total DNA sequenced. It is true that before sequencing you can, and should, check the level of digestion of the purified samples in an agarose gel and/or in a bioanalyzer. It can also be tested after library preparation but before sequencing, where the fragment sizes are expected to increase by the length of the library adapters. But the final test of success when working with MNase-digested samples is to analyze the length of the DNA molecules by representing histograms of the length distribution of the sequenced DNA molecules. Remarkably, on occasion different samples might look very similar when run in a gel but render different length distribution histograms. This is because the nucleosome core could be intact while the linker DNA associated with it has been trimmed to a different extent, or the core may even have been chewed inside (see Cole Hope 2011, section 5.2, doi: 10.1016/B978-0-12-391938-0.00006-9, for a detailed explanation).
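
      The kind of length-distribution check described above can be sketched as follows. It assumes that fragment lengths (in bp) have already been derived from the paired-end alignments; the 120-180 bp window used as a proxy for the mononucleosome band, the tie-breaking rule, and the function names are illustrative assumptions rather than the exact procedure used in the manuscript.

```python
from collections import Counter


def length_histogram(lengths):
    """Return a {fragment length: count} histogram of sequenced DNA molecules."""
    return Counter(lengths)


def modal_length(lengths):
    """Return the most frequent fragment length (the maximum peak of the histogram).

    Ties are broken in favor of the shorter length (an arbitrary choice).
    """
    hist = Counter(lengths)
    return min(hist, key=lambda length: (-hist[length], length))


def fraction_in_range(lengths, low=120, high=180):
    """Fraction of fragments with low <= length < high.

    The default window is an assumed proxy for the mononucleosome band.
    """
    lengths = list(lengths)
    if not lengths:
        return 0.0
    inside = sum(1 for length in lengths if low <= length < high)
    return inside / len(lengths)
```

      Comparing the modal length and the mononucleosome fraction between samples gives a simple way to decide whether two data sets were digested to a comparable extent before their occupancy profiles are compared.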

      As the input material are selected, in part gel-purified mono-nucleosomal DNA bands. Furthermore the datasets are not directly comparable, as some use native MNase, while others employ MNase after crosslinking; some involve short digestion times at 37 °C, while others involve longer digestion at lower temperatures. Combining these datasets to support the idea of an MNase-sensitive complex at the TAS of T. brucei therefore may not be appropriate, and additional experiments using consistent methodologies would strengthen the study's conclusions.

      In my opinion, describing an MNase-sensitive complex based solely on these data is not feasible. It requires specifically designed experiments using a consistent method and well-defined MNase digestion kinetics.

      As the reviewer suggests, the ideal experiment would be to perform a time course of the MNase reaction with all the samples in parallel, or to work with a fixed time point while adding increasing amounts of MNase. However, the information obtained from a detailed analysis of the length distribution histogram of the sequenced DNA molecules is the best test of the real outcome. In fact, those samples with different digestion levels were probably not generated on purpose.

      The only data sets that were gel purified are those from Mareé 2017 (Patterton's lab), used in Figures 1, S1 and S2, and those from L. major shown in Fig 1. Gel purification was a common practice during those years; we later learned that it is not necessary, since we can sort fragment sizes in silico when needed.

      As we explained to Reviewer #1, to avoid this conflict, we decided to remove this data from Figures 2 and S3. In summary, the 3 remaining samples come from the same lab and belong to the same publication (Mareé 2022). These samples are the inputs of native MNase-ChIP-seq, obtained the same way, and are fully comparable with each other.

      Reviewer #3 (Significance (Required)):

      Due to the lack of controlled MNase digestion, use of heterogeneous datasets, and absence of benchmarking against previous studies, the conclusions regarding MNase-sensitive complexes and their functional significance remain speculative. With standardized MNase digestion and clearly annotated datasets, this study could provide a valuable contribution to understanding chromatin regulation in TriTryps parasites.

      As we explained in the previous point, our conclusions are valid since in no figure do we compare samples coming from different treatments. The only exception could be Figure 3, which deals with MNase-ChIP-seq. We have now added a clear and explicit comment in that section and in the discussion stating that, despite subtle differences in experimental procedures, we arrive at the same results. This is the case for the T. cruzi IP, run from crosslinked chromatin, compared to the T. brucei IP, run from native chromatin.

      Over the years, it has been observed in the chromatin field that nucleosomes are so tightly bound to DNA that crosslinking is not necessary. However, it is still common practice, especially when performing IPs. In our own hands, we did not observe any difference at the global level either in T. cruzi (unpublished) or in my previous work with yeast (which compared nucleosome organization from crosslinked chromatin MNase-seq inputs, Chereji, Mol Cell, 2017 doi:10.1016/j.molcel.2016.12.009, and native MNase-seq from Ocampo, NAR, 2016 doi: 10.1093/nar/gkw068).

    1. Author response:

      Reviewer #1:

      Comment 1: The authors use a confusing timeline for their behavioral experiments, i.e., day 1 is the first day of training in the MWM, and day 6 is the probe trial, but in reality, day 6 is the first day after the last training day. So this is really day 1 post-training, and day 20 is 14 days post-training.

      We thank this reviewer for pointing out the issue of the behavioral timeline. We will revise the behavioral timeline as suggested by this reviewer. Days 1–5 will be labeled as “Training phase day 1–5”. Day 6 will be labeled as the “Day 1 post-training” and Day 20 will be labeled as the “Day 14 post-training”.

      Comment 2: The authors inaccurately use memory as a term. During the training period in the MWM, the animals are learning, while memory is only probed on day 6 (after learning). Thus, day 6 reflects memory consolidation processes after learning has taken place.

      We will revise the manuscript to distinguish between "learning" and "memory." We will refer to the performance during the 5-day training period as "spatial learning" and restrict the term "memory" to the probe tests on Day 6, which reflect memory processes after learning has taken place.

      Comment 3: The NAT10 cKO mice are useful... but all the experiments used AAV-CRE injections in the dorsal hippocampus that showed somewhat modest decreases... For these experiments, it would be better to cross the NAT10 floxed animals to CRE lines where a better knockdown of NAT10 can be achieved, with less variability.

      We want to clarify the reason for using AAV-Cre injection rather than Cre lines. Indeed, we attempted to generate Nat10 conditional knockouts by crossing Nat10<sup>flox/flox</sup> mice with several CNS-specific Cre lines. Crossing with Nestin-Cre and Emx1-Cre resulted in embryonic and premature lethality, respectively, consistent with the essential housekeeping function of NAT10 during neurodevelopment. We are currently using the Camk2α-Cre line which starts to express Cre after postnatal 3 weeks specifically in hippocampal pyramidal neurons (Tsien et al., 1996).

      Comment 4: Because knockdown is only modest (~50%), it is not clear if the remaining ac4c on mRNAs is due to remaining NAT10 protein or due to an alternative writer (as the authors pose).

      Our results suggest the existence of alternative writers. As shown in Figure 6D, we identified a population of "NAT10-independent" MISA mRNAs (present in MISA but not downregulated in NASA). Remarkably, these mRNAs possess a consensus motif (RGGGCACTAACY) that is fundamentally different from the canonical NAT10 motif (AGCAGCTG). This distinct motif usage suggests that the residual ac4C signals are not merely due to incomplete knockdown of NAT10, but reflect the activity of other, as-yet-unidentified ac4C writers. Nonetheless, we think that generating a Nat10 knockout line with complete loss of NAT10 protein would be useful to address this reviewer’s concern.

      Reviewer #2:

      Comment 1: It is known that synaptosomes are contaminated with glial tissue... So the candidate mRNAs identified by acRIP-seq might also be mixed with glial mRNAs. Are the GO BP terms shown in Figure 3A specifically chosen, or unbiasedly listed for all top ones?

      It is true that some ac4C-mRNAs identified by acRIP-seq from the synaptosomes are highly expressed in astrocytes, such as Aldh1l1, ApoE, Sox9 and Aqp4 (Table S3, Fig. S6H). In agreement, we found that NAT10 was also expressed in astrocytes in addition to neurons. We will show a representative image for the expression of NAT10-Cre in astrocytes in the revised MS. The BP items shown in Fig. 3A were chosen from the top 30 as those highly related to synaptic plasticity and memory. We will show the full list of significant BP items for MISA in the revised MS.

      Comment 2: Where does NAT10-mediated mRNA acetylation take place within cells generally? Is there evidence that NAT10 can catalyze mRNA acetylation in the cytoplasm?

      The previous studies from non-neuronal cells showed that NAT10 can catalyze mRNA acetylation in the cytoplasm and enhance translational efficiency (Arango et al., 2018; Arango et al., 2022). In this study, we showed that mRNA acetylation occurred both in the homogenates and synapses (see ac4C-mRNA lists in Table S2 and S3). However, spatial memory upregulated mRNA acetylation mainly in the synapses rather than in the homogenates (Fig. 2 and Fig. S2).

      Comment 3: "The NAT10 proteins were significantly reduced in the cytoplasm (S2 fraction) but increased in the PSD fraction..." The small increase in synaptic NAT10 might not be enough to cause a decrease in soma NAT10 protein level.

      We showed that the NAT10 protein levels were increased by one-fold in the PSD fraction, but were reduced by about 50% in the cytoplasm after memory formation (Fig. 5J and K). The protein levels of NAT10 in the homogenates and nucleus were not altered after memory formation (Fig. 5F and I). Due to these facts, we hypothesized that NAT10 proteins may have a relocation from cytoplasm to synapses after memory formation, which was also supported by the immunofluorescent results from cultured neurons (Fig. S4). However, we agree with this reviewer that drawing such a conclusion may require the time-lapse imaging of NAT10 protein trafficking in living animals, which is technically challenging at this moment.

      Comment 4: It is difficult to separate the effect on mRNA acetylation and protein acetylation when doing the loss of function of NAT10.

      This is a good point. We agree with this reviewer that NAT10 may acetylate both mRNA and proteins. We examined the acetylation levels of α-tubulin and histone H3, two substrate proteins of NAT10, in the hippocampus of Nat10 cKO mice. As shown in Fig S5C, E, and F, the acetylation levels of α-tubulin and histone H3 remained unchanged in the Nat10 cKO mice, likely due to compensation by other protein acetyltransferases. In contrast, mRNA ac4C levels were significantly decreased in the Nat10 cKO mice (Figure S5G–H). These results suggest that the memory deficits seen in Nat10 cKO mice may be largely due to impaired mRNA acetylation. Nonetheless, we believe that developing a new technology which enables selective erasure of mRNA acetylation would be helpful to address the function of mRNA. We discussed these points in the MS (line 585-592).

      References

      Arango, D., Sturgill, D., Alhusaini, N., Dillman, A. A., Sweet, T. J., Hanson, G., Hosogane, M., Sinclair, W. R., Nanan, K. K., & Mandler, M. D. (2018). Acetylation of cytidine in mRNA promotes translation efficiency. Cell, 175(7), 1872-1886. e1824.

      Arango, D., Sturgill, D., Yang, R., Kanai, T., Bauer, P., Roy, J., Wang, Z., Hosogane, M., Schiffers, S., & Oberdoerffer, S. (2022). Direct epitranscriptomic regulation of mammalian translation initiation through N4-acetylcytidine. Molecular cell, 82(15), 2797-2814. e2711.

      Tsien, J. Z., Chen, D. F., Gerber, D., Tom, C., Mercer, E. H., Anderson, D. J., Mayford, M., Kandel, E. R., & Tonegawa, S. (1996). Subregion-and cell type–restricted gene knockout in mouse brain. Cell, 87(7), 1317-1326.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC-2025-03195R

      Point-by-Point Response to Reviewers

      We thank the reviewers for their thoughtful and constructive evaluations, which have helped us substantially improve the clarity, rigor, and balance of our manuscript. We are grateful for their recognition that our integrated ATAC-seq and RNA-seq analyses provide a valuable and technically sound contribution to understanding soxB1-2 function and regenerative neurogenesis in planarians.

      We have carefully addressed the reviewers' major points as follows:

      1. Direct versus indirect regulation by SoxB1-2: In the revision, we explicitly acknowledge the limitations of inferring direct regulation from our current datasets and have revised statements throughout the Results and Discussion to emphasize that our findings are correlative.
      2. Evidence for pioneer activity: Although the pioneer role of SoxB1 transcription factors is well established in other systems, we agree that additional binding or motif data would be required to formally demonstrate SoxB1-2 pioneer function. Accordingly, we performed motif analysis and revised the text throughout to frame SoxB1-2's proposed role as consistent with, rather than demonstrating, transcriptional activator activity.
      3. Motif enrichment and downstream regulatory interactions: In response to Reviewer #1's suggestion, we have included a new motif enrichment analysis in the supplement to contextualize possible co-regulators within the SoxB1-2 network.
      4. Data reproducibility and peak-calling consistency: We have included sample correlations and peak overlaps for ATAC-seq samples in the revision, providing a clearer assessment of reproducibility.
      5. Clarification of co-expression and downstream targets: We included co-expression plots for soxB1-2 with mecom and castor in the supplemental materials. These plots were generated from previously published scRNA-seq data and demonstrate that cells expressing soxB1-2 also express mecom and castor.

      We appreciate the reviewers' recognition that our methods are rigorous and our data accessible. We have incorporated all major revisions suggested and believe they have strengthened the manuscript's precision, interpretations, and conclusions. Below, we respond to each comment in detail.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary

      The authors of this interesting study take the approach of combining RNAi, RNA-seq and ATAC-seq to try to build a regulatory network surrounding the function of a planarian SoxB1 ortholog, broadly required for neural specification during planarian regeneration. They find a number of chromatin regions that are differentially accessible (measured by ATAC-seq) and associate these with potential genes by proximity to the TSS. They then compare this set of genes with those that are differentially regulated (using RNA-seq) after SoxB1 RNAi-mediated knockdown. This allows the authors to focus on potential directly regulated targets of the planarian SoxB1. Two of these downstream targets, the mecom and castor transcription factors, are then studied in greater detail.

      Major Comments

      I have no suggestions for new experiments that fit sensibly with the scope of the current work. There are other analyses that could be appropriate with the ATAC-seq data, but may not make sense in the content of SoxB1 acting as pioneer factor.

      I would like to see motif enrichment analysis under the set of peaks to see if SoxB1 is opening chromatin for a restricted set of other transcription factors to then bind. Much of this could be taken from Neiro et al, eLife 2022 (which also used ATAC-seq and matched planarian TF families to likely binding motifs). This could add some breadth to the regulatory network. It could be revealing, for example, if downstream TFs also help regulate other targets that SoxB1 makes available; this is a pattern often seen for cell specification (as I am sure the authors are aware). Alternatively, it may reveal other candidate regulators.

      Thank you for this suggestion. We agree with the reviewers that this analysis should be done. We ran the motif enrichment analysis using the same methods as outlined in Neiro et al. eLife, 2022. We have included a new motif enrichment analysis in the supplement to contextualize possible co-regulators within the SoxB1-2 network.

      Overall peak calling consistency with ATAC-sample would be useful to report as well, to give readers an idea of noise in the data. What was the correlation between samples?

      Excellent point. In response to this comment, we ran a Pearson correlation test on biological replicates within the gfp and soxB1-2 RNAi groups to assess the overall correlation between replicates. Additionally, we calculated the percent overlap of peaks between biological replicates and between treatment groups.
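
      For readers who want to reproduce this kind of check, here is a minimal sketch of the two quantities mentioned above: a Pearson correlation between per-region signal vectors of two replicates, and the percentage of peaks in one set that overlap a peak in another. The input layout (lists of counts, and peaks as `(chrom, start, end)` tuples) and the one-base-pair overlap criterion are assumptions made for illustration; the manuscript's actual tools and thresholds are not specified here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def percent_overlap(peaks_a, peaks_b):
    """Percent of peaks in A overlapping any peak in B by at least 1 bp.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    """
    if not peaks_a:
        raise ValueError("peaks_a is empty")
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = sum(
        1
        for chrom, start, end in peaks_a
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, []))
    )
    return 100.0 * hits / len(peaks_a)
```

      Note that percent overlap is asymmetric: the value depends on which replicate is used as the reference set, so it is worth reporting both directions.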

      While it is logical to focus on downregulated genes, it would also be interesting to look at upregulated genes in some detail. In simple terms would we expect to see the representation of an alternate set of fate decisions being made by neoblast progeny?

      This is also an important point that we considered but initially did not pursue it due to the lack of tools to test upregulated gene function. However, the reviewer is correct that this is straightforward to perform computationally. Thus, we have performed Gene Ontology analysis on the upregulated genes in all RNA-seq datasets (soxB1-2 RNAi, mecom RNAi, and castor RNAi). Both mecom and castor datasets did not reveal enrichment within the upregulated portion of the dataset. Genes upregulated after soxB1-2 RNAi were enriched for metabolic, xenobiotic detoxification, potassium homeostasis, and endocytic programs. Rather than indicating a shift toward alternative lineages, including non-ectodermal fates, these signatures are consistent with stress-responsive and homeostatic programs activated following loss of soxB1-2. We did not detect enrichment patterns strongly associated with alternative cell fates. We conclude that this analysis does not formally exclude potential shifts in lineage-specific transcriptional programs, but does support our hypothesis that soxB1-2 functions as a transcriptional activator.

      Can the authors be explicit about whether they have evidence for co-expression of SoxB1/castor and SoxB1/mecom? I could not find this clearly, and it would be important to be clear whether this basic piece of evidence is in place or not at this stage.

      We included co-expression plots for soxB1-2 with mecom and castor in the supplemental material. These plots were generated from previously published scRNA-seq data and demonstrate that cells expressing soxB1-2 also express mecom and castor. We have not done experiments showing co-expression via in situ at this time.

      Minor comments

      Formally, loss of castor and mecom expression does not mean these cells are absent; strictly, cell absence needs an independent method. It might be useful to clarify this with evidence, or be clear that cells are "very probably" not produced.

      We agree that loss of castor and mecom expression does not formally demonstrate the physical absence of these cells, and that independent methods would be required to definitively confirm their loss. In response, we have revised our wording to indicate that castor- and mecom-expressing cells are very likely not being produced, rather than stating that they are absent.

      Reviewer #1 (Significance (Required)):

      Significance

      Strengths and limitations.

      The precise exploitation of the planarian system to identify potential targets, and therefore regulatory mechanisms, mediated by SoxB1 is an interesting contribution to the field. We know almost nothing about the regulatory mechanisms that allow regeneration and how these might have evolved, and this work is a well-executed step in that direction.

      Advance

      The paper makes a clear advance in our understanding of an important process in animals (neural specification) and how this happens in the context of an example of animal regeneration. The methods are state-of-the-art with respect to what is possible in the planarian system.

      Audience

      This will be of wide interest to developmental biologists, particularly those studying regeneration in planarians and other regenerative systems, and those who study comparative neurodevelopment.

      Expertise

      I have expertise in functional genomics in the context of stem cells and regeneration, particularly in the planarian model system

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Review - Cathell, et al (RC-2025-03195)

      Summary and Significance:

      Understanding regenerative neurogenesis has been difficult due to the limited amount of neurogenesis that occurs after injury in most animal species. Planarians, with their adult neurogenesis and robust post-injury response, allow us to get a glimpse into regenerative neurogenesis. The Zayas laboratory previously revealed a key role for SoxB1-2 in maintenance and regeneration of a broad set of sensory and peripheral neurons in the planarian body. SoxB1-2 also has a role in many epidermal fates. Their previous work left open the tempting possibility that SoxB1-2 acts as a very upstream regulator of epidermal and neuronal fates, potentially acting as a pioneer transcription factor within these lineages. In the manuscript currently under review, Cathell and colleagues use ATAC-Seq and RNA-Seq to investigate chromatin changes after SoxB1-2(RNAi). With the experimental limitations in planarians, this is a strong first step toward testing their hypothesis that SoxB1-2 acts as a pioneer within a set of planarian lineages. Beyond these cell types, this work is also important because planarian cell fates often rely on a suite of transcription factors, but the nature of transcription factor cooperation has been much less well understood. Indeed, the authors do show that loss of SoxB1-2 by RNAi causes changes in a number of accessible regions of the genome; many of these chromatin changes correspond to changes in gene expression of genes nearby these peaks. The authors also examine in more detail two genes that have genomic and transcriptomic changes after SoxB1-2(RNAi), mecom and castor. The authors completed RNA-Seq on mecom(RNAi) and castor(RNAi) animals, identifying genes downregulated after loss of either factor that are also seen in SoxB1-2(RNAi). The results in this paper are rigorous and very well presented. I will share two major limitations of the study and some suggestions for addressing them, but this work may also be acceptable without those changes at some journals.

      Limitation 1:

      The paper aims to test the hypothesis that SoxB1-2 is a pioneer transcription factor. Observation that SoxB1-2(RNAi) leads to loss of many accessible regions in the chromatin supports the hypothesis. However, an alternate possibility is that SoxB1-2 leads to transcription of another factor that is a pioneer factor or a chromatin remodeling enzyme; in either of these cases, the accessibility peak changes may not be due to SoxB1-2 directly but due to another protein that SoxB1-2 promotes. The authors describe how they can address this limitation in the future; in the meantime, is it known what the likely binding for SoxB1-2 would be (experimentally or based on homology)? If so, could the authors examine the relative abundance of SoxB1-2 binding sites in peaks that change after SoxB1-2(RNAi)? This could be compared to the abundance of the same binding sequence in non-changing peaks. Enrichment of SoxB1-2 binding sites in ATAC peaks that change after its RNAi would support the argument that chromatin changes are directly due to SoxB1-2.

      We appreciate the feedback and agree that distinguishing between direct SoxB1-2 pioneer activity and indirect effects mediated through downstream regulators is an important consideration. While we did not perform a direct abundance analysis of potential chromatin-remodeling cofactors, we conducted a motif enrichment analysis following the approach of Neiro et al. (eLife, 2022), comparing control and soxB1-2(RNAi) peak sets. This analysis revealed that Sox-family motifs, particularly SoxB1-like motifs, were among the most enriched in regions that remain accessible in control animals relative to soxB1-2(RNAi) animals, consistent with a model in which SoxB1-2 directly contributes to establishing or maintaining accessibility at these loci. We have now included this analysis in the supplemental materials to further contextualize potential co-regulators and transcriptional partners within the SoxB1-2 regulatory network. We agree and acknowledge in the report that future studies assessing chromatin remodeling factor expression and abundance will be valuable to definitively separate direct and indirect pioneer activity.
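
      The comparison the reviewer proposes (motif abundance in peaks that change versus peaks that do not) can be sketched as a simple enrichment test. The code below is a hedged illustration only: it uses an exact-match search for a placeholder consensus string and a one-sided hypergeometric test, whereas the analysis in the manuscript followed the approach of Neiro et al. The motif `ACAAAG`, the function names, and the toy sequences are all hypothetical and are not the planarian SoxB1-2 motif or real data.

```python
from math import comb

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_with_motif(sequences, motif):
    """Number of sequences containing the motif on either strand."""
    rc = reverse_complement(motif)
    return sum(1 for s in sequences if motif in s or rc in s)


def enrichment_p_value(changed, unchanged, motif):
    """One-sided hypergeometric P(X >= k) for motif enrichment.

    k = motif-containing peaks among `changed`; the background is the
    union of `changed` and `unchanged` peak sequences.
    """
    k = count_with_motif(changed, motif)
    n = len(changed)
    big_k = k + count_with_motif(unchanged, motif)
    big_n = n + len(unchanged)
    total = comb(big_n, n)
    tail = sum(
        comb(big_k, i) * comb(big_n - big_k, n - i)
        for i in range(k, min(n, big_k) + 1)
    )
    return tail / total


# Toy sequences, hypothetical and far too few for a real test.
changed_peaks = ["TTACAAAGTT", "CTTTGTAA", "GGGGGG"]
unchanged_peaks = ["GGGG", "CCCC", "ACAAAGG", "TTTT"]
p = enrichment_p_value(changed_peaks, unchanged_peaks, "ACAAAG")
```

      In practice a position weight matrix scan with matched background regions would replace the exact string match, but the logic of comparing changed against unchanged peaks is the same.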

      Limitation 2:

      The characterization of mecom and castor is somewhat preliminary relative to the deep work in the rest of the paper. I think this could be addressed with a few experiments. The authors could validate RNA-seq findings with ISH to show that cells are lost after reduction of either TF (this would support the model figure). The authors could also try to define whether loss of either TF causes behavioral phenotypes that might be similar to SoxB1-2(RNAi); this would be a second line of evidence that the TFs are downstream of key events in the SoxB1-2 pathway.

      Thank you for this suggestion. We agree that additional validation of the mecom and castor RNA-seq results and further phenotypic characterization would strengthen this section. We are currently conducting in situ hybridization experiments to validate transcriptional changes in mecom and castor using the same experimental framework applied to soxB1-2 downstream candidates. We anticipate completing these studies within the next three months and will incorporate the results into future work.

      Regarding behavioral phenotypes, we performed preliminary screening for robust behavioral responses, including mechanosensory responses, but did not observe overt defects. However, the lack of established, standardized behavioral assays in planarians presents a current limitation; such assays need to be developed de novo, and predicting specific behavioral phenotypes in advance remains challenging. We fully agree that functional behavioral assays represent an important next step and are actively exploring strategies to systematically develop and implement them going forward.

      Other questions or comments for the authors:

      Is it known how other Sox factors work as pioneer TFs? Are key binding partners known? I wondered if it would be possible to show that SoxB1-2 is co-expressed with the genes that encode these partners and/or if RNAi of these factors would phenocopy SoxB1-2. This is likely beyond the scope of this paper, but if the authors wanted to further support their argument about SoxB1-2 acting as a pioneer in planarians, this might be an additional way to do it.

      In other systems, Sox pioneer factors often act together with POU family transcription factors (for example, Oct4 and Brn2) and PAX family members such as Pax6. In planarians, a POU homolog (pou-p1) is expressed in neoblasts and may represent an interesting candidate co-factor for future investigation in the context of SoxB1-2 pioneer activity. We have also previously examined the relationship between SoxB1-2 and the POU family transcription factors pou4-1 and pou4-2. Although RNAi of these factors does not fully phenocopy soxB1-2 knockdown, pou4-2(RNAi) results in loss of mechanosensation, suggesting that downstream POU factors may contribute to aspects of neural function regulated by SoxB1-2 (McCubbin et al. eLife 2025). We agree that co-expression and functional interaction studies with these candidates would be highly informative, and we view this as an exciting future direction beyond the scope of the current manuscript.

      This paper is one of few to use ATAC-Seq in planarians. First, I think the authors should make a bigger deal of their generation of a dataset with this tool! Second, it would be great to know whether the ATAC-Seq data (controls and/or RNAi) will be browsable in any planarian databases or in a new website for other scientists. I believe that in addition to the data being used to test hypotheses about planarians, the data could also be a huge hypothesis generating resource in the planarian community, so I would encourage the authors to both self-promote their contribution and make plans to share it as widely and usably as possible.

      Thank you very much for this encouraging feedback. We appreciate the suggestion and have strengthened the text to emphasize the significance of generating this ATAC-seq resource for the planarian field. We agree that these datasets represent a valuable community resource and are committed to making all control and soxB1-2(RNAi) ATAC-seq data publicly accessible.

      Reviewer #2 (Significance (Required)):

      This paper's strengths are that it addresses an important problem in regenerative biology in a rigorous manner. The writing and presentation of the data are excellent. The paper also provides excellent datasets that will be very useful to other researchers in the field. Finally, the work is one of, if not the first, to examine how the action of one transcription factor in planarians leads to changes in the cellular and chromatin environment that could then be acted upon by subsequent factors. This is an important contribution to the planarian field, but also one that will be useful for other developmental neuroscientists and regenerative biologists.

      I described a couple of limitations in the review above, but the strengths outweigh the weaknesses.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The authors investigated the role of soxB1-2 in planarian neural and epidermal lineage specification. Using ATAC-seq and RNA-seq from head fragments after soxB1-2 RNAi, they identified regions of decreased chromatin accessibility and reduced gene expression, demonstrating that soxB1-2 induces neural and sensory programs. Integration of the datasets yielded 31 overlapping candidate targets correlating ATAC-seq and RNA-seq. Downstream analyses of transcription factors that had either/or differentially accessible regulatory region or showed differential expression (castor and mecom) implicated these transcription factors in mechanosensory and ciliary modules. The authors combined additional techniques, such as in situ hybridization to support the observations based on the ATACseq/RNAseq data. The manuscript is clearly written as well as data presentation in the main and supplementary figures. The major claim of the manuscript is that SoxB1-2 is likely a pioneer transcription factor that alters the accessibility of the chromatin, which if true, would be one of the first demonstrations of direct transcriptional regulation in planarians. As described below, I am not certain that this interpretation of the data is more valid than alternative interpretations.

      Major comments

      1. Direct vs. indirect regulation. The current analysis does not distinguish between direct and indirect soxB1-2 targets; therefore, this analysis cannot indicate whether soxB1-2 functions as a pioneer transcription factor. ATAC-seq and RNA-seq, as performed here, do not determine whether reduced accessibility or downregulation of gene expression represents a change within existing cells or a reduction in the proportion of specific cell types in the libraries produced. This limitation should be explicitly recognized where causal statements are made. In fact, several pieces of information strongly suggest that indirect effects are abundant in the data: (1) the observed loss of accessibility and gene expression in late epidermal progenitors likely represents indirect effects, indicating that within the timeframe of the experiment, it is impossible (using these techniques) to distinguish between the scenarios. (2) The finding that castor knockdown reduces soxB1-2 expression likely reflects population loss rather than direct regulation, given overlapping expression domains. This further illustrates the difficulty in inferring directionality from such datasets. In order to provide evidence for a more direct association between soxB1-2 and the differentially accessible chromatin regions, a sequence (e.g., motif) analysis would be required. Other approaches to infer direct regulation would have been useful, but they are not available in planarians to the best of my knowledge.

      We agree that distinguishing between direct SoxB1-2 pioneer activity and indirect chromatin changes mediated by downstream factors is an important consideration. As suggested, examining the enrichment of SoxB1-2 binding motifs in regions that lose accessibility following soxB1-2(RNAi) can provide supporting evidence for direct regulation.

      While we did not conduct a direct abundance analysis of all potential chromatin-remodeling cofactors, we performed a motif enrichment analysis following the methodology of Neiro et al. (eLife, 2022), comparing control-specific and soxB1-2(RNAi)-specific accessible peak sets. Consistent with a direct role for SoxB1-2 in chromatin regulation, Sox-family motifs, particularly SoxB1-like motifs, were among the most significantly enriched in regions that maintain accessibility in control animals relative to soxB1-2(RNAi) animals.

      Evidence for pioneer activity. The authors correctly acknowledge that they do not present direct evidence of soxB1-2 binding or chromatin opening. However, the section title in the Discussion could be interpreted as implying otherwise. The claim of pioneer activity should remain explicitly tentative until supported (at least) by motif or binding data.

      We have performed the suggested motif analysis and changed the language in this section to better fit the data.

      Replication and dataset comparability. Both ATAC-seq and soxB1-2 RNA-seq were performed on head fragments, but the number of replicates differ between assays (ATAC-seq n=2 per group, RNA-seq n=4-6). This is of course acceptable, but when interpreting the results, it should be taken into consideration that the statistical power is different when using data collected using different techniques and having a varied number of replicates.

      Thank you for raising this important point regarding replication and comparability across datasets. We agree that the differing number of biological replicates between the ATAC-seq and RNA-seq experiments results in different statistical power across assays. We have now clarified this consideration in the manuscript text.

      Minor comments

      "Thousands of accessible chromatin sites". Please state the number of peaks and the thresholds for calling them. Ensure consistency between text (264 DA peaks) and Figure 1 legend (269 DA peaks).

      We have clarified specific peak numbers and will include the calling parameters in the methods section. Additionally, we will fix the discrepancy in the number of differentially accessible peaks between the text and the Figure 1 legend.

      Specify the y-axis normalization units in all coverage plots.

      We have specified this across plots.

      Clarify replicate numbers consistently in the text and figure legends.

      We have identified and corrected discrepancies between the figure legends and the text, and ensured that replicate numbers are reported consistently across datasets.

      Referees cross commenting

      The reviews are highly consistent. They recognize the value of the work, and raise similar points. The main shared view is that the current data do not distinguish direct from indirect effects, and claims about pioneer activity should be softened, and further analysis of the differentially accessible peaks could strengthen the link between SoxB1-2 and the chromatin changes.

      -I don't think that it's necessary to further characterize experimentally mecom or castor (as suggested), but of course that it could have value.

      We thank all three reviewers for their positive assessment of the value of our work aiming to elucidate mechanisms by which SoxB1-2 programs planarian stem cells. In the revision, we have improved the presentation and carefully edited conclusions about the function of SoxB1-2. Performing motif analysis and GO annotation of upregulated genes has strengthened our observation that SoxB1-2 acts as an activator and has revealed putative binding sites.

      The preliminary revision does not yet include further characterization of mecom and castor downstream genes. In response to Reviewer #2, we appreciate that additional validation of the mecom and castor RNA-seq results and further phenotypic characterization would strengthen this section. Although we are currently conducting in situ hybridization experiments to validate transcriptional changes in mecom and castor using the same experimental framework applied to soxB1-2 downstream candidates, we also reconsidered, as we did in our first revision, whether this is necessary or better suited for future investigations.

      In the revision, we noted that our Discussion points were not balanced and that we emphasized the mecom and castor results in a manner that distracted from the major focus of the work, likely contributing to the impression that additional experimental evidence was required. Therefore, we have revised the section accordingly and streamlined the Discussion to avoid repetitive statements and to focus on the insights gained into the mechanism of SoxB1-2 function in planarian neurogenesis. We remain open to including these additional experiments if the reviewers or handling editors consider them essential; however, we agree that their inclusion is not absolutely necessary.

      Reviewer #3 (Significance (Required)):

      General assessment. The study offers valuable observations by combining chromatin and transcriptional analysis of planarian neural differentiation. The integration with in situ validation convincingly demonstrates effects on neural tissues and provides a solid resource for future functional work. However, mechanistic interpretation remains limited, partly because of technical limitations of the system. The data support an important role for soxB1-2 in neural and epidermal lineage regulation, but not direct binding or chromatin-opening activity. The authors have previously published analysis of soxB1-2 in planarians, so the addition of ATAC-seq data contributes to solving another piece of the puzzle.

      Advance.

      This is one of the first studies to couple ATAC-seq and RNA-seq in planarian tissue to dissect regulatory logic during regeneration. It identifies new candidate regulators of sensory and epidermal differentiation and identifies soxB1-2 as a likely upstream factor in ectodermal lineage networks. The work extends previous studies on soxB1-2 activity and neural cell production by integrating chromatin and transcriptional layers. In that respect the results are very solid, although the study remains correlative at the mechanistic level.

      Audience.

      This work will potentially interest researchers interested in regeneration and transcriptional networks. The datasets and gene lists will be valuable references for follow-up studies on planarian ectodermal lineages, and therefore will appeal to this community.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, the authors employed fast MAS NMR spectroscopy to investigate the gel aggregation of longer repeat (48×) RNAs, revealing inherent folding structures and interactions (i.e., G-quadruplex and duplex). The dynamic structure of the RNA gel was not resolved at high resolution, and only the structural features-namely, the coexistence of G-quadruplexes and duplexes-were inferred. The 1D and 2D NMR spectra were not assigned to specific atomic positions within the RNA, which makes it difficult to perform molecular dynamics (MD) modeling to elucidate the dynamic nature of the RNA gel. The following comments are provided for the authors' consideration:

      Reviewer #1, Comment 1:

      Figure 2E and Figure 3A: The data suggest that Ca²⁺ promotes stronger G-quadruplex formation within the RNA gel compared with Mg²⁺. This observation is somewhat puzzling, as Mg²⁺ is generally known to stabilize G-quadruplex structures. The authors should clarify this discrepancy.

      Response: Mg²⁺ is also a stabilizer of double-stranded RNA. In most cases, Mg²⁺ stabilizes RNA duplexes more strongly than it stabilizes G-quadruplexes. When Mg²⁺ is replaced with Ca²⁺, the RNA duplex is destabilized more than the G4 structures. We have added a clarification on this point to the Conclusions section.

      Reviewer #1, Comment 2:

      Figures 2 and 3: The authors use the chemical shift at δN 144.1 ppm to distinguish between G-quadruplex and duplex structures. How was the reliability of this assignment evaluated? Chemical shifts of RNA atoms can be influenced by various factors such as intermolecular interactions, conformational stress, and local chemical environment, not only by higher-order structures. This point should be substantiated by citing relevant references or by analyzing additional RNA structures exhibiting δN 144.1 ppm signals using NMR spectroscopy.

      Response: The assignment was made by comparing the chemical shifts with published data and by comparing the obtained spectra with existing datasets in the lab. We have added an explanation to the Results section and cited the literature. The 144.1 ppm was an illustrative value selected for guiding the discussion and we noted that it could sound too specific. We modified Figure 2 to outline the regions of chemical shifts in accordance with our interpretation of spectra.

      Reviewer #1, Comment 3:

      The authors state that "Our findings demonstrate that fast MAS NMR spectroscopy enables atomic-resolution monitoring of structural changes in GGGGCC repeat RNA of physiological lengths." This claim appears overstated, as no molecular model was constructed to define atomic coordinates based on NMR restraints.

      Response: We agree and we have rewritten the conclusions to be more precise in wording. The new text does not mention “atomic-resolution” anymore.

      Reviewer #1, Comment 4: Figure 3B: The experiment using nuclear extracts supplemented with Mg²⁺ to study RNA aggregation via 2D NMR may not accurately reflect intracellular conditions. It would be informative to perform a parallel experiment using nuclear extracts without additional Mg²⁺ to better simulate the native environment for RNA folding.

      Response: We agree that we have not yet approached physiological conditions and that it would be interesting to obtain data at physiological Mg²⁺ concentrations, in the range of 0.5 mM to 1 mM. The buffer of the purchased nuclear extracts does not contain MgCl2, so some MgCl2 would still need to be added. In our opinion, nuclear extracts are not the optimal way forward, since they still differ from the real in-cell environment and their composition is not well controlled. Full reconstitution with recombinant proteins might be a better approach because stoichiometry can be better regulated.

      Reviewer #1 (Significance (Required)): In this manuscript, the authors employed fast MAS NMR spectroscopy to investigate the gel aggregation of longer repeat (48×) RNAs, revealing inherent folding structures and interactions (i.e., G-quadruplex and duplex). The dynamic structure of the RNA gel was not resolved at high resolution, and only the structural features-namely, the coexistence of G-quadruplexes and duplexes-were inferred. The 1D and 2D NMR spectra were not assigned to specific atomic positions within the RNA, which makes it difficult to perform molecular dynamics (MD) modeling to elucidate the dynamic nature of the RNA gel.

      Response: We agree that constraints for molecular dynamics cannot be derived from these data. The focus of this work is methodological: to demonstrate how 1H-15N 2D correlation spectra can be used to characterize G-G pairing in RNA gels directly. Such spectra could be used to study effects of small molecules or interacting proteins for example.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)): The manuscript by Kragelj et al. has the potential to become a valuable study demonstrating the role and power of modern solid-state NMR spectroscopy in investigating molecular assemblies that are otherwise inaccessible to other structural biology techniques. However, due to poor experimental execution and incomplete data interpretation, the manuscript requires substantial revision before it can be considered for publication in any journal.

      Reviewer #2, Major Concern: Inspection of the analytical gels of the transcribed RNA clearly shows that the desired RNA product constitutes only about 10% of the total crude transcript. The RNA must therefore be purified, for example by preparative PAGE, before performing any NMR or other biophysical studies. As it stands, all spectra shown in the figures represent a combined signal of all products in the crude mixture rather than the intended 48 repeat RNA. Consequently, all analyses and conclusions currently refer to a heterogeneous mixture of transcripts rather than the specific target RNA.

      Response: The estimate of 10% 48xG4C2 on the gel is an overstatement. While multiple bands are visible, they correspond to dimers or multimers of the 48xG4C2 RNA. Transcripts longer than 48xG4C2 cannot occur under our transcription conditions. Bands at lower masses than expected are folded RNA. The high repeat length and the presence of Mg²⁺ during transcription promote multimerization, which is not fully reversed by denaturation in urea. If shorter transcripts had arisen from early termination, they would still be substantially longer than 24 repeats based on what is visible on the gel and would thus remain within the pathological length range. Therefore, the observed NMR spectra primarily report on 48 repeat lengths.

      Reviewer #2, Specific Comments 1: The statements: "We show that a technique called NMR spectroscopy under fast Magic Angle Spinning (fast MAS NMR) can be used to obtain structural information on GGGGCC repeat RNAs of physiological lengths. Fast MAS NMR can be used to obtain structural information on biomolecules regardless of their size." on page 1 are not entirely correct. Firstly, not only fast MAS NMR but MAS NMR in general can provide structural information on biomolecules regardless of their size. Fast MAS primarily allows for ¹H-detected experiments, improves spectral resolution, and reduces the required sample amount. Conventional ¹³C-detected solid-state MAS NMR can provide very similar structural information. A more thorough review of relevant literature could help address this issue.

      Response: We have clarified the distinction between MAS NMR and Fast MAS NMR in the introduction.

      Reviewer #2, Specific Comments 2: Secondly, MAS NMR has already been applied to systems of comparable complexity - for instance, the (CUG)₉₇ repeat studied by the Goerlach group as early as 2005. That work provided a comprehensive structural characterization of a similar molecular assembly. The authors are strongly encouraged to cite these studies (e.g., Riedel et al., J. Biomol. NMR, 2005; Riedel et al., Angew. Chem., 2006).

      Response: We added a mention of that study in the introduction.

      Reviewer #2, Experimental Description 1: The experimental details are poorly documented and need to be described in sufficient detail for reproducibility. Specifically: 1. What was the transcription scale? What was the yield (e.g., xx mg RNA per 1 mL transcription reaction)?

      Response: Between 3.5 mg and 4.5 mg per 10 ml transcription reaction. We’ve added this information to the methods.

      Reviewer #2, Experimental Description 2: 2. Why was the transcription product not purified? Dialysis only removes small molecules, while all macromolecular impurities above the cutoff remain. What was the dialysis cutoff used?

      Response: RNA was purified using dialysis and phenol-chloroform precipitation. We have added the information about molecular weight cutoff for dialysis membranes to the methods.

      Reviewer #2, Experimental Description 3: 3. How much RNA was used for each precipitation experiment? Were the amounts normalized? For example, if 10 mg of pellet were obtained, what fraction of that mass corresponded to RNA? Was this ratio consistent across all samples?

      Response: In the test gel formations, we used 180.0 µg of RNA per condition. We used 108.0 µg of RNA for the gelation test in the presence of nuclear extracts. We have not determined the water content of the gels. We added this information to the methods and results sections.

      Reviewer #2, Experimental Description 4: 4. Why is there a smaller amount of precipitate when nuclear extract (NE) or CaCl₂ is added?

      Response: The apparent difference in pellet size may reflect variations in water content rather than RNA quantity. While Figure 1 might invite a direct comparison of pellet weights across the different ion series, our primary goal was to determine the minimal divalent-ion concentrations required to reproducibly obtain gels. We have added a clarification in the Results section and in the Figure 1 caption regarding the comparability of conditions.

      Reviewer #2, Experimental Description 5: 5. The authors should describe NE addition in more detail: What is the composition of NE? What buffer was used (particularly Mg²⁺ and salt concentrations)? Was a control performed with NE buffer-type alone (without NE)?

      Response: We have added the full description of NE buffer to the methods section. Its composition is: 40 mM Tris pH 8.0, 100 mM KCl, 0.2 mM EDTA, 0.5 mM PMSF, 0.5 mM DTT, 25 % glycerol. After mixing the nuclear extract with RNA, the target buffer was: 20 mM Tris pH 8.0, 90 mM KCl, 0.1 mM EDTA, 0.25 mM PMSF, 0.75 mM DTT, 12.5% glycerol, and 10 mM MgCl2.

      We have not performed a control with NE buffer-type alone but we confirmed separately that glycerol does not affect gel formation.
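
      The buffer values quoted in this response can be checked with simple mixing arithmetic. The sketch below assumes equal-volume (1:1) mixing of nuclear extract and RNA solution, which is an inference from the halved glycerol, PMSF, and EDTA values and is not stated by the authors. The RNA-side concentrations it prints are back-calculated under that assumption and are therefore hypothetical.

```python
# Nuclear extract (NE) buffer as stated in the response.
ne_buffer = {
    "Tris (mM)": 40.0,
    "KCl (mM)": 100.0,
    "EDTA (mM)": 0.2,
    "PMSF (mM)": 0.5,
    "DTT (mM)": 0.5,
    "glycerol (%)": 25.0,
    "MgCl2 (mM)": 0.0,  # the response states the NE buffer contains no MgCl2
}

# Target buffer after mixing, as stated in the response.
final_buffer = {
    "Tris (mM)": 20.0,
    "KCl (mM)": 90.0,
    "EDTA (mM)": 0.1,
    "PMSF (mM)": 0.25,
    "DTT (mM)": 0.75,
    "glycerol (%)": 12.5,
    "MgCl2 (mM)": 10.0,
}


def implied_rna_side(ne, final, ne_fraction=0.5):
    """Back-calculate the RNA-side concentrations.

    Uses c_final = f * c_NE + (1 - f) * c_RNA, where f is the NE volume
    fraction (assumed 0.5 here, i.e. 1:1 mixing).
    """
    return {
        name: round((final[name] - ne_fraction * ne[name]) / (1.0 - ne_fraction), 6)
        for name in final
    }


rna_side = implied_rna_side(ne_buffer, final_buffer)
for name, value in rna_side.items():
    print(f"{name}: {value}")
```

      Under the 1:1 assumption, components present only in the NE buffer come out as zero on the RNA side, while KCl, DTT, and MgCl2 must be supplied in part by the RNA solution to reach the stated targets.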

      Reviewer #2, Experimental Description 6: 6. How much pellet/RNA material was actually packed into each MAS rotor?

      Response: Starting with a 5 mg pellet, we packed a rotor with a volume of 3 µl. We added this information to the methods section.

      Reviewer #2, Additional Clarifications: P5. What is meant by "selective" in the phrase "We recorded a selective 1D-¹H MAS NMR spectrum of 48×G₄C₂ RNA gels"?

      Response: That was a typo. We meant imino-selective. It is now corrected.

      Reviewer #2, Additional Clarifications: There are also several contradictions between statements in the text and the corresponding figures. For example: • Page 4: The authors write that "The addition of at least 5 mM Mg²⁺ was required for significant 48×G₄C₂ aggregation." However, Figure 1E shows significant aggregation already at 3 mM MgCl₂ (NE−), and in samples containing NE, aggregation appears even at 1 mM MgCl₂. Was aggregation already present in the sample containing NE but without any added MgCl₂?

      Response: We changed the text in the results section to align more closely with what is depicted in the figure. Some aggregation was present in the nuclear extracts, but it differed in quantity and quality. We clarified this in the results section.

      Reviewer #2 (Significance (Required)): The manuscript by Kragelj et al. has the potential to become a valuable study demonstrating the role and power of modern solid-state NMR spectroscopy in investigating molecular assemblies that are otherwise inaccessible to other structural biology techniques.

      In its current form, the manuscript has significant experimental concerns - particularly the lack of RNA purification and inadequate description of materials and methods. The data therefore cannot support the conclusions presented. I recommend extensive revision and repetition of the experiments using purified RNA material before further consideration for publication.

      Response: We’ve addressed the concerns about RNA purification within the response to the first comment (Major concern).

      Reviewer #3 (Evidence, reproducibility and clarity (Required)): This is an interesting manuscript reporting evidence for formation of both hairpins and G-quadruplexes within RNA aggregates formed by ALS expansion repeats (GGGGCC)n. This is in line with literature but never directly confirmed. Given the novelty of the method (NMR magic angle) and of the data (NMR on aggregate), I believe this manuscript should be considered for publication. I also trust the methods are appropriately reported and reproducible.

      Below are my main points:

      Major points:

      Reviewer #3, Comment 1: 1) RNA aggregation of the GGGGCCn repeat has been reported for expansion as short as 6-8 repeats (see Raguseo et al. Nat Commun 2023), so the authors might not see aggregation under the conditions they use for these shorter repeats but this can happen under physiological conditions. The ionic strengths and the conditions used can vary heavily the phase diagram and the authors therefore should tone down significantly their conclusions. They characterise one aggregate that is likely to contain both secondary structures under the conditions used (in terms of ion and pHs). However, it has been shown in Raguseo et al that aggregates can arise by both intermolecular G4s and hairpins (or a mixture of them) depending on the ionic conditions used. This means that what the authors report might not be necessarily relevant in cells, which should be caveated in the manuscript.

      Response: We toned down our statements regarding aggregation of shorter repeats in the introduction. We added the citation to Raguseo et al. Nat Commun 2023, which indeed provides useful insights about aggregation of GGGGCC repeats. In Supplementary Figure 1, we had data on gel formation with 8x and 24x repeats, which showed that these repeat lengths form gels to some extent. We had oversimplified our conclusion by stating that there were no aggregates, which needed correction, especially since other studies in the literature have observed in vitro aggregation of these repeat lengths. We modified the results section to reflect this nuance.

      Reviewer #3, Comment 2: 2) It would be important to perform perturbation experiments that might promote/disrupt formation of the G4 or hairpin and see if this affects RNA aggregation, which has been already reported by Raguseo et al, and whether this can be appreciated spectroscopically in their assay. This can be done by taking advantage of some of the experiments reported in the manuscript mentioned above, such as: PDS treatment (favouring monomolecular G4s and preventing aggregation), Li vs K treatment (favouring hairpin over G4s), NMM photo-oxidation (disassembling G4s) or addition of ALS relevant RNA binding proteins (i.e. TDP-43). Not all of these controls need to be performed but it would be good to reconcile how the fraction of G4 vs hairpin reflect aggregates' properties, since the authors offer such a nice technique to measure this.

      Response: We appreciate the reviewer’s suggestions and we would be eager to do the perturbation experiments in the future. However, these experiments would require additional optimization and waiting for approval and availability of measurement time on a high-field NMR spectrometer. Given that the primary goal of this manuscript is reporting on the methodological approach, we think the current data adequately demonstrate the technique’s utility.

      Reviewer #3, Comment 3: 3) I disagree with the speculation of the monomolecular G4 being formed within the condensates, as the authors have no evidence to support this. It has been shown that n=8 repeat forms multimolecular G4s that are responsible of aggregation, so the authors need to provide direct evidence to support this hypothesis if they want to keep it in the manuscript, as it would clash with previous reports (Raguseo et al Nat Commun 2023)

      Response: We agree that multimolecular G4s contribute to aggregation in our 48xG4C2 gels. We also realized, after reading this comment, that the original presentation of data and schematics may have unintentionally suggested the presence of monomolecular G4 in our RNA gels. To address this, we have added a clarification to the results section, modified Figures 2 and 3, and included a new Supplementary Figure 4. For clarification, both multimolecular and monomolecular G4s in model oligonucleotides produce imino 1H and 15N chemical shifts in the same region and cannot be distinguished by the experiments used in our study. Based on the observations reported in the literature, we believe that G4s in 48xG4C2 form primarily intermolecularly, although direct experimental proof is not available with the present data.

      Minor points:

      Reviewer #3, Comment 4: 4) An obvious omission in the literature is Raguseo et al Nat Commun 2023, extensively mentioned above. Given the relevance of the findings reported in this manuscript for this study, this should be appropriately referenced for clarity.

      Response: We’ve added the citation to Raguseo et al Nat Commun 2023 to the introduction where in vitro aggregation is discussed.

      Reviewer #3, Comment 5: 5) The schematic in Figure 3 is somehow confusing and the structures reported and how they relate to aggregate formation is not clear. Given that in structural studies presentation and appearance is everything, I would strongly recommend to the authors to improve the clarity of the schematic for the benefit of the readers.

      Response: We thank you for your comment. We’ve modified the figure, and we hope it is now clearer.

      Providing that the authors can address the criticisms raised, I would be supportive of publication of this fine study.

      Reviewer #3 (Significance (Required)):

      The main strength of this paper is to provide direct evidence of DNA secondary structure formation within aggregates, which is something that has not been done before. This is important as it reconcile with the relevance of hairpin formation for the disease (reported by Disney and co-workers) and the relevance of G4-formation in the process of aggregation through multimolecular G4-formation (reported by Di Antonio and co-workers). Given the significance of the findings in this context and the novelty of the method applied to the study of RNA aggregation, this reviewer is supportive for publication of this manuscript and of its relevance to the field. I would be, however, more careful in the conclusions reported and would add additional controls to strengthen the conclusions.

      Response: We thank the reviewer for the comment. In the conclusion section, we have added a statement highlighting the potential roles of both double-stranded and G4 structures in gel formation, in line with what has been reported in previous studies.

    1. Author response:

      A major point all three reviewers raise is that the ‘human-AI collaboration’ in our experiment may not be true collaboration (as the AI does not classify images per se), but is only implied. The reviewers pointed out that whether participants were genuinely engaged in our experimental task is currently not sufficiently addressed. We plan to address this issue in the revised manuscript by including results from a brief interview we conducted with each participant after the experiment, which asked about the participant’s experience and decision-making processes while performing the task. We also measured the participants’ propensity to trust in AI via a questionnaire before and after the experiment. The questionnaire and interview results will allow us to describe the involvement of our participants in the task more accurately. In addition, we will conduct further analyses of the behavioural data (e.g., response times) to show that participants genuinely completed the experimental task. Finally, we will work to sharpen our language and conclusions in the revised manuscript, following the reviewers’ recommendations.

      Reviewer #1:

      Summary:

      In the study by Roeder and colleagues, the authors aim to identify the psychophysiological markers of trust during the evaluation of matching or mismatching AI decision-making. Specifically, they aim to characterize through brain activity how the decision made by an AI can be monitored throughout time in a two-step decision-making task. The objective of this study is to unfold, through continuous brain activity recording, the general information processing sequence while interacting with an artificial agent, and how internal as well as external information interact and modify this processing. Additionally, the authors provide a subset of factors affecting this information processing for both decisions.

      Strengths:

      The study addresses a wide and important topic of the value attributed to AI decisions and their impact on our own confidence in decision-making. It especially questions some of the factors modulating the dynamical adaptation of trust in AI decisions. Factors such as perceived reliability, type of image, mismatch, or participants' bias toward one response or the other are very relevant to the question in human-AI interactions.

      Interestingly, the authors also question the processing of more ambiguous stimuli, with no real ground truth. This gets closer to everyday life situations where people have to make decisions in uncertain environments. Having a better understanding of how those decisions are made is very relevant in many domains.

      Also, the method for processing behavioural and especially EEG data is overall very robust and is what is currently recommended for statistical analyses for group studies. Additionally, authors provide complete figures with all robustness evaluation information. The results and statistics are very detailed. This promotes confidence, but also replicability of results.

      An additional interesting method aspect is that it is addressing a large window of analysis and the interaction between three timeframes (evidence accumulation pre-decision, decision-making, post-AI decision processing) within the same trials. This type of analysis is quite innovative in the sense that it is not yet a standard in complex experimental designs. It moves forward from classical short-time windows and baseline ERP analysis.

      We appreciate the constructive appraisal of our work.

      Weaknesses:

      R1.1. This manuscript raises several conceptual and theoretical considerations that are not necessarily answered by the methods (especially the task) used. Even though the authors propose to assess trust dynamics and violations in cooperative human-AI teaming decision-making, I don't believe their task resolves such a question. Indeed, there is no direct link between the human decision and the AI decision. They do not cooperate per se, and the AI decision doesn't seem, from what I understood to have an impact on the participants' decision making. The authors make several assumptions regarding trust, feedback, response expectation, and "classification" (i.e., match vs. mismatch) which seem far stretched when considering the scientific literature on these topics.

      This issue is raised by the other reviewers as well. The reviewer is correct that the AI does not classify images and that the AI response depends on the participant’s choice (it agrees in 75% of trials and disagrees in 25%). Importantly, though, participants were briefed before and during the experiment that the AI performs its own independent image classification and that human input is needed to assess how well that classification works. That is, participants were led to believe that a genuine, independent AI image classifier was used in this experiment.

      Moreover, the images we presented in the experiment were taken from previous work by Nightingale & Farid (2022). This image dataset includes ‘fake’ (AI generated) images that are indistinguishable from real images.

      What matters most for our work is that the participants were truly engaging in the experimental task; that is, they were genuinely judging face images, and they were genuinely evaluating the AI feedback. There is strong indication that this was indeed the case. We conducted and recorded brief interviews after the experiment, asking our participants about their experience and decision-making processes. The questions are as follows:

      (1) How did you make the judgements about the images?

      (2) How confident were you about your judgement?

      (3) What did you feel when you saw the AI response?

      (4) Did that change during the trials?

      (5) Who do you think was correct?

      (6) Did you feel surprised at any of the AI responses?

      (7) How did you judge what to put for the reliability sliders?

      In our revised manuscript we will conduct additional analyses to provide detail on participants’ engagement in the task, both in judging the face images and in considering the AI feedback. In addition, we will investigate the EEG signal and response times to check for effects that carry over between trials. We will also frame our findings more carefully, taking the scientific literature into account.

      Nightingale SJ, and Farid H. "AI-synthesized faces are indistinguishable from real faces and more trustworthy." Proceedings of the National Academy of Sciences 119.8 (2022): e2120481119.
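
      The feedback scheme described in this response (the displayed AI answer agrees with the participant on 75% of trials and disagrees on 25%) can be sketched as follows. This is a minimal illustration, not the authors' experiment code: the function names, the "real"/"fake" labels, and the choice of a fixed-proportion shuffled schedule (rather than an independent draw on each trial) are all assumptions.

```python
import random


def mismatch_schedule(n_trials, rng, p_mismatch=0.25):
    """Return a shuffled list of booleans; True marks a mismatch trial.

    Assumes a fixed proportion of mismatch trials per block. The response
    does not state whether mismatches were fixed per block or drawn
    independently on each trial.
    """
    n_mismatch = round(n_trials * p_mismatch)
    schedule = [True] * n_mismatch + [False] * (n_trials - n_mismatch)
    rng.shuffle(schedule)
    return schedule


def displayed_ai_label(participant_choice, is_mismatch):
    """Derive the label shown as the 'AI' answer from the participant's choice."""
    if participant_choice not in ("real", "fake"):
        raise ValueError("participant_choice must be 'real' or 'fake'")
    if not is_mismatch:
        return participant_choice
    return "fake" if participant_choice == "real" else "real"


# Example block of 8 trials with a seeded generator for reproducibility.
rng = random.Random(0)
schedule = mismatch_schedule(8, rng)
choices = ["real", "fake", "real", "real", "fake", "fake", "real", "fake"]
for choice, is_mismatch in zip(choices, schedule):
    print(choice, "->", displayed_ai_label(choice, is_mismatch))
```

      Because the displayed label is computed from the participant's own response, the 'AI' never evaluates the image itself, which is the point the reviewers raise.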

      R1.2. Unlike what is done for the data processing, the authors have not managed to take the big picture of the theoretical implications of their results. A big part of this study's interpretation aims to have their results fit into the theoretical box of the neural markers of performance monitoring.

      We indeed used primarily the theoretical box of performance monitoring and predictive coding, since the make-up of our task is similar to a more classical EEG oddball paradigm. In our revised manuscript, we will re-frame and address the link of our findings with the theoretical framework of evidence accumulation and decision confidence.

      R1.3. Overall, the analysis method was very robust and well-managed, but the experimental task they have set up does not allow to support their claim. Here, they seem to be assessing the impact of a mismatch between two independent decisions.

      Although the human and AI decisions are independent in the current experiment, the EEG results still shed light on the participant’s neural processes, as long as the participant considers the AI’s decision and believes it to be genuine. An experiment in which both decisions carry effective consequences for the task and the human-AI cooperation would be an interesting follow-up study.

      Nevertheless, this type of work is very important to various communities. First, it addresses topical concerns associated with the introduction of AI in our daily life and decisions, but it also addresses methodological difficulties that the EEG community has been having to move slowly away from the static event-based short-timeframe analyses onto a more dynamic evaluation of the unfolding of cognitive processes and their interactions. The topic of trust toward AI in cooperative decision making has also been raised by many communities, and understanding the dynamics of trust, as well as the factors modulating it, is of concern to many high-risk environments, or even everyday life contexts. Policy makers are especially interested in this kind of research output.

      Reviewer #2:

      Summary:

      The authors investigated how "AI-agent" feedback is perceived in an ambiguous classification task, and categorised the neural responses to this. They asked participants to classify real or fake faces, and presented an AI-agent's feedback afterwards, where the AI-feedback disagreed with the participants' response on a random 25% of trials (called mismatches). Pre-response ERP was sensitive to participants' classification as real or fake, while ERPs after the AI-feedback were sensitive to AI-mismatches, with stronger N2 and P3a&b components. There was an interaction of these effects, with mismatches after a "Fake" response affecting the N2 and those after "Real" responses affecting P3a&b. The ERPs were also sensitive to the participants' response biases, and their subjective ratings of the AI agent's reliability.

      Strengths:

      The researchers address an interesting question, and extend the AI-feedback paradigm to ambiguous tasks without veridical feedback, which is closer to many real-world tasks. The in-depth analysis of ERPs provides a detailed categorisation of several ERPs, as well as whole-brain responses, to AI-feedback, and how this interacts with internal beliefs, response biases, and trust in the AI-agent.

      We thank the reviewer for their time in reading and reviewing our manuscript.

      Weaknesses:

      R2.1. There is little discussion of how the poor performance (close to 50% chance) may have affected performance on the task, such as by leading to entirely random guessing or overreliance on response biases. This can change how error-monitoring signals present, as they are affected by participants' accuracy, as well as affecting how the AI feedback is perceived.

      The images were chosen from a previous study (Nightingale & Farid, 2022, PNAS) that looked specifically at performance accuracy and also found levels around 50%. Hence, ‘fake’ and ‘real’ images are indistinguishable in this image dataset. Our findings agree with the original study.

      Judging from the brief interviews after the experiment (see answer to R1.1.), all participants were actively and genuinely engaged in the task; hence, it is unlikely that they pressed buttons at random. As mentioned above, we will include a formal analysis of the interviews in the revised manuscript.

      The response bias might indeed play a role in how participants responded, and this might be related to their initial propensity to trust in AI. We have questionnaire data available that might shed light on this issue: before and after the experiment, all participants answered the following questions with a 5-point Likert scale ranging from ‘Not True’ to ‘Completely True’:

      (1) Generally, I trust AI.

      (2) AI helps me solve many problems.

      (3) I think it's a good idea to rely on AI for help.

      (4) I don't trust the information I get from AI.

      (5) AI is reliable.

      (6) I rely on AI.

      The propensity to trust questionnaire is adapted from Jessup SA, Schneider T R, Alarcon GM, Ryan TJ, & Capiola A. (2019). The measurement of the propensity to trust automation. International Conference on Human-Computer Interaction.

      Our initial analyses did not find a strong link between the initial (before the experiment) responses to these questions, and how images were rated during the experiment. We will re-visit this analysis and add the results to the revised manuscript.

      Regarding how error-monitoring (or the equivalent thereof in our experiment) is perceived, we will analyse interview questions 3 (“What did you feel when you saw the AI response”) and 6 (“Did you feel surprised at any of the AI responses”) and add results to the revised manuscript.

      The task design and performance make it hard to assess how much it was truly measuring "trust" in an AI agent's feedback. The AI-feedback is yoked to the participants' performance, agreeing on 75% of trials and disagreeing on 25% (randomly), which is an important difference from the framing provided of human-AI partnerships, where AI-agents usually act independently from the humans and thus disagreements offer information about the human's own performance. In this task, disagreements are uninformative, and coupled with the at-chance performance on an ambiguous task, it is not clear how participants should be interpreting disagreements, and whether they treat it like receiving feedback about the accuracy of their choices, or whether they realise it is uninformative. Much greater discussion and justification are needed about the behaviour in the task, how participants did/should treat the feedback, and how these affect the trust/reliability ratings, as these are all central to the claims of the paper.

      In our experiment, the AI disagreements are indeed uninformative for the purpose of making a correct judgment (that is, correctly classifying images as real or fake). However, given that the AI-generated faces are so realistic and indistinguishable from the real faces, the correctness of the judgement is not the main experimental factor in this study. We argue that, provided participants were genuinely engaged in the task, their judgment accuracy is less important than their internal experience when the goal is to examine processes occurring within the participants themselves. We briefed our participants as follows before the experiment:

      “Technology can now create hyper-realistic images of people that do not exist. We are interested in your view on how well our AI system performs at identifying whether images of people’s faces are real or fake (computer-generated). Human input is needed to determine when a face looks real or fake. You will be asked to rate images as real or fake. The AI system will also independently rate the images. You will rate how reliable the AI is several times throughout the experiment.”

      We plan to more fully expand the behavioural aspect and our participants’ experience in the revised manuscript by reporting the brief post-experiment interview (R1.1.), the propensity to trust questionnaire (R2.1.), and additional analyses of the response times.
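
      For concreteness, the yoked feedback schedule discussed above (the AI agreeing with the participant on 75% of trials and disagreeing on a random 25%) can be sketched as follows. This is a minimal illustration under stated assumptions, not the experiment code: the function names, the fixed seed, the trial count, and the choice to fix the exact proportion of mismatches by shuffling (rather than drawing each trial independently) are all hypothetical.

```python
import random


def make_mismatch_schedule(n_trials, mismatch_rate=0.25, seed=0):
    """Return a shuffled list of booleans; True marks a mismatch trial.

    Assumption: the proportion of mismatches is fixed exactly and only
    their positions are randomised.
    """
    rng = random.Random(seed)
    n_mismatch = round(n_trials * mismatch_rate)
    schedule = [True] * n_mismatch + [False] * (n_trials - n_mismatch)
    rng.shuffle(schedule)
    return schedule


def ai_feedback(participant_response, is_mismatch):
    """Yoke the displayed AI judgement to the participant's own response."""
    if participant_response not in ("real", "fake"):
        raise ValueError("response must be 'real' or 'fake'")
    if is_mismatch:
        # Disagree: show the opposite label.
        return "fake" if participant_response == "real" else "real"
    # Agree: echo the participant's label.
    return participant_response


# Hypothetical usage: 200 trials in which the participant always answers "real".
schedule = make_mismatch_schedule(200)
feedback = [ai_feedback("real", m) for m in schedule]
```

      Because the displayed label depends only on the participant's response and the pre-drawn schedule, it carries no information about whether an image is actually real or fake, which is the property the reviewer highlights.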

      There are a lot of EEG results presented here, including whole-brain and window-free analyses, so greater clarity on which results were a priori hypothesised should be given, along with details on how electrodes were selected for ERPs and follow-up tests.

      We selected the electrodes primarily to maintain consistency across our findings and figures, and focused on central electrodes (Pz and Fz), provided they fell within the reported cluster. In the revised manuscript, we will also report the electrodes showing the maximal statistical effects to give a more complete and descriptive overview. Additionally, we will report where we expected specific ERP components to appear. In brief, we expected to see a P3 component post AI feedback, and a pre-response signal corresponding to the CPP. Beyond these expectations, the remaining analyses were more exploratory. Although we tentatively expected bias to relate to the CPP and reliability ratings to the P3, our results showed the opposite pattern. We will clarify this in the revised version of the manuscript.

      Reviewer #3:

      The current paper investigates neural correlates of trust development in human-AI interaction, looking at EEG signatures locked to the moment that AI advice is presented. The key finding is that both human-response-locked EEG signatures (the CPP) and post-AI-advice signatures (N2, P3) are modulated by trust ratings. The study is interesting, however, it does have some clear and sometimes problematic weaknesses:

      (1) The authors did not include "AI-advice". Instead, a manikin turned green or blue, which was framed as AI advice. It is unclear whether participants viewed this as actual AI advice.

      This point has been raised by the other reviewers as well, and we refer to the answers under R1.1. and R2.1. We will address this concern by analysing the post-experiment interviews. In particular, questions 3 (“What did you feel when you saw the AI response”), 4 (“Did that change during the trials?”) and 6 (“Did you feel surprised at any of the AI responses”) will give critical insight. As stated above, our general impression from conducting the interviews is that all participants considered the robot icon to be a decision from an independent AI agent.

      (2) The authors did not include a "non-AI" control condition in their experiment, such that we cannot know how specific all of these effects are to AI, or just generic uncertain feedback processing.

      In the conceptualization phase of this study, we indeed considered different control conditions for our experiment to contrast different kinds of feedback. However, previous EEG studies on performance monitoring ERPs have reported similar results for human and machine supervision (Somon et al., 2019; de Visser et al., 2018). We therefore decided to focus on one aspect (the judgement of observation of an AI classification), also to prevent the experiment from taking too long and risking that participants would lose concentration and motivation to complete the experiment. Comparing AI and non-AI feedback is still interesting and would be a valuable follow-up study.

      Somon B, et al. "Human or not human? Performance monitoring ERPs during human agent and machine supervision." NeuroImage 186 (2019): 266-277.

      De Visser EJ, et al. "Learning from the slips of others: Neural correlates of trust in automated agents." Frontiers in human neuroscience 12 (2018): 309.

      (3) Participants perform the task at chance level. This makes it unclear to what extent they even tried to perform the task or just randomly pressed buttons. These situations likely differ substantially from a real-life scenario where humans perform an actual task (which is not impossible) and receive actual AI advice.

      This concern was also raised by the other two reviewers. As already stated in our responses above, we will add results from the post-experiment interviews with the participants, the propensity to trust questionnaire, and additional behavioural analyses in our revised manuscript.

      Reviewer 1 (R1.3) also brought up the situation where decisions by the participant and the AI have a more direct link which carries consequences. This will be valuable follow-up research. In the revised manuscript, we will more carefully frame our approach.

      (4) Many of the conclusions in the paper are overstated or very generic.

      In the revised manuscript, we will re-phrase our discussion and conclusions to address the points raised in the reviewer’s recommendations to authors.

    1. While there is no easy exit from the morass of racial politics in North America and the roles assigned to teachers of writing, reading, and speaking within that morass, there are alternatives to thoughtlessly going along. If there is insufficient work within the field of writing studies to teach us how to think more deeply and effectively about antiracist pedagogical practice in the writing centre, then perhaps we may find aid in published scholarship outside the field, as well as inspiration and a firmer footing for producing our own.

      Racism in education, and introduces race into education.

  4. Nov 2025
    1. when we are immersed in something, surrounded by it the way we are by images from the media, we may come to accept them as just part of the real and natural world.

      This line makes sense to me because it explains how easy it is to stop questioning the media we see every day. When something is constantly shown, like stereotypes in movies or the way certain groups are portrayed, it starts to feel normal even if it's totally inaccurate. This makes Hall's point clear: we have to step back and actually think about what we're being shown instead of just absorbing it without realizing.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      From a forward genetic mosaic mutant screen using EMS, the authors identify mutations in glucosylceramide synthase (GlcT), a rate-limiting enzyme for glycosphingolipid (GSL) production, that result in EE tumors. Multiple genetic experiments strongly support the model that the mutant phenotype caused by GlcT loss is due to a failure to convert ceramide into glucosylceramide. Further genetic evidence suggests that Notch signaling is compromised in the ISC lineage and may affect the endocytosis of Delta. Loss of GlcT does not affect wing development or oogenesis, suggesting tissue-specific roles for GlcT. Finally, an increase in goblet cells in UGCG knockout mice, not previously reported, suggests a conserved role for GlcT in Notch signaling in intestinal cell lineage specification.

      Strengths:

      Overall, this is a well-written paper with multiple well-designed and executed genetic experiments that support a role for GlcT in Notch signaling in the fly and mammalian intestine. I do, however, have a few comments below.

      Weaknesses:

      (1) The authors bring up the intriguing idea that GlcT could be a way to link diet to cell fate choice. Unfortunately, there are no experiments to test this hypothesis.

      We indeed attempted to establish an assay to investigate the impact of various diets (such as high-fat, high-sugar, or high-protein diets) on the fate choice of ISCs. Subsequently, we intended to examine the potential involvement of GlcT in this process. However, we observed that the number or percentage of EEs varies significantly among individuals, even among flies with identical phenotypes subjected to the same nutritional regimen. We suspect that the proliferative status of ISCs and the turnover rate of EEs may significantly influence the number of EEs present in the intestinal epithelium, complicating the interpretation of our results. Consequently, we are unable to conduct this experiment at this time. The hypothesis suggesting that GlcT may link diet to cell fate choice remains an avenue for future experimental exploration.

      (2) Why do the authors think that UGCG knockout results in goblet cell excess and not in the other secretory cell types?

      This is indeed an interesting point. In the mouse intestine, it is well-documented that the knockout of Notch receptors or Delta-like ligands results in a classic phenotype characterized by goblet cell hyperplasia, with little impact on the other secretory cell types. This finding aligns very well with our experimental results, as we noted that the numbers of Paneth cells and enteroendocrine cells appear to be largely normal in UGCG knockout mice. By contrast, increases in other secretory cell types are typically observed under conditions of pharmacological inhibition of the Notch pathway.

      (3) The authors should cite other EMS mutagenesis screens done in the fly intestine.

      To our knowledge, the EMS screen on chromosome 2L conducted in Allison Bardin’s lab is the only one prior to this work; it led to two publications (Perdigoto et al., 2011; Gervais et al., 2019). We have now included citations for both papers in the revised manuscript.

      (4) The absence of a phenotype using NRE-Gal4 is not convincing. This is because the delay in its expression could be after the requirement for the affected gene in the process being studied. In other words, sufficient knockdown of GlcT by RNAi would not be achieved until after the relevant signaling between the EB and the ISC occurred. Dl-Gal4 is problematic as an ISC driver because Dl is expressed in the EEP.

      This is an excellent point, and we agree that the lack of an observable phenotype using NRE-Gal4 could be due to delayed expression, which may result in missing the critical window required for effective GlcT knockdown. Consequently, we cannot rule out the possibility that GlcT also plays a role in early EBs or EEPs. We have revised the manuscript to soften this conclusion and to include this alternative explanation for the experiment.

      (5) The difference in Rab5 between control and GlcT-IR was not that significant. Furthermore, any changes could be secondary to increases in proliferation.

      We agree that it is possible that the observed increase in proliferation could influence the number of Rab5+ endosomes, and we will temper our conclusions on this aspect accordingly. However, it is important to note that, although the difference in Rab5+ endosomes between the control and GlcT-IR conditions appeared mild, it was statistically significant and reproducible. In our revised experiments, we have not only added statistical data and immunofluorescence images for Rab11 but also unified the approaches used for detecting Rab-associated proteins (in the previous figures, Rab5 was shown using U-Rab5-GFP, whereas Rab7 was detected by direct antibody staining). Based on this unified strategy, we optimized the quantification of Dl-GFP colocalization with early, late, and recycling endosomes, and the results are consistent with our previous observations (see the updated Fig. 5).

      Reviewer #2 (Public review):

      Summary:

      This study genetically identifies two key enzymes involved in the biosynthesis of glycosphingolipids, GlcT and Egh, which act as tumor suppressors in the adult fly gut. Detailed genetic analysis indicates that a deficiency in Mactosyl-ceramide (Mac-Cer) is causing tumor formation. Analysis of a Notch transcriptional reporter further indicates that the lack of Mac-Cer is associated with reduced Notch activity in the gut, but not in other tissues.

      Addressing how a change in the lipid composition of the membranes might lead to defective Notch receptor activation, the authors studied the endocytic trafficking of Delta and claimed that internalized Delta appeared to accumulate faster into endosomes in the absence of Mac-Cer. Further analysis of Delta steady-state accumulation in fixed samples suggested a delay in the endosomal trafficking of Delta from Rab5+ to Rab7+ endosomes, which was interpreted to suggest that the inefficient, or delayed, recycling of Delta might cause a loss in Notch receptor activation.

      Finally, the histological analysis of mouse guts following the conditional knock-out of the GlcT gene suggested that Mac-Cer might also be important for proper Notch signaling activity in that context.

      Strengths:

      The genetic analysis is of high quality. The finding that a Mac-Cer deficiency results in reduced Notch activity in the fly gut is important and fully convincing.

      The mouse data, although preliminary, raised the possibility that the role of this specific lipid may be conserved across species.

      Weaknesses:

      This study is not, however, without caveats and several specific conclusions are not fully convincing.

      First, the conclusion that GlcT is specifically required in Intestinal Stem Cells (ISCs) is not fully convincing for technical reasons: NRE-Gal4 may be less active in GlcT mutant cells, and the knock-down of GlcT using Dl-Gal4ts may not be restricted to ISCs given the perdurance of Gal4 and of its downstream RNAi.

      As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We have revised our manuscript to present a more cautious conclusion and explicitly described this possibility in the updated version.

      Second, the results from the antibody uptake assays are not clear: i) the levels of internalized Delta were not quantified in these experiments; ii) additionally, live guts were incubated with anti-Delta for 3hr. This long period of incubation indicated that the observed results may not necessarily reflect the dynamics of endocytosis of antibody-bound Delta, but might also inform about the distribution of intracellular Delta following the internalization of unbound anti-Delta. It would thus be interesting to examine the level of internalized Delta in experiments with shorter incubation time.

      We thank the reviewer for these excellent questions. In our antibody uptake experiments, we noted that Dl reached its peak accumulation after a 3-hour incubation period. We recognize that quantifying internalized Dl would enhance our analysis, and we will include the corresponding statistical graphs in the revised version of the manuscript. In addition, we agree that during the 3-hour incubation, the potential internalization of unbound anti-Dl cannot be ruled out, as it may influence the observed distribution of intracellular Dl. We therefore attempted to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. However, we found that the GFP expression level of Dl-GFP (either in the knock-in or transgenic line) was too low to be reliably tracked. During the three-hour observation period, the weak GFP signal remained largely unchanged regardless of the GlcT mutation status, and the signal resolution under the microscope was insufficient to clearly distinguish membrane-associated from intracellular Dl. Therefore, we were unable to obtain a dynamic view of Dl trafficking through live imaging. Nevertheless, our Dl antibody uptake and endosomal retention analyses collectively support the notion that MacCer influences Notch signaling by regulating Dl endocytosis.

      Overall, the proposed working model needs to be solidified as important questions remain open, including: is the endo-lysosomal system, i.e. steady-state distribution of endo-lysosomal markers, affected by the Mac-Cer deficiency? Is the trafficking of Notch also affected by the Mac-Cer deficiency? is the rate of Delta endocytosis also affected by the Mac-Cer deficiency? are the levels of cell-surface Delta reduced upon the loss of Mac-Cer?

      Regarding the impact on the endo-lysosomal system, this is indeed an important aspect to explore. While we did not conduct experiments specifically designed to evaluate the steady-state distribution of endo-lysosomal markers, our analyses utilizing Rab5-GFP overexpression and Rab7 staining did not indicate any significant differences in endosome distribution in MacCer deficient conditions. Moreover, we still observed high expression of the NRE-LacZ reporter specifically at the boundaries of clones in GlcT mutant cells (Fig. 4A), indicating that GlcT mutant EBs remain responsive to Dl produced by normal ISCs located right at the clone boundary. Therefore, we propose that MacCer deficiency may specifically affect Dl trafficking without impacting Notch trafficking.

      In our 3-hour antibody uptake experiments, we observed a notable decrease in cell-surface Dl, which was accompanied by an increase in intracellular accumulation. These findings collectively suggest that Dl may be unstable on the cell surface, leading to its accumulation in early endosomes.

      Third, while the mouse results are potentially interesting, they seem to be relatively preliminary, and future studies are needed to test whether the level of Notch receptor activation is reduced in this model.

      In the mouse small intestine, Olfm4 is a well-established target gene of the Notch signaling pathway, and its staining provides a reliable indication of Notch pathway activation. While we attempted to evaluate Notch activation using additional markers, such as Hes1 and NICD, we encountered difficulties, as the corresponding antibody reagents did not perform well in our hands. Despite these challenges, we believe that our findings with Olfm4 provide an important starting point for further investigation in the future.

      Reviewer #3 (Public review):

      Summary:

      In this paper, Tang et al report the discovery of a glucosylceramide synthase gene, GlcT, which they found in a genetic screen for mutations that generate tumorous growth of stem cells in the gut of Drosophila. The screen was expertly done using a classic mutagenesis/mosaic method. Their initial characterization of the GlcT alleles, which generate endocrine tumors much like mutations in the Notch signaling pathway, is also very nice. Tang et al checked other enzymes in the glycosylceramide pathway and found that the loss of one gene just downstream of GlcT (Egh) gives similar phenotypes to GlcT, whereas three genes further downstream do not replicate the phenotype. Remarkably, dietary supplementation with a predicted GlcT/Egh product, Lactosyl-ceramide, was able to substantially rescue the GlcT mutant phenotype. Based on the phenotypic similarity of the GlcT and Notch phenotypes, the authors show that activated Notch is epistatic to GlcT mutations, suppressing the endocrine tumor phenotype and that GlcT mutant clones have reduced Notch signaling activity. Up to this point, the results are all clear, interesting, and significant. Tang et al then go on to investigate how GlcT mutations might affect Notch signaling, and present results suggesting that GlcT mutation might impair the normal endocytic trafficking of Delta, the Notch ligand. These results (Fig X-XX), unfortunately, are less than convincing; either more conclusive data should be brought to support the Delta trafficking model, or the authors should limit their conclusions regarding how GlcT loss impairs Notch signaling. Given the results shown, it's clear that GlcT affects EE cell differentiation, but whether this is via directly altering Dl/N signaling is not so clear, and other mechanisms could be involved. Overall the paper is an interesting, novel study, but it lacks somewhat in providing mechanistic insight. With conscientious revisions, this could be addressed.

      We list below specific points that Tang et al should consider as they revise their paper.

      Strengths:

      The genetic screen is excellent.

      The basic characterization of GlcT phenotypes is excellent, as is the downstream pathway analysis.

      Weaknesses:

      (1) Lines 147-149, Figure 2E: here, the study would benefit from quantitations of the effects of loss of brn, B4GalNAcTA, and a4GT1, even though they appear negative.

      We have incorporated the quantifications for the effects of the loss of brn, B4GalNAcTA, and a4GT1 in the updated Figure 2.

      (2) In Figure 3, it would be useful to quantify the effects of LacCer on proliferation. The suppression result is very nice, but only effects on Pros+ cell numbers are shown.

      We have now added quantifications of the number of EEs per clone to the updated Figure 3.

      (3) In Figure 4A/B we see less NRE-LacZ in GlcT mutant clones. Are the data points in Figure 4B per cell or per clone? Please note. Also, there are clearly a few NRE-LacZ+ cells in the mutant clone. How does this happen if GlcT is required for Dl/N signaling?

      In Figure 4B, the data points represent the fluorescence intensity per single cell within each clone. It is true that a few NRE-LacZ+ cells can still be observed within the mutant clone; however, this does not contradict our conclusion. As noted, high expression of the NRE-LacZ reporter was specifically observed around the clone boundaries in MacCer deficient cells (Fig. 4A), indicating that the mutant EBs can normally receive Dl signal from the normal ISCs located at the clone boundary and activate the Notch signaling pathway. Therefore, we believe that, although affecting Dl trafficking, MacCer deficiency does not significantly affect Notch trafficking.

      (4) Lines 222-225, Figure 5AB: The authors use the NRE-Gal4ts driver to show that GlcT depletion in EBs has no effect. However, this driver is not activated until well into the process of EB commitment, and RNAi's take several days to work, and so the authors' conclusion that GlcT is "specifically required in ISCs" and not at all in EBs may be erroneous.

      As previously mentioned, we acknowledge that a role for GlcT in early EBs or EEPs cannot be completely ruled out. We have revised our manuscript to present a more cautious conclusion and described this possibility in the updated version.

      (5) Figure 5C-F: These results relating to Delta endocytosis are not convincing. The data in Fig 5C are not clear and not quantitated, and the data in Figure 5F are so widely scattered that it seems these co-localizations are difficult to measure. The authors should either remove these data, improve them, or soften the conclusions taken from them. Moreover, it is unclear how the experiments tracing Delta internalization (Fig 5C) could actually work. This is because for this method to work, the anti-Dl antibody would have to pass through the visceral muscle before binding Dl on the ISC cell surface. To my knowledge, antibody transcytosis is not a common phenomenon.

      We thank the reviewer for these insightful comments and suggestions. In our in vivo experiments, we observed increased co-localization of Rab5 and Dl in GlcT mutant ISCs, indicating that Dl trafficking is delayed at the transition to Rab7⁺ late endosomes, a finding that is further supported by our antibody uptake experiments. We acknowledge that the data presented in Fig. 5C are not fully quantified and that the co-localization data in Fig. 5F may appear somewhat scattered; therefore, we have included additional quantification and enhanced the data presentation in the revised manuscript.

      Regarding the concern about antibody internalization, we appreciate this point. We currently do not know if the antibody reaches the cell surface of ISCs by passing through the visceral muscle or via other routes. Given that the experiment was conducted with fragmented gut, it is possible that the antibody may penetrate into the tissue through mechanisms independent of transcytosis.

      As mentioned earlier, we attempted to supplement our findings with live imaging experiments to investigate the dynamics of Dl/Notch endocytosis in both normal and GlcT mutant ISCs. However, we found that the GFP expression level of Dl-GFP (either in the knock-in or transgenic line) was too low to be reliably tracked. During the three-hour observation period, the weak GFP signal remained largely unchanged regardless of the GlcT mutation status, and the signal resolution under the microscope was insufficient to clearly distinguish membrane-associated from intracellular Dl. Therefore, we were unable to obtain a dynamic view of Dl trafficking through live imaging. Nevertheless, our Dl antibody uptake and endosomal retention analyses collectively support the notion that MacCer influences Notch signaling by regulating Dl endocytosis.

      (6) It is unclear whether MacCer regulates Dl-Notch signaling by modifying Dl directly or by influencing the general endocytic recycling pathway. The authors say they observe increased Dl accumulation in Rab5+ early endosomes but not in Rab7+ late endosomes upon GlcT depletion, suggesting that the recycling endosome pathway, which retrieves Dl back to the cell surface, may be impaired by GlcT loss. To test this, the authors could examine whether recycling endosomes (marked by Rab4 and Rab11) are disrupted in GlcT mutants. Rab11 has been shown to be essential for recycling endosome function in fly ISCs.

      We agree that assessing the state of recycling endosomes, especially by using markers such as Rab11, would be valuable in determining whether MacCer regulates Dl-Notch signaling by directly modifying Dl or by influencing the broader endocytic recycling pathway. In the newly added experiments, we found that in GlcT-IR flies, Dl still exhibits partial colocalization with Rab11, and the overall expression pattern of Rab11 is not affected by GlcT knockdown (Fig. 5E-F). These observations suggest that MacCer specifically regulates Dl trafficking rather than broadly affecting the recycling pathway.

      (7) It remains unclear whether Dl undergoes post-translational modification by MacCer in the fly gut. At a minimum, the authors should provide biochemical evidence (e.g., Western blot) to determine whether GlcT depletion alters the protein size of Dl.

      While we propose that MacCer may function as a component of lipid rafts, facilitating Dl membrane anchorage and endocytosis, we also acknowledge the possibility that MacCer could serve as a substrate for protein modifications of Dl necessary for its proper function. Conducting biochemical analyses to investigate potential post-translational modifications of Dl by MacCer would indeed provide valuable insights. We have performed Western blot analysis to test whether GlcT depletion affects the protein size of Dl. As shown below, we did not detect any apparent changes in the molecular weight of the Dl protein. Therefore, it is unlikely that MacCer regulates post-translational modifications of Dl.

      Author response image 1.

      To investigate whether MacCer modifies Dl by Western blot. (A) Four lanes were loaded: the first two contained 20 μL of membrane extract (lane 1: GlcT-IR, lane 2: control), while the last two contained 10 μL of membrane extract. (B) Full blot images are shown under both long and short exposure conditions.

      (8) It is unfortunate that GlcT doesn't affect Notch signaling in other organs on the fly. This brings into question the Delta trafficking model and the authors should note this. Also, the clonal marker in Figure 6C is not clear.

      In the revised working model, we have explicitly described that the events occur in intestinal stem cells. Regarding Figure 6C, we have delineated the clone with a white dashed line to enhance its clarity and visual comprehension.

      (9) The authors state that loss of UGCG in the mouse small intestine results in a reduced ISC count. However, in Supplementary Figure C3, Ki67, a marker of ISC proliferation, is significantly increased in UGCG-CKO mice. This contradiction should be clarified. The authors might repeat this experiment using an alternative ISC marker, such as Lgr5.

      Previous studies have indicated that dysregulation of the Notch signaling pathway can result in a reduction in the number of ISCs. While we did not perform a direct quantification of ISC numbers in our experiments, our Olfm4 staining—which serves as a reliable marker for ISCs—demonstrates a clear reduction in the number of positive cells in UGCG-CKO mice.

      The increased Ki67 signal we observed reflects enhanced proliferation in the transit-amplifying region, and it does not directly indicate an increase in ISC number. Therefore, in UGCG-CKO mice, we observe a decrease in the number of ISCs, while there is an increase in transit-amplifying (TA) cells (progenitor cells). This increase in TA cells is probably a secondary consequence of the loss of barrier function associated with the UGCG knockout.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3 (Recommendations for the authors):

      The authors have done an excellent job of addressing most comments, but my concerns about Figure 5 remain. I appreciate the authors' efforts to address the problem involving Rs being part of the computation on both the x and y axes of Figure 5, but addressing this via simulation addresses statistical significance but overlooks effect size. I think the authors may have misunderstood my original suggestion, so I will attempt to explain it better here. Since "Rs" is an average across all trials, the trials could be subdivided in two halves to compute two separate averages - for example, an average of the even numbered trials and an average of the odd numbered trials. Then you would use the "Rs" from the even numbered trials for one axis and the "Rs" from the odd numbered trials for the other. You would then plot R-Rs_even vs Rf-Rs_odd. This would remove the confound from this figure, and allow the text/interpretation to be largely unchanged (assuming the results continue to look as they do).

      We have added a description and the result of the new analysis (line #321 to #332), and a supplementary figure (Suppl. Fig. 1) (line #1464 to #1477). 

      “We calculated 𝑅<sub>𝑠</sub> in the ordinate and abscissa of Figure 5A-E using responses averaged across different subsets of trials, such that 𝑅<sub>𝑠</sub> was no longer a common term in the ordinate and abscissa. For each neuron, we determined 𝑅<sub>𝑠1</sub> by averaging the firing rates of 𝑅<sub>𝑠</sub> across half of the recorded trials, selected randomly. We also determined 𝑅<sub>𝑠2</sub> by averaging the firing rates of 𝑅<sub>𝑠</sub> across the rest of the trials. We regressed (𝑅 − 𝑅<sub>𝑠1</sub>) on (𝑅<sub>𝑓</sub> − 𝑅<sub>𝑠2</sub>), as well as (𝑅 − 𝑅<sub>𝑠2</sub>) on (𝑅<sub>𝑓</sub> − 𝑅<sub>𝑠1</sub>), and repeated the procedure 50 times. The averaged slopes obtained with 𝑅<sub>𝑠</sub> from the split trials showed the same pattern as those using 𝑅<sub>𝑠</sub> from all trials (Table 1 and Supplementary Fig. 1), although the coefficient of determination was slightly reduced (Table 1). For ×4 speed separation, the slopes were nearly identical to those shown in Figure 5F1. For ×2 speed separation, the slopes were slightly smaller than those in Figure 5F2, but followed the same pattern (Supplementary Fig. 1). Together, these analysis results confirmed the faster-speed bias at the slow stimulus speeds, and the change of the response weights as stimulus speeds increased.”
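
      The split-trial control described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the input layout (a list of single-trial responses per neuron) and the use of ordinary least squares with an intercept are all assumptions made for the example.

```python
import random
from statistics import fmean


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def split_half_slope(r_bi, r_fast, r_slow_trials, n_repeats=50, seed=0):
    """Average regression slope after splitting the slower-component trials.

    r_bi, r_fast: per-neuron trial-averaged responses (R and Rf).
    r_slow_trials: per-neuron lists of single-trial responses to the slower
        component (hypothetical input layout, assumed for this sketch).
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_repeats):
        rs1, rs2 = [], []
        for trials in r_slow_trials:
            shuffled = list(trials)
            rng.shuffle(shuffled)  # random half-split, drawn per neuron
            half = len(shuffled) // 2
            rs1.append(fmean(shuffled[:half]))
            rs2.append(fmean(shuffled[half:]))
        # Rs1 and Rs2 come from disjoint trials, so the ordinate and the
        # abscissa no longer share a common noisy term.
        for rs_y, rs_x in ((rs1, rs2), (rs2, rs1)):
            y = [r - s for r, s in zip(r_bi, rs_y)]
            x = [f - s for f, s in zip(r_fast, rs_x)]
            slopes.append(ols_slope(x, y))
    return fmean(slopes)
```

      Each repeat contributes two slopes, one for each assignment of the two halves to the axes, and the function returns their average over all repeats.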

      An additional remaining item concerns the terminology weighted sum, in the context of the constraint that wf and ws must sum to one. My opinion is that it is non-standard to use weighted sum when the computation is a weighted average, but as long as the authors make their meaning clear, the reader will be able to follow. I suggest adding some phrasing to explain to the reader the shift in interpretation from the more general weighted sum to the more constrained weighted average. Specifically, "weighted sum" first appears on line 268, and then the additional constraint of ws + wf =1 is introduced on line 278. Somewhere around line 278, it would be useful to include a sentence stating that this constraint means the weighted sum is constrained to be a weighted average.

      Thanks for the suggestion. We have modified the text as follows. Since we made other modifications in the text, the line numbers are slightly different from the last version. 

      Line #274 to 275: 

      “Since it is not possible to solve for both variables, 𝑤<sub>𝑠</sub> and 𝑤<sub>𝑓</sub>, from a single equation (Eq. 5) with three data points, we introduced an additional constraint: 𝑤<sub>𝑠</sub> + 𝑤<sub>𝑓</sub> =1. With this constraint, the weighted sum becomes a weighted average.”
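
      To illustrate why the constraint makes the problem solvable, the sketch below assumes that Eq. 5 (not reproduced in this response) has the weighted-sum form R = w_s·R_s + w_f·R_f. Substituting w_s = 1 − w_f gives R − R_s = w_f·(R_f − R_s), so a single neuron's three responses determine both weights. The function name is hypothetical.

```python
def response_weights(r_bi, r_fast, r_slow):
    """Return (w_s, w_f) for one neuron under the constraint w_s + w_f = 1.

    Assumes the weighted-sum form R = w_s * Rs + w_f * Rf, which reduces to
    R - Rs = w_f * (Rf - Rs) once w_s is replaced by 1 - w_f.
    """
    if r_fast == r_slow:
        # The weights are undefined when the two component responses are equal.
        raise ValueError("Rf and Rs must differ to solve for the weights")
    w_f = (r_bi - r_slow) / (r_fast - r_slow)
    return 1.0 - w_f, w_f
```

      For example, with R = 26, R_f = 30 and R_s = 10 (made-up numbers), the weight for the faster component is 16/20 = 0.8 and the weight for the slower component is 0.2.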

      Also on line #309:

      “First, at each speed pair and for each of the 100 neurons in the data sample shown in Figure 5, we simulated the response to the bi-speed stimuli (𝑅<sub>𝑒</sub>) as a randomly weighted average of 𝑅<sub>𝑓</sub> and 𝑅<sub>𝑠</sub> of the same neuron. 

      𝑅<sub>𝑒</sub> = 𝑎𝑅<sub>𝑓</sub> + (1 − 𝑎)𝑅<sub>𝑠</sub>

      in which 𝑎 was a randomly generated weight (between 0 and 1) for 𝑅<sub>𝑓</sub>, and the weights for 𝑅<sub>𝑓</sub> and 𝑅<sub>𝑠</sub> summed to one.”
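
      The simulation step quoted above can be sketched as below. The quoted text only states that the weight lies between 0 and 1; drawing it from a uniform distribution, and the function name, are assumptions made for this example.

```python
import random


def simulate_bispeed_responses(r_fast, r_slow, seed=0):
    """Simulate one bi-speed response per neuron as a randomly weighted average.

    r_fast, r_slow: per-neuron responses to the faster and slower components.
    The weight a for Rf is drawn uniformly from [0, 1) (assumed distribution);
    the weight for Rs is 1 - a, so the two weights sum to one.
    """
    rng = random.Random(seed)
    simulated = []
    for rf, rs in zip(r_fast, r_slow):
        a = rng.random()
        simulated.append(a * rf + (1.0 - a) * rs)
    return simulated
```

      Because the weights sum to one, each simulated response necessarily falls between that neuron's two component responses.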

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)): The authors map the ZFP36L1 protein interactome in human T cells using UltraID proximity labeling combined with quantitative mass spectrometry. They optimize labeling conditions in primary T cells, profile resting and activated cells, and include a time course at 2, 5, and 16 hours. They complement the interactome with co-immunoprecipitation in the presence or absence of RNase to assess RNA dependence. They then test selected candidates using CRISPR knockouts in primary T cells, focusing on UPF1 and GIGYF1/2, and report effects on global translation, stress, activation markers, and ZFP36L1 protein levels. The work argues that ZFP36L1 sits at the center of multiple post-transcriptional pathways in T cells (which in itself is not a novel finding) and that UPF1 supports ZFP36L1 expression at the mRNA and protein level. The main model system is primary human T cells, with some data in Jurkat cells.

      The core datasets show thousands of identified proteins in total lysates and enriched biotinylated fractions. Known partners from CCR4-NOT, decapping, stress granules, and P-bodies appear, with additional candidates like GIGYF1/2, PATL1, DDX6, and UPF1. Time-resolved labeling suggests shifts in proximity during early activation. Co-IP with and without RNase suggests both RNA-dependent and RNA-independent contacts. CRISPR loss of UPF1 or GIGYF1/2 increases translation at rest and elevates activation markers, and UPF1 loss reduces ZFP36L1 protein and mRNA while MG132 does not rescue protein levels; UPF1 RIP enriches ZFP36L1 mRNA.

      Among patterns worth noting are that the activation state drives the principal variance in both proteome and proximity datasets. Deadenylation, decapping, and granule proteins are consistently near ZFP36L1 across conditions, while some contacts dip at 2 hours and recover by 5 to 16 hours. Mitochondrial ribosomal proteins become more proximal later. UPF1 and GIGYF1 show time-linked behavior and RNase sensitivity that fits roles in mRNA surveillance and translational control. These observations support a dynamic hub model, though they remain proximity-based rather than direct binding maps.

      We thank the reviewer for their careful reading and thoughtful summary. Please find our point-by-point response below.

      Major comments

      1) The key conclusions are directionally convincing for a broad and dynamic ZFP36L1 neighborhood in human T cells. The data robustly recover established complexes and add plausible candidates. The time-course and RNase experiments strengthen the claim that interactions shift with activation state and RNA context. The functional tests around UPF1 and GIGYF1/2 point to biological relevance. That said, some statements could be qualified. The statement that ZFP36L1 "coordinates" multiple pathways implies mechanism and directionality that proximity data alone cannot prove. I suggest reframing as "positions ZFP36L1 within" or "supports a model where ZFP36L1 sits within" these networks.

      We thank this reviewer for considering our data ‘directionally convincing, and robust, adding new plausible candidates as interactors with ZFP36L1’. We agree that the proposed wording is more appropriate and will change it accordingly.

      2) UPF1, as an upstream regulator of ZFP36L1 expression, is a promising lead. The reduction of ZFP36L1 protein and mRNA in UPF1 knockout, the non-rescue by MG132, and the UPF1 RIP on ZFP36L1 mRNA together argue that UPF1 influences ZFP36L1 transcript output or processing. This claim would read stronger with one short rescue or perturbation that pins the mechanism. A compact test would be UPF1 re-expression in UPF1-deficient T cells with wild-type and helicase-dead alleles. This is realistic in primary T cells using mRNA electroporation or virus-based systems. Approximate time 2 to 3 weeks, including guide design check and expansion. Reagents and sequencing about 2 to 4k USD depending on donor numbers. This would help separate viability or stress effects from a direct role in ZFP36L1 mRNA handling.

      We agree that a rescue experiment with wild-type and helicase-dead UPF1 in UPF1-deficient primary T cells would be interesting. Unfortunately, however, UPF1 knockout T cells are less viable and divide less (Supp Figure 6B), making further manipulations such as re-expression by viral transduction technically impossible. We will clarify this limitation in the Discussion and will more explicitly indicate that UPF1 promotes ZFP36L1 mRNA and protein expression, while acknowledging that the precise mechanistic contribution of UPF1 (e.g. to transcript processing, export, or surveillance) remains to be fully resolved.

      3) The inference that ZFP36L1 proximity to decapping and deadenylation complexes reflects pathway engagement is reasonable and, frankly, expected. Still, where the manuscript moves from proximity to function, the narrative works best when supported by orthogonal validation. Two compact additions would raise confidence without opening new lines of work. First, a small set of reciprocal co-IPs for PATL1 or DDX6 at endogenous levels in activated T cells, run with and without RNase, would tie the RNase-class assignments to biochemistry. Second, a short-pulse proximity experiment using a reduced biotin dose and shorter labeling window in activated cells would address whether long incubations drive non-specific labeling. Both are feasible in 2 to 3 weeks with minimal extra cost for antibodies and MS runs if the facility is in-house.

      We fully agree with the reviewer that orthogonal biochemical validation is valuable. Therefore, we already combined time-resolved proximity labeling (between 0-2h, 2-5h, and 5-16 hours) with time-resolved ZFP36L1 co-IPs ± RNase, to address the dynamic behavior and potential temporal broadening of the interactome.

      As to running reciprocal co-IPs for PATL1 or DDX6: we had in fact already considered following up on PATL1. However, we failed to identify specific antibodies; the antibodies we tested revealed many nonspecific bands (see below). As to DDX6, antibodies suitable for IP have been reported, and we can therefore offer such a reciprocal IP as requested.

      To further address the raised points, we will (i) clarify how we define and interpret RNase-sensitive versus RNase-resistant classes; (ii) emphasize that some key factors (including PATL1) are already detected in shorter labeling conditions (2 h) in activated T cells (Fig 4C); and (iii) better highlight that our data provide strong candidates and pathway hypotheses that warrant further mechanistic experimentation in follow-up studies, when moving from proximity to function.

      As to the suggested lower dose of biotin: as described in Figure S1, this was unsuccessful. We attribute this to the reported dependence on and use of biotin in primary T cells (Refs 31-33 of this manuscript). For the same reason, we could not culture T cells in biotin-free medium prior to labeling, as most protocols for cell lines do.

      The reviewer also suggested shorter labeling times. Please be advised that the labeling times chosen were based on the reported protein induction and activity on target mRNAs: ZFP36L1 1) peaks in expression at 2h of T cell activation (Zandhuis et al. 2025; 10.1002/eji.202451641, Petkau et al. 2024; 10.1002/eji.202350700), 2) shows the strongest effects on T cell function between 4-5h, and 3) displays a late phase of activity at 5-16h (Popovic et al. Cell Reports 2023; 10.1016/j.celrep.2023.112419). We realize that additional explanation is warranted for this rationale, which we will provide.

      4) Reproducibility is helped by donor pooling, repeated T-cell screens, Jurkat confirmation, and detailed methods including MaxQuant, LIMMA, and supervised patterning. Deposition of MS data is listed. The authors should consider adding a brief, stand-alone analysis notebook in SI or on GitHub with exact filtering thresholds and "shape" definitions, since the supervised profiles are central to claims. This would let others reproduce figures from raw tables with the same code and workflows.

      We thank the reviewer for his or her suggestion and we have done as suggested. We will include the following link in the manuscript: https://github.com/ajhoogendijk/ZFP36L1_UltraID

      5) Replication and statistics are mostly adequate for discovery proteomics. The thresholds are clear, and PCA and correlation frameworks are appropriate. For functional readouts in edited T cells, please make the number of donors and independent experiments explicit in figure legends, and indicate whether statistics are paired by donor. Where viability differs (UPF1), note any gating strategies used to avoid bias in puromycin or activation marker measurements. These clarifications are quick to add.

      Please be advised that the current figure legends already contain the requested information at the bottom (which test used, donor number etc). To highlight this better, we will indicate this point more explicitly in the methods section.

      Minor comments

      6) The UltraID optimization in primary T cells is useful, but the long 16-hour labeling and high biotin should be framed as a compromise rather than a standard. A short statement about potential off-target labeling during extended incubations would set expectations and justify the RNase and time-course controls.

      Please be advised that high biotin was required because primary T cells 1) depend on biotin and 2) increase biotin absorption 2-7-fold upon activation (Refs 31-33 of the manuscript). For better time resolution, we included labeling periods of 2h (from 0-2h of activation), 3h (from 2-5h) and 11h (from 5-16h) of T cell activation. Nevertheless, we agree that we cannot exclude the risk of off-target labeling, which in fact is inherent to any labeling and pulldown method. We will include such a statement in the Discussion.

      7) The overlap across T-cell screens and with HEK293T APEX datasets is discussed, but a compact quantitative reconciliation would help. A table that lists shared versus cell-type-specific interactors with brief notes on known expression patterns would make this point concrete.

      We thank the reviewer for this suggestion. We agree and will include such a table.

      8) Figures are generally clear. Where proximity and total proteome PCA are shown, consider adding sample-wise annotations for donor pools and activation time to help readers link variance to biology. Ensure all volcano plots and heatmaps display the exact cutoffs used in text.

      We agree that sample-wise annotations would be a nice addition. However, when testing this for e.g. Figure 1D&E, such differentiation into individual donors becomes illegible due to the many different variables already present. We therefore decided against it.

      9) Prior work on ZFP36 family roles in decay, deadenylation via CCR4-NOT, granules, and translational control is cited within the manuscript. In a few places, recent proximity and interactome papers could be more explicitly integrated when comparing overlap, especially where conclusions differ by cell type. A concise paragraph in Discussion that lays out what is truly new in primary T cells would help clarify the contribution of this work to the field.

      We appreciate this suggestion and will revise the Discussion accordingly. As to what is new in primary T cells, we would also like to mention that adding H2O2 (required for APEX labeling) to T cells results in immediate cell death, so APEX labeling cannot be employed in T cells. This technical limitation further underscores the valuable contribution of the UltraID-based approach we present here.

      Reviewer #1 (Significance (Required)):

      Nature and type of advance. The study is a technical and contextual advance in mapping ZFP36L1 proximity partners directly in human primary T cells during activation. The combination of time-resolved labeling and RNase-class assignments is informative. The CRISPR perturbations provide an initial functional bridge from proximity to phenotype, especially for UPF1.

      Context in the literature. ZFP36 family proteins have long been linked to ARE-mediated decay, CCR4-NOT recruitment, and granule localization. The present work confirms those cores and extends them to include decapping and GIGYF1/2-4EHP scaffolds in primary T cells with temporal resolution. The UPF1 link to ZFP36L1 expression adds a plausible surveillance angle that merits follow-up. The cell-type specificity analysis versus HEK293T underscores that proximity networks vary with context.

      Audience. Readers in RNA biology, T-cell biology, and proteomics will find the dataset valuable. Groups studying post-transcriptional regulation in immunity can use the resource to prioritize candidate nodes for mechanistic work.

      Expertise and scope. I work on post-transcriptional regulation, RNA-protein complexes, and T-cell effector biology. I am comfortable evaluating the conceptual claims, experimental design, and statistical treatment. I am not a mass spectrometry specialist, so I rely on the presented parameters and deposited data for MS acquisition specifics.

      To conclude, the manuscript delivers a substantive proximity map of ZFP36L1 in human T cells, with useful temporal and RNA-class information. The UPF1 observations are promising and would benefit from a compact rescue to secure causality. A few minor additions for biochemical validation and transparency in replication would further strengthen the paper.

      We thank the reviewer for this comprehensive and constructive assessment. We agree that our study primarily provides a substantive and well-annotated proximity map of ZFP36L1 in human T cells, including temporal and RNA-class information, and that the UPF1 observations constitute a promising lead that merits more detailed mechanistic analysis in follow-up studies.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)): The manuscript by Wolkers and colleagues describes the protein interactome of the RNA-binding protein ZFP36L1 in primary human T-cells. There is inherent value in the use of primary cells of human origin, but there is also value in that the study is quite complete, as it is performed in a variety of conditions: T-cells that have been activated or not, at different time points after activation, and by two methods (co-IP and proximity labeling). One might imagine that this basically covers all what can be detected for this protein in T-cells. The authors report a large amount of new interactors involved at all steps in post-transcriptional regulation. In addition, the authors show that UPF1, a known interactor of ZFP36L1, actually binds to ZFP36L1 mRNA and enhances its levels. In sum, the work provides a valuable resource of ZFP36L1 interactors. Yet, how the data add to the mechanistic understanding of ZFP36L1 functions and/or regulation of ZFP36L1 remains unclear.

      We thank the reviewer for their enthusiasm about our experimental setup, and for considering the use of primary T cells to be of inherent value and our study, with its variety of conditions, to be complete.

      Major comments:

      1) Fig 2: It is confusing that the Pearson correlation to define ZFP36L1 interactors is changed depending on figure panel. In panels A-C, a correlation ≥ 0.6 is used, while panel D uses a correlation > 0.5, which changes the number of interactors. Then, this is changed again in Fig 3A for some cell types but not for others. Why has this been done? It would be better to stick to the same thresholds throughout the manuscript.

      Please be advised that the different correlation thresholds arise from the composition of the individual datasets: they differ in depth, number of controls, and overall dynamic range. The initial proximity labeling experiment (Figure 2A–C) had a higher depth and a larger number of suitable control samples, which allowed us to apply a stricter cutoff (r ≥ 0.6). The time-course experiment and some of the cross-cell-type comparisons have fewer controls and somewhat lower depth, which then required a more permissive threshold (e.g. r > 0.5) to retain known core interactors.

      We fully agree that this rationale needs to be explicit. In the revised manuscript we (i) clearly state for each dataset which correlation cutoff is used; (ii) emphasize that these thresholds are somewhat arbitrary and should not be directly compared across experiments; and (iii) highlight that our key biological conclusions do not depend on the exact boundary chosen but rather on the consistent enrichment of core complexes and pathways across datasets.

      2) Fig 3A: It would be nice to have the information of this Figure panel as a Table (protein name, molecular process(es), known or novel, previously detected in which cells) in addition to the figure.

      We agree that this would increase the value of our work as a resource to the community, and we will include such table and merge it with the table Reviewer 1 asked about.

      3) Fig 6: To what extent are the effects of UPF1 and GIGYF1 knock-out on translation and T-cell hyper-activation mediated by ZFP36L1? If deletion of ZFP36L1 itself has no effect on these processes, it seems unlikely that it is involved. In this respect, I am not sure that Fig 6 contributes to the understanding of ZFP36L1.

      We appreciate this conceptual question. In our dataset, ZFP36L1 knockout affects T-cell activation markers, but does not recapitulate the increased global translation observed upon UPF1 or GIGYF1/2 deletion. We will discuss this finding more explicitly in the Results and Discussion. We will also discuss the possibility that other ZFP36 family members (e.g. ZFP36/TTP, ZFP36L2) may partially compensate for the absence of ZFP36L1 in some readouts. Moreover, we will emphasize that at this point it is not clear whether ZFP36L1’s contribution to UPF1 and GIGYF1 protein levels is direct or indirect.

      We nonetheless consider Fig. 6 an important component of the story, as it demonstrates that proximity partners emerging from the interactome (UPF1, GIGYF1/2) have measurable functional consequences on T cell activation and translational control, thereby illustrating how the resource can guide mechanistic hypotheses. We will now more carefully phrase this as “first indications of mechanism” and avoid implying that these phenotypes are mediated exclusively via ZFP36L1.

      4) Fig 7E: Differences in ZFP36L1 mRNA expression are claimed as a consequence of UPF1 deletion, and indeed there is a clear tendency to reduction of ZFP36L1 mRNA levels upon UPF1 KO. Yet the difference is statistically non-significant. Please, repeat this experiment to increase statistical significance. In addition, a clear discussion on how UPF1 -generally associated to mRNA degradation- contributes to increase ZFP36L1 mRNA levels would be appreciated.

      We would like to refrain from including repeats to increase statistical power. We find similar trends with n=3 at 0h as with n=7 at 3h of activation (Fig. 7E). We would rather stress that, despite the wide spread in overall expression levels, which most probably stems from using primary human material, the overall levels of ZFP36L1 mRNA are lower in UPF1 KO T cells. We will include a point on how loss of UPF1 may lead to decreased ZFP36L1 mRNA levels, as suggested.

      5) Fig 6A: The decrease in global translation by GIGYF1 knock-out upon activation claimed by the authors is not clear in Fig 6A and is non-significant upon quantification. Please, modify narrative accordingly.

      Indeed, this was not phrased well. We will correct our description to match the statistical analysis.

      6) Page 6: The authors state 'This included the PAN2/3 complex proteins which trim poly(A) tails prior to mRNA degradation through the CCR4/NOT complex'. To the best of my knowledge, the CCR4/NOT complex does not degrade the body of the mRNA. Both PAN2/3 and CCR4/NOT are deadenylases that function independently.

      We thank the reviewer for highlighting this inaccuracy. PAN2/3 and CCR4–NOT are indeed both deadenylase complexes that function independently rather than one acting strictly upstream of the other in degrading the mRNA body. We will correct this statement to say that PAN2/3 and CCR4–NOT cooperate in poly(A) tail shortening and do not themselves degrade the mRNA body, which is instead handled by the downstream decay machinery.

      7) Please, label all Table sheets. Right now one has to guess what is being shown in most of them. Furthermore, it would be convenient to join all Tables related to the same Figure in one unique Excel with several sheets, rather than having many Tables with only one sheet each.

      We appreciate this suggestion. In the revised supplementary files, all table sheets are clearly labeled to indicate the corresponding figure and dataset, and are combined into a single Excel file when multiple tables relate to the same figure. We have already done so.

      Minor comments:

      8) Fig 1E: Shouldn't there be a better separation by biotinylation in the UltraID IP principal component analysis? In theory, only biotinylated proteins should be immunoprecipitated.

      In theory this should indeed be the case. However, in practice, pulldown experiments always suffer from background stickiness of proteins to tubes, beads etc. These known background issues highlight the critical importance of including control samples, which allow unequivocal calling of proteins that are above background.

      In addition, as we indicated in the manuscript, primary T cells depend on biotin. This prevented us from using biotin-free medium, even for a short culture period (it resulted in cell death). Such biotin-free culture steps are included in proximity labeling assays performed in cell lines. Owing to the continuous addition of biotin, some of the ‘background’ biotinylation signal may even be ‘real’. Nevertheless, the higher levels of biotin we added during the labeling result in increased signals, and statistical analysis with these controls identifies which of the proteins are above background, irrespective of the source. We will include a short note on this in the manuscript.

      9) Fig 3B-E: Is the labeling not swapped, top (always +) is Biotin and bottom (- or +) is aCD3/aCD28?

      We thank the reviewer for catching this mistake; we have corrected it.

      10) Fig 7A data is from another paper, so I suggest to move this panel to Supplementary materials.

      We respectfully disagree. This figure results from our re-analysis of published datasets. Re-analysis is a widely accepted method and is certainly used for main figure panels. Our re-analysis of the data from Bestehorn et al. 2025 (10.1016/j.molcel.2025.01.001) confirms that ZFP36L1 interacts with UPF1 and GIGYF1/2 in the RAW 264.7 macrophage cell line, which we consider an important consolidation of our findings. To highlight that this table is a re-analysis of published data, we will include this information (including the reference) below the data, as ‘extracted from Bestehorn et al.’

      11) Fig S1A: Why is there so much labeling in the UltraID only lane without biotin?

      This is a phenomenon also reported by others (Kubitz et al. 2022; 10.1038/s42003-022-03604-5: Figure 5A). UltraID alone is a small protein (19.7 kDa), comparable to TurboID or others (Kubitz et al. 2022; 10.1038/s42003-022-03604-5). If not tethered to a specific compartment, these proximity labeling moieties can diffuse through the cytoplasm, biotinylating any protein they ‘bump’ into. We included this control to show this effect and to substantiate why we use GFP-UltraID as a control to limit such background effects. We will articulate this reasoning more clearly in the Results section.

      12) Fig S1E: Please, explain better. What is WT?

      We thank the reviewer for catching this inconsistency. We will explicitly define “WT” as wild-type primary T cells (non-edited, non-transduced) and clarify how this relates to the other conditions.

      13) Fig S4B: Please, explain the labels on top of the shapes.

      We will update the figure, explaining how the labels above each shape are chosen (e.g. indicating specific clusters, functional categories, or experimental conditions, as appropriate). This should make the reading more intuitive.

      14) Page 3: A time-course of incubation with biotin is lacking in Fig S1B, and thereby it is confusing that the authors direct readers to this figure when an increased to 16h incubation is claimed to be better.

      Short labeling times yielded disappointing results in primary human T cells. Therefore, all first analyses were performed with 16h biotinylation, as depicted in Figure S1B. Only after achieving good results (presented in Figure 1B) did we perform time-course experiments (presented in Figure 4, lowering incubation times to 2h, 3h and 9h). We realize that this is confusing and we will rephrase this point on page 3.

      Reviewer #2 (Significance (Required)): Strengths: A thorough repository of ZFP36L1 interactors in primary human T-cells. A valuable resource for the community. Weaknesses: There is little mechanistic insight on ZFP36L1 function or regulation.

      We would like to highlight that the purpose of our study was to provide a comprehensive interactome of ZFP36L1, and to study the dynamics of these interactions. In addition to known interactors, we identified novel putative interactors of ZFP36L1. We have indeed not followed up on all interactions, which we consider beyond the scope of this manuscript. Rather, we consider our study as a toolbox for the community, that helps in their studies.

      Nevertheless, in Fig 6-7, we show first indications of mechanistic insights on ZFP36L1 interactors, exemplifying how the findings of this resource paper can be used by the community.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The authors have analyzed the interactome of ZFP36L1 in primary human T cells using a biotin-based proximity labeling method. In addition to proteins that are known to interact with ZFP36L1, the authors defined a multitude of novel interactions involved in mRNA decapping, mRNA degradation pathways, translation repressors, stress granule/p-body formation, and other regulatory pathways. Time-lapse proximity labeling revealed that the ZFP36L1 interactome undergoes remodeling during T cell activation. Co-IP for ZFP36L1 executed in the presence/absence of RNA further revealed the interactome and possible regulators of ZFP36L1, including the helicase UPF1. In addition to interacting with ZFP36L1, UPF1 promotes the ZFP36L1 protein expression, seemingly by binding to the ZFP36L1 mRNA transcript, and in some way stabilizing it. This comprehensive interactome map highlights the widespread interactions of ZFP36L1 with proteins of many types, and its potential roles in diverse T cell processes. Although somewhat descriptive, rather than hypothesis-testing, this work represents an important contribution to understanding the potential roles of the ZFP36 family proteins, and sets up many future experiments which could test molecular details.

      We thank the reviewer for these thoughtful points, and for recognizing our paper as an important contribution to the field as a resource that should support future experiments.

      Major points: 1) Can the authors discuss the specificity of the antibody for ZFP36L1 used in the Co-IP experiments? The antibody listed in Appendix A is abcam catalog number ab42473, although the catalog number for this antibody (unlike the others major ones used) is not listed in the Methods section - please add this to the Methods to make it easier for readers to find this detail. Could this antibody also be immunoprecipitating ZFP36 or ZFP36L2? Other antibodies have had cross-reactivity for the different family members. It is also notable that this antibody has been discontinued by the manufacturer (https://www.abcam.com/en-us/products/unavailable/zfp36l1-antibody-ab42473). Have the authors tried the current abcam anti-ZFP36L1 antibody being sold, catalog number ab230507?

      We appreciate the opportunity to clarify this important technical point. We have now added the catalog number (ab42473, Abcam) of the anti-ZFP36L1 antibody used for co-IP to the Methods section, in addition to Appendix A, to facilitate reproducibility. The antibody ab42473 has indeed been discontinued by the manufacturer. We have contacted the manufacturer on multiple occasions with no luck.

      We have evaluated multiple alternative anti-ZFP36L1 antibodies, including the currently available Abcam antibody ab230507. In our hands, these alternatives showed weaker or less specific detection of ZFP36L1 compared to the original ZFP36L1 antibody. Only antibody 1A3 recognized ZFP36L1. We therefore used this antibody for the Co-IP. Importantly, even though the signal is lower than the original antibody we used, the migration patterns observed with ab42473 in our co-IP experiments match the expected molecular weight of ZFP36L1 and do not suggest substantial cross-reactivity with ZFP36 or ZFP36L2, which display distinct sizes (we will add the sizes to the WB in figures). We discuss this point briefly in the revised Methods/Results.

      2) On this point, the authors report interactions between ZFP36L1 and its related proteins ZFP36 and ZFP36L2 in the Co-IP experiment (Supp 5C). Did these proteins interact in the proximity labeling? Ideally this could be discussed in the Discussion section.

      ZFP36 and ZFP36L2 were indeed detected as co-precipitating with ZFP36L1 in the co-IP experiments but were not found as high-confidence interactors in the UltraID proximity labeling datasets. Likewise, in their APEX proximity labeling in RAW macrophage cells, Bestehorn et al. did not find ZFP36 or ZFP36L1 to interact with ZFP36L1. We now explicitly mention this in the Results and discuss it in the Discussion.

      3) Can the authors discuss more fully the limited overlap in identified interactors across the two proximity labeling screens performed in primary T cells (Fig 2C)? Likewise, can the authors comment on the very limited overlap between the screens in T cells and the published ZFP36L1-APEX proximity labelling experiment performed in the HEK293T cell line by Bestehorn et al. (ref 42)? Only 6.8% of proteins found in either T cell screen were found as interactors in this cell line. The authors comment that this may be because "...either expression of certain proteins is cell-type specific, or [because] ZFP36L1 has cell-type specific protein interactions, in addition to its core interactome". While I agree that cell-type specific interactions may be at play, I would think most of the interactors found in the T cell screens are widely expressed proteins necessary for central cell functions.

      First, the apparent overlap percentage depends on depth and filtering. As noted above and now detailed in a new Supplementary table, a core set of decapping, deadenylation, and granule-associated factors is consistently recovered across our T-cell screens and the HEK293T APEX dataset. Beyond this core set, however, overlap is reduced, reflecting several factors: (i) differences in expression levels of many interactors between HEK293T cells and primary T cells; (ii) the activation-dependent nature of ZFP36L1 function in T cells, which cannot be fully mimicked in HEK293T; (iii) different proximity labeling enzymes and fusion constructs (APEX vs UltraID, different tags, expression levels); and (iv) distinct experimental designs and control strategies, which influence statistical filtering and the effective “depth” of each interactome.

      In the revised Discussion and in the new comparative table, we now emphasize that while many of the ZFP36L1 proximity partners identified in T cells are indeed widely expressed, their effective labeling and enrichment are strongly context dependent. We therefore interpret the relatively limited overlap as highlighting both a robust core interactome and substantial context-specific remodeling, rather than as evidence of artifacts in one or the other dataset.
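To illustrate the point that an overlap percentage depends on filtering depth, the following is a minimal sketch using entirely hypothetical data. The protein identifiers, enrichment scores, cutoffs, and the helper names `overlap_percent` and `hits_above` are placeholders introduced for illustration only; they are not taken from either dataset or from the authors' analysis pipeline.

```python
def overlap_percent(query, reference):
    """Percentage of proteins in `query` that also appear in `reference`."""
    query, reference = set(query), set(reference)
    if not query:
        return 0.0
    return 100.0 * len(query & reference) / len(query)


def hits_above(scores, cutoff):
    """Proteins whose (hypothetical) enrichment score meets the cutoff."""
    return {protein for protein, score in scores.items() if score >= cutoff}


# Hypothetical enrichment scores for one screen (placeholder identifiers).
screen_scores = {
    "core_1": 5.2, "core_2": 4.8, "core_3": 4.1,
    "ctx_1": 1.4, "ctx_2": 1.3, "ctx_3": 1.2, "ctx_4": 1.1, "ctx_5": 1.0,
}
# Hypothetical hit list from a second screen in another cell type.
reference_hits = {"core_1", "core_2", "core_3"}

# A strict cutoff keeps only the strongest hits; a lenient one keeps all.
strict = overlap_percent(hits_above(screen_scores, 4.0), reference_hits)
lenient = overlap_percent(hits_above(screen_scores, 1.0), reference_hits)
print(f"strict cutoff: {strict:.1f}% overlap")
print(f"lenient cutoff: {lenient:.1f}% overlap")
```

In this toy case the same shared core is recovered under both cutoffs, yet the reported percentage drops as more context-specific hits pass the filter, so the percentage alone does not distinguish a small core from a poor match.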


      Minor comments: 4) In Figure 3D, the legend states that black circles indicate significantly enriched proteins in biotin samples, while grey circles indicate non-significant enrichment. However, some genes, including DCP1A, DDX6, YBX1, have black circles in the -biotin group and grey in the +biotin group, which creates confusion in interpretation.

      We thank the reviewer for this comment. We had accidentally switched the labeling of biotin and activation, as pointed out by reviewer 2. Correcting the labels also resolves this point.

      5) Did the authors find any interactors whose expression is known to be specific to CD4 or CD8 T cells?

      In our current dataset we did not identify interactors whose presence was clearly restricted to CD4 or CD8 T cells. We agree that differential ZFP36L1 interactomes in defined T-cell subsets represent an interesting avenue for future targeted studies and will outline this in the Discussion.

      Reviewer #3 (Significance (Required)):

      The authors present the first comprehensive analysis of the ZFP36L1 interactome in primary T cells. The use of biotin-based proximity labeling enables detection of physiologically relevant interactions in live cells. This approach revealed many novel interactors.

      Strengths include the overall richness of the dataset, and the hypothesis-provoking experiments that could follow in the future. Limitations include somewhat limited overlap with a published proximity labeling dataset from a different cell line, suggesting that there may be artifacts in one or both datasets.

      The audience for this article would include those interested broadly in RNA binding proteins and those interested in post-transcriptional and translational regulation.

      I have immunology expertise on T cell activation and differentiation and expertise on transcriptional and post-transcriptional regulation of gene expression in T cells.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.

      This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. In this round, the authors provided some clarity, but some questions still remain, and I remain unconvinced by a main assumption that was not addressed.

      Based on the authors' response, if I understand the life history correctly, dispersers either immediately join another group (with 1-the probability of dispersing), or remain floaters until they successfully compete for a breeder spot or die? Is that correct? I honestly cannot decide because this seems implicit in the first response but the response to my second point raises the possibility of not working while floating but can work if they later join a group as a subordinate. If it is the case that floaters can have multiple opportunities to join groups as subordinates (not as breeders; I assume that this is the case for breeding competition), this should be stated, and more details about how. So there is still some clarification to be done, and more to the point, the clarification that happened only happened in the response. The authors should add these details to the main text. Currently, the main text only says vaguely that joining a group after dispersing " is also controlled by the same genetic dispersal predisposition" without saying how.

      In each breeding cycle, individuals have the opportunity to become a breeder, a helper, or a floater. Social role is really just a state, and that state can change in each breeding cycle (see Figure 1). Therefore, floaters may join a group as subordinates at any point in time depending on their dispersal propensity, and subordinates may also disperse from their natal group at any given time. In the “Dominance-dependent dispersal propensities” section in the SI, this dispersal or philopatric tendency varies with dominance rank.

      We have added: “In each breeding cycle” (L415) to clarify this further.

      In response to my query about the reasonableness of the assumption that floaters are in better condition (in the KS treatment) because they don't do any work, the authors have done some additional modeling but I fail to see how that addresses my point. The additional simulations do not touch the feature I was commenting on, and arguably make it stronger (since assuming a positive beta_r -which btw is listed as 0 in Table 1- would make floaters on average even stronger than subordinates). It also again confuses me with regard to the previous point, since it implies that now dispersal is also potentially a lifetime event. Is that true?

      We are not quite sure where the reviewer gets this idea because we have never assumed a competitive advantage of floaters versus helpers. As stated in the previous revision, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5 in Figure 1) if subordinates are engaged in work tasks. However, floaters also have higher mortality rates than group members, which makes them have lower age averages. In addition, helpers have the advantage of always competing for an open breeding position in the group, while floaters do not have this preferential access (in Figure S2 we reduce even further the likelihood of a floater to try to compete for a breeding position).

      Moreover, in the previous revision (section: “Dominance-dependent dispersal propensities” in the SI) we specifically addressed this concern by adding the possibility that individuals, either floaters or subordinate group members, react to their rank or dominance value to decide whether to disperse (if subordinate) or join a group (if floater). Hence, individuals may choose to disperse when low ranked and then remain on the territory they dispersed to as helpers, OR they may remain as helpers in their natal territory as low ranked individuals and then disperse later when they attain a higher dominance value. The new implementation, therefore, allows individuals to choose when to become floaters or helpers depending on their dominance value. This change to the model affects the relative competitiveness between floaters and helpers, which avoids the assumption that either low- or high-quality individuals are the dispersing phenotype and, instead, allows rank-based dispersal as an emergent trait. As shown in Figure S5, this change had no qualitative impact on the results.

      To make this all clearer, we have now added to all of the relevant SI tables a new row with the relative rank of helpers vs floaters. As shown, floaters do not consistently outrank helpers. Rather, which role is most dominant depends on the environment and fitness trade-offs that shape their dispersing and helping decisions.

      Some further clarifications: beta_r is a gene that may evolve either positive or negative values, 0 (no reaction norm of dispersal to dominance rank) is the initial value in the simulations before evolution takes place. Therefore, this value may evolve to positive or negative values depending on evolutionary trade-offs. Also, and as clarified in the previous comment, the decision to disperse or not occurs at each breeding cycle, so becoming a floater, for example, is not a lifetime event unless they evolve a fixed strategy (dispersal = 0 or 1). 
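The per-cycle dispersal decision and the role of beta_r can be sketched as follows. This is an illustration under stated assumptions, not the model's actual code: the logistic form of the reaction norm, the intercept `beta_0`, and the function names `dispersal_probability` and `next_role` are all hypothetical. The only features taken from the text are that the decision is made anew in each breeding cycle, that beta_r = 0 means no dependence on rank, and that joining a group is governed by the same dispersal predisposition.

```python
import math
import random


def dispersal_probability(beta_0, beta_r, rank):
    """Assumed logistic reaction norm mapping dominance rank to a probability.

    beta_0 is a hypothetical baseline; beta_r is the slope on rank.
    """
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_r * rank)))


def next_role(role, beta_0, beta_r, rank, rng):
    """One breeding cycle: the role is a state that is re-decided each cycle.

    Assumption: a subordinate disperses with probability p, and a floater
    joins a group as a subordinate with probability 1 - p.
    """
    if role not in ("subordinate", "floater"):
        return role  # breeders are not handled in this sketch
    p = dispersal_probability(beta_0, beta_r, rank)
    return "floater" if rng.random() < p else "subordinate"


rng = random.Random(0)

# With beta_r = 0 (the initial value), rank has no effect on dispersal.
flat_low = dispersal_probability(0.0, 0.0, rank=1)
flat_high = dispersal_probability(0.0, 0.0, rank=10)

# A positive or negative beta_r makes dispersal rise or fall with rank.
rising = [dispersal_probability(0.0, 0.5, r) for r in (1, 5)]
falling = [dispersal_probability(0.0, -0.5, r) for r in (1, 5)]

# Because the decision recurs every cycle, roles can switch back and forth.
role = "subordinate"
history = []
for _ in range(6):
    role = next_role(role, 0.0, 0.0, rank=3, rng=rng)
    history.append(role)
print(flat_low, flat_high, rising, falling, history)
```

Under this assumed form, becoming a floater is a lifetime event only in the limit where the probability is pinned at 0 or 1, which corresponds to the fixed strategy mentioned above.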

      Meanwhile, the simplest and most convincing robustness check, which I had suggested last round, is not done: simply reduce the increase in the R of the floater by age relative to subordinates. I suspect this will actually change the results. It seems fairly transparent to me that an average floater in the KS scenario will have R about 15-20% higher than the subordinates (given no defense evolves, y_h=0.1 and H_work evolves to be around 5, and the average lifespan for both floaters and subordinates are in the range of 3.7-2.5 roughly, depending on m). That could be a substantial advantage in competition for breeding spots, depending on how that scramble competition actually works. I asked about this function in the last round (how non-linear is it?) but the authors seem to have neglected to answer.

      As we mentioned in the previous comment above, we have now added the relative rank between helpers and floaters to all the relevant SI tables, to provide a better idea of the relative competitiveness of residents versus dispersers for each parameter combination. As seen in Table S1, the competitive advantage of floaters is only marginal in the “Only kin selection” implementation. This advantage only becomes more pronounced when individuals can choose whether to disperse or remain philopatric depending on their rank. In this case, the difference in rank between helpers and floaters is driven by the high levels of dispersal, with only a few newborns (low rank) remaining briefly in the natal territory (Table S6). Instead, the high dispersal rates observed under the “Only kin selection” scenario appear to result from the low incentives to remain in the group when direct fitness benefits are absent, unless indirect fitness benefits are substantially increased. This effect is reinforced by the need for task partitioning to occur in an all-or-nothing manner (see the new implementation added to the “Kin selection and the evolution of division of labor” in the Supplementary materials; more details in following comments).

      In addition, we specifically chose not to impose this constraint of forcing floaters to be lower rank than helpers because doing so would require strong assumptions on how the floaters' rank is determined. These assumptions are unlikely to be universally valid across natural populations (and probably not commonly met in most species) and could vary considerably among species. Therefore, it would add complexity to the model while reducing generalizability.

      As stated in the previous revision, no scramble competition takes place. It belonged to an implementation, not included in the final version of the manuscript, in which age did not influence dominance. Results were equivalent and we decided to remove it for simplicity prior to the original submission, as the model is already very complex in its current stage; we simply forgot to remove it from Table 1, something we explained in the previous round of revisions.

      More generally, I find that the assumption (and it is an assumption) floaters are better off than subordinates in a territory to be still questionable. There is no attempt to justify this with any data, and any data I can find points the other way (though typically they compare breeders and floaters, e.g.: https://bioone.org/journals/ardeola/volume-63/issue-1/arla.63.1.2016.rp3/The-Unknown-Life-of-Floaters--The-Hidden-Face-of/10.13157/arla.63.1.2016.rp3.full concludes "the current preliminary consensus is that floaters are 'making the best of a bad job'."). I think if the authors really want to assume that floaters have higher dominance than subordinates, they should justify it. This is driving at least one and possibly most of the key results, since it affects the reproductive value of subordinates (and therefore the costs of helping).

      We explicitly addressed this in the previous revision in a long response about resource holding potential (RHP). Once again, we do NOT assume that dispersers are at a competitive advantage to anyone else. Floaters lack access to a territory unless they either disperse into an established group or colonize an unoccupied territory. Therefore, floaters endure higher mortalities due to the lack of access to territories and group living benefits in the model, and are not always able to try to compete for a breeding position.

      The literature reports mixed evidence regarding the quality of dispersing individuals, with some studies identifying them as low-quality and others as high-quality, attributing this to them experiencing fewer constraints when dispersing than their counterparts (e.g. Stiver et al. 2007 Molecular Ecology; Torrents‐Ticó et al. 2018 Journal of Zoology). Additionally, dispersal can provide end-of-queue individuals in their natal group an opportunity to join a queue elsewhere that offers better prospects, outcompeting current group members (Nelson‐Flower et al. 2018 Journal of Animal Ecology). Moreover, in our model floaters do not consistently have lower dominance values or ranks than helpers, and dominance value is often only marginally different.

      In short, we previously addressed the concern regarding the relative competitiveness of floaters compared to subordinate group members. To further clarify this point here, we have now included additional data on relative rank in all of the relevant SI tables. We hope that these additions will help alleviate any remaining concerns on this matter.

      Regarding division of labor, I think I was not clear so will try again. The authors assume that the group reproduction is 1+H_total/(1+H_total), where H_total is the sum of all the defense and work help, but with the proviso that if one of the totals is higher than "H_max", the average of the two totals (plus k_m, but that's set to a low value, so we can ignore it), it is replaced by that. That means, for example, if total "work" help is 10 and "defense" help is 0, total help is given by 5 (well, 5.1 but will ignore k_m). That's what I meant by "marginal benefit of help is only reduced by a half" last round, since in this scenario, adding 1 to work help would make total help go to 5.5 vs. adding 1 to defense help which would make it go to 6. That is a pretty weak form of modeling "both types of tasks are necessary to successfully produce offspring" as the newly added passage says (which I agree with), since if you were getting no defense but a lot of food, adding more food should plausibly have no effect on your production whatsoever (not just half of adding a little defense). This probably explains why often the "division of labor" condition isn't that different than the no DoL condition.

      The model incorporates division of labor as the optimal strategy for maximizing breeder productivity, while penalizing helping efforts that are limited to either work or defense alone. Because the model does not intend to force the evolution of help as an obligatory trait (breeders may still reproduce in the absence of help; k<sub>0</sub> ≠ 0), we assume that the performance of both types of task by the helpers is a non-obligatory trait that complements parental care.

      That said, we recognize the reviewer’s concern that the selective forces modeled for division of labor might not be sufficient in the current simulations. To address this, we have now introduced a new implementation, as discussed in the “Kin selection and the evolution of division of labor” section in the SI. In this implementation, division of labor becomes obligatory for breeders to gain a productivity boost from the help of subordinate group members. The new implementation tests whether division of labor can arise solely from kin selection benefits. Under these premises, philopatry and division of labor do emerge through kin selection, but only when there is a tenfold increase in productivity per unit of help compared to the default implementation. Thus, even if such increases are biologically plausible, they are more likely to reflect the magnitudes characteristic of eusocial insects rather than of cooperatively breeding vertebrates (the primary focus of this model). Such extreme requirements for productivity gains and need for coordination further suggest that group augmentation, and not kin selection, is probably the primary driving force particularly in harsh environments. This is now discussed in L210-213.

      Reviewer #2 (Public review):

      Summary:

      This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. The model considers a population subdivided in groups, each group has a single asexually-reproducing breeder, other group members (subordinates) can perform two types of tasks called "work" or "defense", individuals have different ages, individuals can disperse between groups, each individual has a dominance rank that increases with age, and upon death of the breeder a new breeder is chosen among group members depending on their dominance. "Workers" pay a reproduction cost by having their dominance decreased, and "defenders" pay a survival cost. Every group member receives a survival benefit with increasing group size. There are 6 genetic traits, each controlled by a single locus, that control propensities to help and disperse, and how task choice and dispersal relate to dominance. To study the effect of group augmentation without kin selection, the authors cross-foster individuals to eliminate relatedness. The paper allows for the evolution of the 6 genetic traits under some different parameter values to study the conditions under which division of labour evolves, defined as the occurrence of different subordinates performing "work" and "defense" tasks. The authors envision the model as one of vertebrate division of labor.

      The main conclusion of the paper is that group augmentation is the primary factor causing the evolution of vertebrate division of labor, rather than kin selection. This conclusion is drawn because, for the parameter values considered, when the benefit of group augmentation is set to zero, no division of labor evolves and all subordinates perform "work" tasks but no "defense" tasks.

      Strengths:

      The model incorporates various biologically realistic details, including the possibility to evolve age polyethism where individuals switch from "work" to "defence" tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.

      Weaknesses:

      The model and its analysis are limited, which makes the results insufficient to reach the main conclusion that group augmentation and not kin selection is the primary cause of the evolution of vertebrate division of labor. There are several reasons.

      First, the model strongly restricts the possibility that kin selection is relevant. The two tasks considered essentially differ only by whether they are costly for reproduction or survival. "Work" tasks are those costly for reproduction and "defense" tasks are those costly for survival. The two tasks provide the same benefits for reproduction (eqs. 4, 5) and survival (through group augmentation, eq. 3.1). So, whether one, the other, or both tasks evolve presumably only depends on which task is less costly, not really on which benefits it provides. As the two tasks give the same benefits, there is no possibility that the two tasks act synergistically, where performing one task increases a benefit (e.g., increasing someone's survival) that is going to be compounded by someone else performing the other task (e.g., increasing that someone's reproduction). So, there is very little scope for kin selection to cause the evolution of labour in this model. Note synergy between tasks is not something unusual in division of labour models, but is in fact a basic element in them, so excluding it from the start in the model and then making general claims about division of labour is unwarranted. I made this same point in my first review, although phrased differently, but it was left unaddressed.

      The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers, in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care), as we stated in the previous review. Therefore, in this context, helpers may only obtain fitness benefits directly or indirectly by increasing the productivity of the breeders. This benefit is maximized when division of labor occurs between group members as there is a higher return for the least amount of effort per capita. Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. This is not to suggest that the model does not favor synergy, as engaging in two distinct tasks enhances the breeders' productivity more than if group members were to perform only one type of alloparental care task. We have expanded on the need for division of labor by making the performance of each type of task a requirement to boost the breeders' productivity; see more details in a following comment.

      Second, the parameter space is very little explored. This is generally an issue when trying to make general claims from an individual-based model where only a very narrow parameter region has been explored of a necessarily particular model. However, in this paper, the issue is more evident. As in this model the two tasks ultimately only differ by their costs, the parameter values specifying their costs should be varied to determine their effects. Instead, the model sets a very low survival cost for work (yh=0.1) and a very high survival cost for defense (xh=3), the latter of which can be compensated by the benefit of group augmentation (xn=3). Some very limited variation of xh and xn is explored, always for very high values, effectively making defense unevolvable except if there is group augmentation. Hence, as I stated in my previous review, a more extensive parameter exploration addressing this should be included, but this has not been done. Consequently, the main conclusion that "division of labor" needs group augmentation is essentially enforced by the limited parameter exploration, in addition to the first reason above.

      We systematically explored the parameter landscape and report in the body of the paper only those ranges that lead to changes in the reaction norms of interest (other ranges are explored in the SI). When looking into the relative magnitude of the costs of work and defense tasks, it is important to note that cost values are not directly comparable because they affect different traits. However, the ranges of values capture changes in the reaction norms that lead to rank-dependent task specialization.

      To illustrate this more clearly, we have added a new section in the SI (Variation in the cost of work tasks instead of defense tasks section) showing variation in y<sub>h</sub>, which highlights how individuals trade off the relative costs of different tasks. As shown, the results remain consistent with everything we showed previously: a higher cost of work (high y<sub>h</sub>) shifts investment toward defense tasks, while a higher cost of defense (high x<sub>h</sub>) shifts investment toward work tasks.

      Importantly, additional parameter values were already included in the SI of the previous revision, specifically to favor the evolution of division of labor under only kin selection. Basically, division of labor under only kin selection does happen, but only under conditions that are very restrictive, as discussed in the “Kin selection and the evolution of division of labor” section in the SI. We have tried to make this point clearer now (see comments to previous reviewer above, and to this reviewer right below).

      Third, what is called "division of labor" here is an overinterpretation. When the two tasks evolve, what exists in the model is some individuals that do reproduction-costly tasks (so-called "work") and survival-costly tasks (so-called "defense"). However, there are really no two tasks that are being completed, in the sense that completing both tasks (e.g., work and defense) is not necessary to achieve a goal (e.g., reproduction). In this model there is only one task (reproduction, equation 4,5) to which both "tasks" contribute equally and so one task doesn't need to be completed if the other task compensates for it. So, this model does not actually consider division of labor.

      Although it is true that we did not make the evolution of help obligatory and, therefore, did not impose division of labor by definition, the assumptions of the model nonetheless create conditions that favor the emergence of division of labor. This is evident when comparing the equilibria between scenarios where division of labor was favored versus not favored (Figure 2 triangles vs circles).

      That said, we acknowledge the reviewer’s concern that the selective forces modeled in our simulations may not, on their own, be sufficient to drive the evolution of division of labor under only kin selection. Therefore, we have now added a section where we restrict the evolution of help to instances in which division of labor is necessary to have an impact on the dominant breeder's productivity. Under this scenario, we do find division of labor (as well as philopatry) evolving under only kin selection. However, this behavior only evolves when help greatly increases the breeders’ productivity (by a factor of 10 relative to what is needed for the evolution of division of labor under group augmentation). Therefore, group augmentation still appears to be the primary driver of division of labor, while kin selection facilitates it and may, under certain restrictive circumstances, also promote division of labor independently (discussed in L210-213).

      Reviewer #1 (Recommendations for the authors):

      I really think you should do the simulations where floaters do not come out ahead by floating. That will likely change the result, but if it doesn't, you will have a more robust finding. If it does, then you will have understood the problem better.

      As we outlined in the previous round of revisions, implementing this change would be challenging without substantially increasing model complexity and reducing its general applicability, as it would require strong assumptions that could heavily influence dispersal decisions. For instance, by how much should helpers outcompete floaters? Would a floater be less competitive than a helper regardless of age, or only if age is equal? If competitiveness depends on equal age, what is the impact of performing work tasks given that workers always outcompete immigrants? Conversely, if floaters are less competitive regardless of age, is it realistic that a young individual would outcompete all immigrants? If a disperser finds a group immediately after dispersal versus floating for a while, is the dominance value reduced less (as would happen to individuals doing prospections before dispersal)? 

      Clearly it is not as simple as the referee suggests because there are many scenarios that would need to be considered and many assumptions made in doing this. As we explained to the points above, we think our treatment of floaters is consistent with the definition of floaters in the literature, and our model takes a general approach without making too many assumptions.

      Reviewer #2 (Recommendations for the authors):

      The paper's presentation is still unclear. A few instances include the following. It is unclear what is plotted on the vertical axes of Figure 2. The quantity is T, but T is a function of age t, so T is presumably being plotted at a specific t, and which one is not stated.

      The values graphed are the averages of the phenotypically expressed tasks, not the reaction norms per se. We have now rewritten the axis label to “Expressed task allocation T (0 = work, 1 = defense)” to increase clarity across the manuscript.

      The section titled "The need for division of labor" in the methods is still very unclear.

      We have rephrased this whole section to improve clarity.

    1. Abstract

      Identifying differentially expressed genes associated with genetic pathologies is crucial to understanding the biological differences between healthy and diseased states and identifying potential biomarkers and therapeutic targets. However, gene expression profiles are controlled by various mechanisms including epigenomic changes, such as DNA methylation, histone modifications, and interfering microRNA silencing.

      We developed a novel Shiny application for transcriptomic and epigenomic change identification and correlation using a combination of Bioconductor and CRAN packages.

      The developed package, named EMImR, is a user-friendly tool with an easy-to-use graphical user interface to identify differentially expressed genes, differentially methylated genes, and differentially expressed interfering miRNA. In addition, it identifies the correlation between transcriptomic and epigenomic modifications and performs the ontology analysis of genes of interest.

      The developed tool could be used to study the regulatory effects of epigenetic factors. The application is publicly available in the GitHub repository (https://github.com/omicscodeathon/emimr).

      This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.168), and the reviews are published under the same license.

      Reviewer 1. Haikuo Li

      Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? No. Should be made more clear.

      Comments: The authors developed EMImR as an R toolkit and open-sourced software for analysis of bulk RNA-seq as well as epigenomic sequencing data including DNA methylation seq and non-coding RNA profiling. This work is very interesting and should be of interest to people interested in transcriptomic and epigenomic data analysis but without computational background. I have two major comments: 1. Results presented in this manuscript were only from microarray datasets and are kind of “old” data. Although these data types and sequencing platforms are still very valuable, I don’t think they are widely used as of today, and therefore, it may be less compelling to the audience. It is suggested to validate EMImR using additional more recently published datasets. 2. The authors studied bulk transcriptomic and epigenomic sequencing data. In fact, single-cell and spatially resolved profiling of these modalities are becoming the mainstream of biomedical research since those methods offer much better resolution and biological insights. The authors are encouraged to discuss some key references of this field (for example, PMIDs: 34062119 and 38513647 for single-cell multiomics; PMID: 40119005 for spatial multiomics sequencing), potentially as the future direction of package development. Re-review: The authors have answered my questions and added new content in the Discussion section as suggested.

      Reviewer 2. Weiming He

      Dear Editor-in-Chief, The EMImR developed by the author is a Shiny application designed for the identification of transcriptomic and epigenomic changes and data association. This program is mainly targeted at Windows UI users who do not possess extensive computational skills. Its core function is to identify the intersections between genetic and epigenetic modifications.

      Review Recommendation I recommend that after making appropriate revisions to the current “Minor Revision”, the article can be accepted. However, the author needs to address the following issues.

      Major Issue The article does not provide specific information on the resource consumption (memory and time) of the program. This is crucial for new users. Although we assume that the resource consumption is minimal, users need to know the machine configuration required to run the program. Therefore, I suggest adding two columns for “Time” and “Memory” in Table 1.

      Minor Issues 1. GitHub Page The Table of Contents on the GitHub page provides a Demonstration Video. However, due to restricted access to YouTube in some regions, it is recommended to also upload a manual in PDF format named “EMImR_manual.pdf” on GitHub. In step 4 of the Installation Guide, it states that “All dependencies will be installed automaticly”. It is advisable to add a step: if the installation fails, prompt the user about the specific error location and guide the user to install the dependent packages manually first to ensure successful installation. Currently, the command “source(‘Dependencies_emimr.R’)” does not return any error messages, which is extremely inconvenient for novice users. The author can provide the maintainer's email address so that users can seek timely solutions when encountering problems.

      2. R Version The author recommends using R 4.2.1 (2022), which was released three years ago. The current latest version is R 4.5.1. It is suggested that the author test the program with the latest version to ensure its adaptability to future developments.

      3. Flowchart Suggestion It is recommended to add a flowchart to illustrate the sequential relationships among packages such as DESeq2 for differential analysis, clusterProfiler for clustering, enrichplot for plotting, and miRNA-related packages (this is optional).

      4. Function Addition Currently, the program seems to lack a button for saving PDFs, as well as functions for batch uploading, saving sessions, and one-click exporting of PDF/PNG files. It is recommended to add the “shinysaver” and “downloadHandler” functions to fulfill these requirements.

      5. Personalized Features and Upgrade Plan To attract more users, more personalized features should be added. The author can mention the future upgrade plan in the discussion section. For example, currently, DESeq2 is used for differential analysis, and in future upgrades, more methods such as PoissonDis, NOISeq, and EBSeq could be provided for users to choose from.

      6. Text Polishing Suggestions 6.1 Unify the usage of “down - regulated” and “downregulated”, preferably using the latter. 6.2 “R - studio version” ---》 “RStudio” 6.3 Lumian, ---》 Lumian 6.4 no login wall ---》 does not require user registration 6.5 Rewrite “genes were simultaneously differentially expressed and methylated” as “genes that were both differentially expressed and differentially methylated”. 6.6 Ensure that Latin names of species are in italics. 6.7 Make corresponding modifications to other sentences to improve the accuracy and professionalism of the language in the article.

      The above are my detailed review comments on this article. I hope they can provide a reference for your decision - making.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC -2025-03175

      Corresponding author(s): Gernot Längst


      1. General Statements [optional]


      2. Point-by-point description of the revisions


      We thank the reviewers for their efforts and detailed evaluation of our manuscript. We think that the comments of the reviewers allowed us to significantly improve the manuscript.

      With best regards

      The authors of the manuscript

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary: Holzinger et al. present a new automated pipeline, nucDetective, designed to provide accurate nucleosome positioning, fuzziness, and regularity from MNase-seq data. The pipeline is structured around two main workflows-Profiler and Inspector-and can also be applied to time-series datasets. To demonstrate its utility, the authors re-analyzed a Plasmodium falciparum MNase-seq time-series dataset (Kensche et al., 2016), aiming to show that nucDetective can reliably characterize nucleosomes in challenging AT-rich genomes. By integrating additional datasets (ATAC-seq, RNA-seq, ChIP-seq), they argue that the nucleosome positioning results from their pipeline have biological relevance.

      Major Comments:

      Despite being a useful pipeline, the authors draw conclusions directly from the pipeline's output without integrating necessary quality controls. Some claims either contradict existing literature or rely on misinterpretation or insufficient statistical support. In some instances, the pipeline output does not align with known aspects of Plasmodium biology. I outline below the key concerns and suggested improvements to strengthen the manuscript and validate the pipeline:

      Clarification of +1 Nucleosome Positioning in P. falciparum The authors should acknowledge that +1 nucleosomes have been previously reported in P. falciparum. For example, Kensche et al. (2016) used MNase-seq to map ~2,278 TSSs (based on enriched 5′-end RNA data) and found that the +1 nucleosome is positioned directly over the TSS in most genes:

      "Analysis of 2278 start sites uncovered positioning of a +1 nucleosome right over the TSS in almost all analysed regions" (Figure 3A).

      They also described a nucleosome-depleted region (NDR) upstream of the TSS, which varies in size, while the +1 nucleosome frequently overlaps the TSS. The authors should nuance their claims accordingly. Nevertheless, I do agree that the +1 positioning in P. falciparum may be fuzzier as compared to yeast or mammals. Moreover, the correlation between +1 nucleosome occupancy and gene expression is often weak and that several genes show similar nucleosome profiles regardless of expression level. This raises my question: did the authors observe any of these patterns in their new data?

      We appreciate the reviewer’s insightful comment and agree that +1 nucleosomes and nucleosome depleted promoter regions have been previously reported in P. falciparum, notably by the Bartfai and Le Roch groups, including Kensche et al. (PMID: 26578577). Our study advances this understanding by providing, for the first time, a comprehensive view of the entirety of a canonical eukaryotic promoter architecture in P. falciparum—encompassing the NDR, the well-positioned +1 nucleosome, and the downstream phased nucleosome array. This downstream nucleosome array structure has not been characterized before, as prior studies noted a “lack of downstream nucleosomal arrays” (PMID: 26578577) or “relatively random” nucleosome organization within gene bodies (PMID: 24885191). We have revised the manuscript to more clearly acknowledge previous work and highlight our contributions. The changes we applied in the manuscript are highlighted in yellow and shown as well below.

      In the Abstract L26-L230: Contrary to the current view of irregular chromatin, we demonstrate for the first time regular phased nucleosome arrays downstream of TSSs, which, together with the established +1 nucleosome and upstream nucleosome-depleted region, reveal a complete canonical eukaryotic promoter architecture in Pf.

      Introduction L156-L159: For example, we identify a phased nucleosome array downstream of the TSS. Together with a well-positioned +1 nucleosome and an upstream nucleosome-free region. These findings support a promoter architecture in Pf that resembles classical eukaryotic promoters (Bunnik et al. 2014, Kensche et al. 2016).

      Results L181-L183: These new Pf nucleosome maps reveal a nucleosome organisation at transcription start sites (TSS) reminiscent of the general eukaryotic chromatin structure, featuring a reported well-positioned +1 nucleosome, an upstream nucleosome-free region (NFR, Bunnik et al. 2014, Kensche et al. 2016), and shown for the first time in Pf, a phased nucleosome array downstream of the TSS.

      Discussion L414-L419: Previous analyses of Pf chromatin have identified +1 nucleosomes and NFRs (Bunnik et al 2014, Kensche et al. 2016). Here we extend this understanding by demonstrating phased nucleosome array structures throughout the genome. This finding provides evidence for a spatial regulation of nucleosome positioning in Pf, challenging the notion that nucleosome positioning is relatively random in gene bodies (Bunnik et al. 2014, Kensche et al. 2016). Consequently our results contribute to the understanding that Pf exhibits a typical eukaryotic chromatin structure, including well-defined nucleosome positioning at the TSS and regularly spaced nucleosome arrays (Schones et al. 2008; Yuan et al. 2005).

      Regarding the reviewer’s question on +1 nucleosome dynamics, our data agree with the reviewer and other studies (e.g. PMID: 31694866) that the +1 nucleosome position is robust and does not correlate with gene expression strength. In the manuscript we show that dynamic nucleosomes are preferentially detected at the –1 nucleosome position (Figure 2C). In line with that, we show that the +1 nucleosome position does not markedly change during transcription initiation of a subset of late transcribed genes (Figure 5A). However, we observe an opening of the NDR and, within the gene body, increased fuzziness and decreased nucleosome array regularity (Figure S4A). To illustrate the relationship between the +1 nucleosome positioning and expression strength, we have included a heatmap showing nucleosome occupancy at the TSS, ordered according to expression strength (NEW Figure S4C):

      We included a sentence describing the relationship of +1 nucleosome position with gene expression in L257-L258: Furthermore, the +1 nucleosome positioning is unaffected by the strength of gene expression (Figure S2C).

      Lack of Quality Control in the Pipeline

      The authors claim (lines 152-153) that QC is performed at every stage, but this is not supported by the implementation. On the GitHub page (GitHub - uschwartz/nucDetective), QC steps are only marked at the Profiler stage using standard tools (FastQC, MultiQC). The Inspector stage, which is crucial for validating nucleosome detection, lacks QC entirely. The authors should implement additional steps to assess the quality of nucleosome calls. For example, how are false positives managed? ROC curves should be used to evaluate true positive vs. false positive rates when defining dynamic nucleosomes. How are sequencing biases addressed?

      The workflow overview chart on GitHub was not properly color coded. Therefore, we changed the graphics and highlighted the QC steps in the overview charts accordingly:

      Based on our long-standing expertise in analysing MNase-seq data (PMID: 38959309, PMID: 37641864, PMID: 30496478, PMID: 25608606), the best quality metrics for assessing the performance of the challenging MNase experiment are the fragment size distributions, which reveal the typical nucleosomal DNA lengths, and the TSS plots, which show a positioned +1 nucleosome and regularly phased nucleosome arrays downstream of it. Additionally, visual inspection of the nucleosome profiles in a genome browser is advisable. We make those quality metrics easily available in the nucDetective Profiler workflow (insert size histogram, TSS plot, and nucleosome profile bigwig files). Furthermore, the PC and correlation analysis based on nucleosome occupancy in the Inspector workflow allows evaluation of replicate reproducibility or of the integrity of time series data, as shown for the data evaluated in this manuscript.

      The inspector workflow uses the well-established DANPOS toolkit to call nucleosome positions. Based on our experience, this step is particularly robust and well-established in the DANPOS toolkit (PMID: 23193179), so there is no need to reinvent it. Nevertheless, appropriate pre-processing of the data as done in the nucDetective pipeline is crucial to obtain highly resolved nucleosome positions. Using the final nucleosome profiles (bigwig) and the nucleosome reference positions (bed) as output of the Inspector workflow allows visual inspection of the called nucleosomes in a genome viewer. Furthermore, to avoid using false positive nucleosome positions for dynamic nucleosome analysis, we take only the 20% best positioned nucleosomes of each sample, as determined by the fuzziness score.
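The selection of the 20% best-positioned nucleosomes per sample can be illustrated with a minimal sketch. It assumes that a lower fuzziness score means a better-positioned nucleosome; the record layout (`chrom`, `center`, `fuzziness`), the function name, and the rounding rule are hypothetical choices made for illustration and are not taken from the nucDetective code.

```python
def select_best_positioned(nucleosomes, fraction=0.2):
    """Keep the given fraction of nucleosome calls with the lowest fuzziness.

    Assumption: lower fuzziness = better positioned. The dict keys are
    hypothetical and do not mirror the actual pipeline output format.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    if not nucleosomes:
        return []
    # Rank calls from least fuzzy (best positioned) to most fuzzy.
    ranked = sorted(nucleosomes, key=lambda nuc: nuc["fuzziness"])
    # Keep at least one call so small inputs do not yield an empty set.
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    # Made-up example calls for a single sample.
    calls = [
        {"chrom": "chr1", "center": 1200, "fuzziness": 38.5},
        {"chrom": "chr1", "center": 1390, "fuzziness": 52.1},
        {"chrom": "chr1", "center": 1575, "fuzziness": 35.0},
        {"chrom": "chr1", "center": 1760, "fuzziness": 61.3},
        {"chrom": "chr1", "center": 1950, "fuzziness": 47.8},
    ]
    for nuc in select_best_positioned(calls):
        print(nuc)
```

Applying the filter per sample rather than on pooled calls keeps one noisy timepoint from dominating the reference set, which matches the "of each sample" wording above.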

      We understand the value of a gold standard of dynamic nucleosomes to test performance using ROC curves. However, we are not aware that such a gold standard exists in the nucleosome analysis field, especially not when using multi-sample settings, such as time series data. One alternative would be to use simulated data; however, this has several limitations:

      • Lack of biological complexity: Simulated data often fails to capture the full complexity of biological systems, including the heterogeneity, variability, and subtle dependencies present in real-world data. Simplifications and omissions in simulation models can result in test datasets that are more tractable but less realistic, causing software to appear robust or accurate under idealized conditions, while underperforming on actual experimental data.
      • Risks of Overfitting: Software may be tuned to perform well on simulated datasets, leading to overfitting and falsely inflated performance metrics. This undermines the predictive or diagnostic value of the results for real biological data.
      • Poor Model Fidelity and Hidden Assumptions: The authenticity of simulated data is bounded by the fidelity of the underlying models. If those models are inaccurate or make untested assumptions, the generated data may not reflect real experimental or clinical scenarios. This can mask software shortcomings or bias validation toward specific, perhaps irrelevant, scenarios.

      Therefore, we decided to validate the performance of the pipeline in the biological context of the analyzed data:

      • PCA analysis of the individual nucleosome features shows a cyclic structure as expected for the IDC (Fig. 1D-G).

      • Nucleosome occupancy changes anti-correlate with chromatin accessibility (Fig. 3B) as expected.
      • Dynamic nucleosome features correlate with expression changes (Fig. 5C).

      We are aware that MNase-seq experiments might have sequence bias caused by the enzyme's endonuclease sequence preference (PMID: 30496478). However, the main aim of the nucDetective pipeline is to identify dynamic nucleosome features genome wide. Therefore, we are comparing the nucleosome features across multiple samples to find the positions in the genome with the highest variability. Comparisons are performed between the same nucleosome positions at the same genomic sites across multiple conditions, so the sequence context is constant and does not confound the analysis. This is like the differential expression analysis of RNA-seq data, where the gene counts are not normalized by gene length. Introducing a sequence normalization step might distort and bias the results of dynamic nucleosomes.

      We included a paragraph describing the limitations to the discussion (L447-457):

      Depending on the degree of MNase digestion, preferentially nucleosomes from GC rich regions are revealed in MNase-seq experiments (Schwartz et al. 2019). However, no sequence or gDNA normalisation step was included in the nucDetective pipeline. To identify dynamic nucleosomes, comparisons are performed between the same nucleosome positions at the same genomic sites across multiple samples. Hence, the sequence context is constant and does not confound the analysis. Introducing a sequence normalization step might even distort and bias the results. Nevertheless, it is highly advisable to use low MNase concentrations in chromatin digestions to reduce the sequence bias in nucleosome extractions. This turned out to be a crucial condition to obtain a homogenous nucleosome distribution in the AT-rich intergenic regions of eukaryotic genomes and especially in the AT-rich genome of Pf (Schwartz et al. 2019, Kensche et al. 2016).

      Use of Mono-nucleosomes Only

      The authors re-analyze the Kensche et al. (2016) dataset using only mono-nucleosomes and claim improved nucleosome profiles, including identification of tandem arrays previously unreported in P. falciparum. Two key issues arise: 1. Is the apparent improvement due simply to focusing on mono-nucleosomes (as implied in lines 342-346)?

      The default setting in nucDetective is to use fragment sizes of 140 – 200 bp, which corresponds to the main mono-nucleosome fraction in standard MNase-seq experiments. However, the correct selection of fragment sizes may vary depending on the organism and the variations in MNase-seq protocols. Therefore, the pipeline offers the option of changing the cutoff parameter (--minLen; --maxLen), accordingly. Kensche et al thoroughly tested and established the best parameters for the data set. We agree with their selected parameters and used the same cutoffs (75-175 bp) in this manuscript. For this particular data set, the fragment size selection is not the reason why we obtain a better resolution. MNase-seq analysis is a multistep process which is optimized in the nucDetective pipeline. Differences in the analysis to Kensche et al are at the pre-processing stage and alignment step:

      Kensche et al. : “Paired-end reads were clipped to 72 bp and all data was mapped with BWA sample (Version 0.6.2-r126)”

      nucDetective:

      • Trimming using TrimGalore --paired -q 10 --stringency 2
      • Mapping using bowtie2 --very-sensitive --dovetail --no-discordant
      • MAPQ >= 20 filtering of aligned read-pairs (samtools).

      The manuscript text L379 was changed to:

      This is achieved using MNase-seq optimized alignment settings, and proper selection of the fragment sizes corresponding to mono-nucleosomal DNA to obtain high resolution nucleosome profiles.
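The three nucDetective pre-processing steps listed above can be sketched as command lines assembled in Python. The flags for Trim Galore and Bowtie 2 are the ones quoted in the reply. The `samtools view -b -q 20` call is one common way to apply a MAPQ >= 20 filter and is an assumption, as are the function name and all file names; the actual pipeline may wire these steps differently.

```python
import shlex


def build_preprocessing_commands(r1, r2, trimmed_r1, trimmed_r2,
                                 index_prefix, sam_out, bam_out):
    """Return the trimming, alignment, and filtering commands as argument lists.

    All paths are supplied by the caller; nothing is executed here.
    """
    # Adapter and quality trimming of the paired-end reads.
    trim = ["trim_galore", "--paired", "-q", "10", "--stringency", "2", r1, r2]
    # Paired-end alignment with the settings quoted in the reply.
    align = ["bowtie2", "--very-sensitive", "--dovetail", "--no-discordant",
             "-x", index_prefix, "-1", trimmed_r1, "-2", trimmed_r2,
             "-S", sam_out]
    # Assumed filter: drop alignments with mapping quality below 20.
    mapq_filter = ["samtools", "view", "-b", "-q", "20", "-o", bam_out, sam_out]
    return [trim, align, mapq_filter]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    commands = build_preprocessing_commands(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "sample_R1.trimmed.fq.gz", "sample_R2.trimmed.fq.gz",
        "pf_index", "sample.sam", "sample.mapq20.bam",
    )
    for command in commands:
        print(shlex.join(command))
```

Building the commands as argument lists keeps the sketch inspectable without requiring the tools to be installed; each list could be passed to `subprocess.run` once the real paths are known.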

      How does the pipeline perform with di- or tri-nucleosomes, which are also biologically relevant (Kensche et al., 2016 and others)? Furthermore, the limitation to mono-nucleosomes is only mentioned in the methods, not in the results or discussion, which could mislead readers.

      The pipeline is optimized for mono-nucleosome analysis. However, the cutoffs for fragment size selection can be adjusted to analyse other fragment populations in MNase-seq data (--minLen; --maxLen). For example, we know from previous studies that the settings in the pipeline can also be used for sub-nucleosome analysis (PMID: 38959309). We have not explicitly tested di- or tri-nucleosome analysis. However, in a previous study (PMID: 30496478) we observed that the inherent MNase sequence bias is more pronounced in di-nucleosomes, which are preferentially isolated from GC-rich regions. This is in line with the depletion of di-nucleosomes in AT-rich intergenic regions in Pf, as was already described by Kensche et al.
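The effect of the fragment size cutoffs can be shown with a small sketch. The defaults of 140 and 200 bp and the alternative 75 to 175 bp window come from the reply. The function name, the inclusive bounds, and the BED-style (chrom, start, end) tuples are assumptions made for illustration and do not reproduce how --minLen and --maxLen are implemented in the pipeline.

```python
def select_fragments(fragments, min_len=140, max_len=200):
    """Keep fragments whose length lies within [min_len, max_len].

    Fragments are (chrom, start, end) tuples with half-open coordinates,
    so the fragment length is end - start (an assumed convention).
    """
    selected = []
    for chrom, start, end in fragments:
        length = end - start
        if min_len <= length <= max_len:
            selected.append((chrom, start, end))
    return selected


if __name__ == "__main__":
    # Made-up fragments with lengths 147, 90, 320, and 160 bp.
    fragments = [
        ("chr1", 100, 247),
        ("chr1", 500, 590),
        ("chr1", 1000, 1320),
        ("chr1", 2000, 2160),
    ]
    # Default mono-nucleosome window (140 to 200 bp).
    print(select_fragments(fragments))
    # Window used for the Kensche et al. data set (75 to 175 bp).
    print(select_fragments(fragments, min_len=75, max_len=175))
```

Widening or shifting the window changes which fragment population enters the analysis, which is why the cutoffs are exposed as parameters rather than fixed.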

      Changes to the manuscript text: We included a paragraph describing the limitations to the discussion (L428-434):

      The nucDetective pipeline has been optimized for the analysis of mono-nucleosomes. However, the selection of fragment sizes can be adjusted manually, enabling the pipeline to be used for other nucleosome categories. The pipeline is suitable to map and annotate sub-nucleosomal particles (

      Reference Nucleosome Numbers

      The authors identify 49,999 reference nucleosome positions. How does this compare to previous analyses of similar datasets? This should be explicitly addressed.

      We thank the reviewer for this suggestion. In order to put our results in perspective, it is important to distinguish between reference nucleosome positions (what we reported in the manuscript) and all detectable nucleosomes. The reference positions are our attempt to build a set of nucleosome positions with strong evidence, allowing confident further analysis across timepoints. The selection of a well positioned subset of nucleosomes for downstream analysis has been done previously (PMID: 26578577) and the merging algorithm we used across timepoints is also used by DANPOS to decide if a MNase-Seq peak is a new nucleosome position or belongs to an existing position (PMID: 23193179).

      To be able to address the reviewer suggestion we prepared and added a table to the supplementary data, including the total number of all nucleosomes detected by our pipeline at each timepoint. We adjusted the results to the following (L223-226):

      “The pipeline identified a total of 127370 ± 1151 (mean ± SD) nucleosomes at each timepoint (Supplementary Data X). To exclude false positive positions in our analysis, we conservatively selected 49,999 reference nucleosome positions, representing sites with a well-positioned nucleosome at least at one time point (see Methods). Among these 1192 nucleosomes exhibited […]”

      Several groups reported nucleosome positioning data for P. falciparum (PMID: 20015349, PMID: 20054063, PMID: 24885191, PMID: 26578577); however, only Ponts et al (2010) reported resolved numbers (~45000-90000 nucleosomes depending on developmental stage), and Bunnik et al reported ~75000 nucleosomes in a graph. Although we do not know why the other studies did not include specific numbers, we speculate that the data quality did not allow them to confidently report a number. In fact, nucleosomal reads are severely depleted in AT-rich intergenic regions in the Ponts and Bunnik datasets. In contrast, Kensche et al (and our analysis) show that nucleosomes can be identified throughout the genome of Pf. Therefore, the nucleosome numbers reported by Ponts et al and Bunnik et al are very likely underestimated.

      We included the following text in the discussion, addressing previously published datasets (L404 – 405):

      “For example, our pipeline was able to identify a total of ~127,000 nucleosomes per timepoint (=5.4 per kb), in line with observed nucleosome densities in other eukaryotes (typically 5 to 6 per kb). From these, we extracted 49,999 reference nucleosome positions with strong positioning evidence across all timepoints, which we used to characterize nucleosome dynamics of Pf longitudinally. Previous studies of P. falciparum chromatin organization did not report a total number of nucleosomes (Westenberger et al. 2009, Kensche et al. 2016), or estimated approximately 45000-90000 nucleosomes across the genome at different developmental stages (Bunnik et al. 2014, Ponts et al. 2010). However, this value likely represents an underestimation due to the depletion of nucleosomal reads in AT-rich intergenic regions observed in their datasets.”

      Figure 1B and Nucleosome Spacing

      The authors claim that Figure 1B shows developmental stage-specific variation in nucleosome spacing. However, only T35 shows a visible upstream change at position 0. In A4, A6, and A8 (Figure S4), no major change is apparent. Statistical tests are needed to validate whether the observed differences are significant and should be described in the figure legends and main text.

      We thank the reviewer for bringing this issue to our attention. We apologize for mislabelling the figure panels. The differences in nucleosome spacing across time are visible in Figure 1C. Figure 1B shows the precise array structure of the Pf nucleosomes when centered on the +1 nucleosome, and is referred to earlier in the text. The mistake is now corrected.

      In Figure 1C the mean NRL and 95% confidence interval are depicted, allowing a visual assessment of data significance (non-overlapping 95% CI-Intervals correspond to p Taken together we corrected this mistake and edited the text as follows (L194 – 199):

      “With this +1 nucleosome annotation, regularly spaced nucleosome arrays downstream of the TSS were detected, revealing a precise nucleosome organization in Pf (Figure 1B). Due to the high-resolution maps of nucleosomes we can now observe significant variations in nucleosome spacing depending on the developmental stage (Figure 1C, ANOVA on bootstrapped values (3 per timepoint) F₇,₇₂ = 35.10, p

      Genome-wide Occupancy Claims

      The claim that nucleosomes are "evenly distributed throughout the genome" (Figure S2A) is questionable. Chromosomes 3 and 11 show strong peaks mid-chromosome, and chromosome 14 shows little to no signal at the ends. This should be discussed. Subtelomeric regions, such as those containing var genes, are known to have unique chromatin features. For instance, Lopez-Rubio et al. (2009) show that subtelomeric regions are enriched for H3K9me3 and HP1, correlating with gene silencing. Should these regions not display different nucleosome distributions? Do you expect the Plasmodium genome (or any genome) to have uniform nucleosome distribution?

      On a global scale (> 10 kb) we would expect a homogeneous distribution of nucleosomes genome-wide, regardless of euchromatin or heterochromatin. We have shown this in a previous study for human cells (PMID: 30496478), which was later confirmed for Drosophila melanogaster (PMID: 31519205, PMID: 30496478) and yeast (PMID: 39587299).

      However, Figure S2A shows the distribution of the dynamic nucleosome features during the IDC, called with our pipeline. We agree with the reviewer that there are a few exceptions to the uniform distribution, which we now address in the manuscript.

      Furthermore, we agree with the reviewer that the H3K9me3 / HP1 subtelomeric regions are special. Those regions are depleted of dynamic nucleosomes in the IDC as shown in Fig. 2D and now mentioned in L280 - L282.

      We included an additional genome browser snapshot in Supplemental Figure S2B and changed the text accordingly (L245-249):

      We observed a few exceptions to the even distribution of the nucleosomes in the centers of chromosomes 3, 11 and 12, where nucleosome occupancy changes accumulated at centromeric regions (Figure S2B). Furthermore, the ends of the chromosomes are rather depleted of dynamic nucleosome features.

      Genome browser snapshot illustrating the accumulation of nucleosome occupancy changes at a centromeric site. Centered nucleosome coverage tracks (T5-T40, colored coverage tracks), nucleosome occupancy changes (yellow bar) and annotated centromeres (grey bar) taken from Hoeijmakers et al. (2012).

      Dependence on DANPOS

      The authors criticize the DANPOS pipeline for its limitations but use it extensively within nucDetective. This contradiction confuses the reader. Is nucDetective an original pipeline, or a wrapper built on existing tools?

      One unique feature of the nucDetective pipeline is to identify dynamic nucleosomes (occupancy, fuzziness, regularity, shifts) in complex experimental designs, such as time series data (Inspector workflow). To our knowledge, there is no other tool for MNase-seq data which allows multi-condition/time-series comparisons (PMID: 35061087). For example, DANPOS allows only pair-wise comparisons, which cannot be used for time-series data. For the analysis of dynamic nucleosome features we require nucleosome profiles and positions at high resolution. For this purpose, several tools do already exist (PMID: 35061087). However, researchers without experience in MNase-seq analysis often find the plethora of available tools overwhelming, which makes it challenging to select the most appropriate ones. Here we share our experience and provide the user an automated workflow (Profiler), which builds on existing tools.

      In summary the Profiler workflow is a wrapper built on existing tools and the Inspector workflow is partly a wrapper (uses DANPOS to normalize nucleosome profiles and call nucleosome positions) and implements our original algorithm to detect dynamic nucleosome features in multiple conditions / time-series data.

      Control Data Usage

      The authors should clarify whether gDNA controls were used throughout the analysis, as done in Kensche et al. (2016). Currently, this is mentioned only in the figure legend for Figure 5, not in the methods or results.

      We used gDNA normalisation to optimize the visualization of the nucleosome-depleted region upstream of the TSS in Fig 5A. Otherwise, we did not normalize the data by the gDNA control. The reason is the same as for not including sequence normalization in the pipeline (see comment above).

      We added a paragraph describing the limitations to the Discussion (L447-457):

      Depending on the degree of MNase digestion, preferentially nucleosomes from GC rich regions are revealed in MNase-seq experiments (Schwartz et al. 2019). However, no sequence or gDNA normalisation step was included in the nucDetective pipeline. To identify dynamic nucleosomes, comparisons are performed between the same nucleosome positions at the same genomic sites across multiple samples. Hence, the sequence context is constant and does not confound the analysis. Introducing a sequence normalization step might even distort and bias the results. Nevertheless, it is highly advisable to use low MNase concentrations in chromatin digestions to reduce the sequence bias in nucleosome extractions. This turned out to be a crucial condition to obtain a homogenous nucleosome distribution in the AT-rich intergenic regions of eukaryotic genomes and especially in the AT-rich genome of Pf (Schwartz et al. 2019, Kensche et al. 2016).

      We added the following statement to the Methods: Additionally, the TSS profile shown in Figure 5A was normalized by the gDNA control for better NDR visualization.

      Lack of Statistical Power for Time-Series Analyses

      Although the pipeline is presented as suitable for time-series data, it lacks statistical tools to determine whether differences in nucleosome positioning or fuzziness are significant across conditions. Visual interpretation alone is insufficient. Statistical support is essential for any differential analysis.

      We understand the value of statistical support in such an analysis. However, in biology we often lack the sample sizes needed to accurately estimate the variance parameters required for statistical modeling. As MNase-seq experiments require a large amount of input material and high sequencing depth, the number of samples in most experiments is low, often with only two replicates (PMID: 23193179). Therefore, we decided that the nucDetective pipeline should rather be treated as a screening method to identify nucleosome features with high variance across all conditions. This prevents misuse of p-values. A common misinterpretation we observed is the use of non-significant p-values to conclude that no biological change exists, despite inadequate statistical power to detect such changes. We included a paragraph in the limitations section discussing the limitations of statistical analysis of MNase-Seq data.

      Changes to the manuscript text: We added a paragraph describing the limitations to the Discussion (L435-446):

      As MNase-seq experiments require a large amount of input material and high sequencing depths, most published MNase-seq experiments do not provide the appropriate sample sizes required to accurately estimate the variance parameters necessary for statistical modelling (Chen et al. 2013). Therefore, dynamic nucleosomes are not identified through statistical testing but rather by ranking nucleosome features according to their variance across all samples and applying a variance threshold to distinguish them. This concept is well established to identify super-enhancers (Whyte et al. 2013). In this study we set the variance cutoff to a slope of 3, resulting in a high data confidence. However, other data sets might require further adjustment of the variance cutoff, depending on data quality or sequencing depth. The nucDetective identification of dynamic nucleosomes can be seen as a screening approach to provide a holistic overview of nucleosome dynamics in the system, which provides a basis for further research.
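The ranking-and-threshold idea described above can be illustrated with a minimal sketch. It is not the nucDetective implementation: the function names, the scaling of both axes to [0, 1], and the use of the slope between neighbouring ranked points are assumptions made for illustration, loosely following the way super-enhancers are separated from typical enhancers. Only the cutoff value of 3 comes from the text.

```python
from typing import List, Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance of one nucleosome feature across all samples."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def dynamic_by_slope(feature_matrix: Sequence[Sequence[float]],
                     slope_cutoff: float = 3.0) -> List[int]:
    """Return row indices (nucleosomes) flagged as dynamic.

    Rows are nucleosomes, columns are samples or timepoints.
    Hypothetical procedure: rank rows by variance, scale rank and
    variance to [0, 1], and flag every row from the first point where
    the local slope of the ranked curve exceeds slope_cutoff.
    """
    variances = [variance(row) for row in feature_matrix]
    order = sorted(range(len(variances)), key=variances.__getitem__)
    n = len(order)
    if n < 2 or variances[order[-1]] == 0:
        return []  # nothing varies, so nothing is dynamic
    v_max = variances[order[-1]]
    scaled = [variances[i] / v_max for i in order]
    step = 1.0 / (n - 1)  # distance between neighbouring ranks on the x axis
    for k in range(1, n):
        slope = (scaled[k] - scaled[k - 1]) / step
        if slope > slope_cutoff:
            return sorted(order[k:])
    return []


# Toy example: occupancy of five nucleosomes at four timepoints.
occupancy = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 5.1, 4.9, 5.0],
    [1.0, 9.0, 1.0, 9.0],  # strongly varying nucleosome
    [5.0, 5.2, 4.8, 5.0],
    [5.0, 5.0, 5.1, 4.9],
]
dynamic = dynamic_by_slope(occupancy)
```

On real data a slope taken between single neighbouring points is noisy, so a smoothed or fitted curve would be a more robust choice; the sketch only shows why no p-value is involved in the selection.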

      Reproducibility of Methods

      The Methods section is not sufficient to reproduce the results. The GitHub repository lacks the necessary code to generate the paper's figures and focuses on an exemplary yeast dataset. The authors should either:

      • Update the repository with relevant scripts and examples,
      • Clearly state the repository's purpose, or
      • Remove the link entirely.

      Readers must understand that nucDetective is dedicated to assessing nucleosome fuzziness, occupancy, shift, and regularity dynamics, not the downstream analyses presented in the paper.

      We thank the reviewer for this helpful comment. In addition to the main nucDetective repository, a second GitHub link is provided in the Data Availability section, which contains the scripts used to generate the figures presented in the paper. This separation was intentional to distinguish the general-purpose nucDetective tool from the project-specific analyses performed for this study. We acknowledge that this may not have been sufficiently clear.

      To have all resources available at a single citable permanent location we included a link to the corresponding Zenodo repository (https://doi.org/10.5281/zenodo.16779899) in the Data and materials availability statement.

      The Zenodo repository contains:

      Code (scripts.zip) and annotation of Plasmodium falciparum (Annotation.zip) to reproduce the nucDetective v1.1 (nucDetective-1.1.zip) analysis as done in the research manuscript entitled "Deciphering chromatin architecture and dynamics in Plasmodium falciparum using the nucDetective pipeline".

      The folder "output_nucDetective" contains the complete output of the nucDetective analysis pipeline as generated by the "01_nucDetective_profiler.sh" and "02_nucDetective_inspector.sh" scripts.

      Nucleosome coverage tracks, annotation of nucleosome positions and dynamic nucleosomes are additionally deposited in the folder "Pf_nucleosome_annotation_of_nucDetective".

      To make this clearer, we added the following text to the Material and Methods, in the “The nucDetective pipeline” section:

      Changes in the manuscript text (L518-519):

      The code, software and annotations used to run the nucDetective pipeline along with the output have been deposited on Zenodo (https://doi.org/10.5281/zenodo.16779899).

      Supplementary Tables

      Including supplementary tables showing pipeline outputs (e.g., nucleosome scores, heatmaps, TSS extraction) would help readers understand the input-output structure and support figure interpretations.

      See comments above.

      We included a link to the corresponding Zenodo repository (https://doi.org/10.5281/zenodo.16779899) in the Data and materials availability statement.

      The repository contains:

      Code (scripts.zip) and annotation of Plasmodium falciparum (Annotation.zip) to reproduce the nucDetective v1.1 (nucDetective-1.1.zip) analysis as done in the research manuscript entitled "Deciphering chromatin architecture and dynamics in Plasmodium falciparum using the nucDetective pipeline".

      The folder "output_nucDetective" contains the complete output of the nucDetective analysis pipeline as generated by the "01_nucDetective_profiler.sh" and "02_nucDetective_inspector.sh" scripts.

      Minor Comments:

      The authors should moderate claims such as "no studies have reported a well-positioned +1 nucleosome" in P. falciparum, as this contradicts existing literature. Similarly, avoid statements like "poorly understood chromatin architecture of Pf," which undervalue extensive prior work (e.g., discovery of histone lactylation in Plasmodium, Merrick et al., 2023).

      We would like to clarify that we neither wrote that “no studies have reported a well-positioned +1 nucleosome” in P. falciparum nor intended to imply such a thing. However, we acknowledge that our original wording may have been unclear. To address this, we have revised the manuscript to explicitly acknowledge prior studies on chromatin organization and highlight our contribution.

      In the Abstract L26-L30: Contrary to the current view of irregular chromatin, we demonstrate for the first time regular phased nucleosome arrays downstream of TSSs, which, together with the established +1 nucleosome and upstream nucleosome-depleted region, reveal a complete canonical eukaryotic promoter architecture in Pf.

      Introduction L156-L159: For example, we identify a phased nucleosome array downstream of the TSS. Together with a well-positioned +1 nucleosome and an upstream nucleosome-free region, these findings support a promoter architecture in Pf that resembles classical eukaryotic promoters (Bunnik et al. 2014, Kensche et al. 2016).

      Results L180-L183: These new Pf nucleosome maps reveal a nucleosome organisation at transcription start sites (TSS) reminiscent of the general eukaryotic chromatin structure, featuring a reported well-positioned +1 nucleosome, an upstream nucleosome-free region (NFR, Bunnik et al. 2014, Kensche et al. 2016), and, shown for the first time in Pf, a phased nucleosome array downstream of the TSS.

      Discussion L412-L421: Previous analyses of Pf chromatin have identified +1 nucleosomes and NFRs (Bunnik et al 2014, Kensche et al. 2016). Here we extend this understanding by demonstrating phased nucleosome array structures throughout the genome. This finding provides evidence for a spatial regulation of nucleosome positioning in Pf, challenging the notion that nucleosome positioning is relatively random in gene bodies (Bunnik et al. 2014, Kensche et al. 2016). Consequently our results contribute to the understanding that Pf exhibits a typical eukaryotic chromatin structure, including well-defined nucleosome positioning at the TSS and regularly spaced nucleosome arrays (Schones et al. 2008; Yuan et al. 2005).

      The phrase “poorly understood chromatin architecture” has been modified to “underexplored chromatin architecture” in order to more accurately reflect the potential for further analyses and contributions to the field, while avoiding any potential misinterpretation of an attempt to undervalue previous work.

      Track labels in figures (e.g., Figure 5B) are too small to be legible.

      We made the labels bigger.

      Several figures (e.g., Figure 5B, S4B) lack statistical significance tests. Are the differences marked with stars statistically significant or just visually different?

      We added statistics to S4B.

      Differences in 5B were identified by visual inspection. To clarify this, we exchanged the asterisks to arrows in Fig.5B and changed the text in the legend:

      Arrows mark descriptive visual differences in nucleosome occupancy.

      Figure S3 includes a small black line on top of the table. Is this an accidental crop?

      We checked the figure carefully; however, the black line does not appear in our PDF viewer or on the printed paper.

      The authors should state the weaknesses and limitations of this pipeline.

      We added a limitations section to the Discussion (see comments above).

      Reviewer #1 (Significance (Required)):

      The proposed pipeline is useful and timely. It can benefit research groups willing to analyse MNase-Seq data of complex genomes such as P. falciparum. The tool requires users to have extensive experience in coding, as the authors didn't include any clear and explicit code showing how to start processing the data from raw files. Nevertheless, there are multiple tools that can detect nucleosome occupancy and that the authors neither cite nor mention. I have included a link to a large list of tools and pipelines developed for the analysis of nucleosome positioning experiments (Software to analyse nucleosome positioning experiments - Gene Regulation - Teif Lab). I think it would be useful for the authors to reference this.

      We appreciate the reviewer’s valuable suggestion. We included a citation to the comprehensive database of nucleosome analysis tools curated by the Teif lab (Shtumpf et al., 2022). We chose to reference only selected tools in addition to this resource rather than listing all individual tools to maintain clarity and avoid overloading the manuscript with numerous citations.

      Despite valid, I still believe that controlling their pipeline by filtering out false positives and including more QC steps at the Inspector stage is strongly needed. That would boost the significance of this pipeline.

      We thank the reviewer for the assessment of our study and for recognizing that our MNase-seq analysis pipeline nucDetective can be a useful tool for the chromatin community utilizing MNase-Seq in complex settings.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, Holzinger and colleagues have developed a new pipeline to assess chromatin organization in linear space and time. They used this pipeline to reevaluate nucleosome organization in the malaria parasite, P. falciparum. Their analysis revealed typical arrangement of nucleosomes around the transcriptional start site. Furthermore, it further strengthened and refined the connection between specific nucleosome dynamics and epigenetic marks, transcription factor binding sites or transcriptional activity.

      Major comments

      • I am wondering what the main selling point of this manuscript is. If it is the development of the nucDetective pipeline, perhaps it would be best to first benchmark it and directly compare it to existing tools on a dataset where nucleosome fuzziness, shifting and regularity have been analyzed before. If, on the other hand, new insights into Plasmodium chromatin biology are the primary target, validation of some of the novel findings would be advantageous (e.g. refinement of TSS positions, relevance of novel motifs, etc).

      NucDetective is a novel pipeline to identify dynamic nucleosome properties across different datasets, such as time series or developmental stages, as analysed for the erythrocytic cycle in this manuscript. As no comparable pipeline allowing such direct comparisons exists for MNase-Seq data, we used the existing analysis and high-quality dataset of Kensche et al. to demonstrate the strong improvements offered by this kind of analysis. Accordingly, we combined pipeline development with research on chromatin structure to showcase the utility of this new pipeline.

      • The authors identify a strong positioning of the +1 nucleosome by searching for a positioned nucleosome in the vicinity of the assigned TSS. Given the ill-defined nature of TSSs, this approach sounds logical at first glance. However, given the rather broad search space from -100 to +300 bp, I am wondering whether it is a sort of "self-fulfilling prophecy". Conversely, it would be good to validate that this approach indeed helps to refine TSS positions.

      We thank the reviewer for raising this important point. We would like to clarify that we do not claim to redefine or precisely determine TSS positions in our study. Instead, we use annotated TSS coordinates as a reference to identify nucleosomes that correspond to the +1 nucleosome, based on their proximity to the TSS.

      We selected the search window from -100 to +300 bp to account for known variability in Pf TSS annotation. For example, dominant transcription start sites identified by 5'UTR-seq tag clusters can differ by several hundred base pairs within a single time point (Chappell et al., 2020). The broad window thus allows us to capture the principal nucleosome positions near a TSS, even when the TSS itself is imprecise or heterogeneous. Based on the TSS centered plots (Figure 2C and Figure S1B), we reasoned that a window of -100 to +300 is sufficient to capture the majority of the +1 nucleosomes, which would have been missed by using smaller window sizes. This strategy aligns with well-established conventions in yeast chromatin biology, where the +1 nucleosome is defined relative to the TSS (Jiang and Pugh, 2009; Zhang et al. 2011) and commonly used as an anchor point to visualize downstream phased nucleosome arrays and upstream nucleosome-depleted regions (Rossi et al., 2021; Oberbeckmann et al., 2019; Krietenstein et al., 2016 and many more). Accordingly, our approach leverages these accepted standards to interpret nucleosome positioning without re-defining TSS annotations.
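The window-based assignment described above can be sketched as follows. This is an illustrative assumption rather than the nucDetective code: the function name is hypothetical, and choosing the candidate closest to the annotated TSS is only one plausible tie-breaking rule, since the reply states that nucleosomes are selected by proximity but does not spell out the exact criterion. The window of -100 to +300 bp is taken from the text and is applied in the direction of transcription.

```python
from typing import Optional, Sequence


def find_plus_one(tss: int, strand: str, dyads: Sequence[int],
                  upstream: int = 100, downstream: int = 300) -> Optional[int]:
    """Return the dyad position assigned as +1 nucleosome, or None.

    tss    : annotated transcription start site (genomic coordinate)
    strand : "+" or "-"
    dyads  : dyad positions of called nucleosomes on the same chromosome
    """
    candidates = []
    for dyad in dyads:
        # Offset relative to the TSS, measured along the direction of transcription.
        offset = dyad - tss if strand == "+" else tss - dyad
        if -upstream <= offset <= downstream:
            candidates.append((abs(offset), dyad))
    if not candidates:
        return None
    # Assumption: the candidate closest to the annotated TSS is taken as +1.
    return min(candidates)[1]


plus_strand_hit = find_plus_one(1000, "+", [850, 1060, 1240])
minus_strand_hit = find_plus_one(1000, "-", [700, 940, 1150])
no_hit = find_plus_one(1000, "+", [500])
```

Profiles can then be re-centered on the returned dyad instead of the annotated TSS, which is what makes downstream arrays visible even when the TSS annotation is imprecise.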

      • Figure 1C: I am wondering how should the reader interpret the changes in nucleosomal repeat length changes throughout the cycle. Is linker DNA on average 10 nucleotides shorter at T30 compared to T5 timepoint? If so how could such "dramatic reorganization" be achieved at the molecular level in absence of a known linker DNA-binding protein. More importantly is this observation supported by additional evidence (e.g. dinucleosomal fragment length) or could it be due to slightly different digestion of the chromatin at the different stages or other technical variables?

      We thank the reviewer for this insightful question regarding the interpretation of NRL changes across the cell cycle. The reviewer is right in her or his interpretation – linker DNA is on average ~10 bp shorter at T30 than at T5.

      To address concerns about additional evidence and potential MNase digestion variability, we now analyzed MNase-seq fragment sizes by shifting mononucleosome peaks of each time point to the canonical 147 bp length, to correct for MNase digestion differences. After this normalisation, dinucleosome fragment length distributions revealed the shortest linker lengths at T30 and T35, whereas T5 and T10 showed longer DNA linkers. These results confirm our previous NRL measurements based on mononucleosomal read distances while controlling for MNase digestion bias.

      The molecular basis of this reorganization is still unclear. While linker histone H1 is considered absent in Plasmodium falciparum, the presence of an uncharacterized linker DNA–binding protein or alternative factors fulfilling a similar role cannot be excluded (Gill et al. 2010). However, the absence of H1 across all developmental stages fails to explain stage-specific chromatin changes. We hypothesize that Apicomplexans evolved specialized chromatin remodelers to compensate for the missing H1, which may also drive the dynamic NRL changes observed. The low NRL coincides with high transcriptional activity in Pf during the trophozoite stage, which is consistent with previous reports linking elevated transcription to reduced NRL in other eukaryotes (Baldi et al. 2018). In addition, the schizont stage involves multiple rounds of DNA replication, which require large histone supplies to be produced during that time. It may well be that a high level of histone synthesis and DNA amplification results in a short period of increased nucleosome density and shorter NRL until the system reaches equilibrium again (Beshnova et al. 2014). Although speculative, we suggest a model wherein increased transcription promotes elevated nucleosome turnover and re-assembly by specialized remodeling enzymes, combined with a high abundance of histones, resulting in higher nucleosome density and decreased NRL. Unfortunately, absolute quantification of nucleosome levels from this MNase-seq dataset is not possible without spike-in controls, which makes it infeasible to test the hypothesis with the available data set (Chen et al. 2016).

      Minor comments

      • I am wondering whether fuzziness and occupancy changes are truly independent categories. I am asking as both could lead to reduction of the signal at the nucleosome dyad and because they show markedly similar distribution in relation to the TSS and associate with identical epigenetic features (Figure 2B-D). Figure 2A indicates minimal overlap between them, but this could be due to the fact that the criteria to define these subtypes is defined such to place nucleosomes to one or the other category, but at the end they represent two flavors of the same thing.

      Indeed, changes in occupancy and fuzziness can appear related because both features may reduce signal intensity at the nucleosome dyad and both are connected to “poor nucleosome positioning”. However, their definitions and measurements are clearly distinct and technically independent. Occupancy reflects the peak height at the nucleosome dyad, while fuzziness quantifies the spread of reads around the peak, measured as the standard deviation of read positions within each nucleosome peak (Jiang and Pugh, 2009; Chen et al., 2013). Although a reduction in occupancy can contribute to increased fuzziness by diminishing the dyad axis signal, fuzziness primarily arises from increased variability in the flanking regions around the nucleosome position center. While this distinction is established in the field, it is also often confused by the concept of well (high occupancy, low fuzziness) and poorly (high fuzziness, low occupancy) positioned nucleosomes, where both of these features are considered.
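The distinction between the two measures can be made concrete with a small sketch. It follows the definitions given above (occupancy as peak height at the dyad, fuzziness as the standard deviation of read positions within the peak), but the function name, the use of raw fragment-center counts without smoothing, and the peak half-width of 73 bp are assumptions for illustration and not the DANPOS or nucDetective implementation.

```python
from statistics import pstdev
from typing import Sequence, Tuple


def occupancy_and_fuzziness(read_centers: Sequence[int], dyad: int,
                            half_width: int = 73) -> Tuple[int, float]:
    """Compute toy occupancy and fuzziness values for one nucleosome peak.

    read_centers : genomic positions of fragment centers
    dyad         : called dyad position of the nucleosome
    half_width   : hypothetical half-width of the peak region in bp
    """
    in_peak = [p for p in read_centers if abs(p - dyad) <= half_width]
    # Occupancy: peak height at the dyad (here a raw, unsmoothed count).
    occupancy = sum(1 for p in in_peak if p == dyad)
    # Fuzziness: spread of read positions around the peak.
    fuzziness = pstdev(in_peak) if len(in_peak) > 1 else 0.0
    return occupancy, fuzziness


# Two peaks with identical height at the dyad but different spread.
sharp = [100, 100, 100, 100, 99, 101]
fuzzy = [100, 100, 100, 100, 70, 130]
sharp_occ, sharp_fuzz = occupancy_and_fuzziness(sharp, dyad=100)
fuzzy_occ, fuzzy_fuzz = occupancy_and_fuzziness(fuzzy, dyad=100)
```

In this toy case both peaks have the same occupancy while their fuzziness differs, which illustrates why the two features are measured independently even though they often change together.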

      • Do the authors detect a spatial relationship between fuzzy and repositioned/evicted nucleosomes at the level of individual nucleosome pairs? In other words, can fuzziness be the consequence of repositioning/eviction of the neighboring nucleosome?

      In Figure 2A we analyse the spatial overlap of all features to each other. The analysis clearly shows that fuzziness, occupancy changes and position changes occur mostly at distinct spatial sites (overlaps between 3 and 10%, Fig. 2A). Therefore, we suggest that the features correspond to independent processes. Likewise, we do observe an overlap between occupancy and ATAC-seq peaks, but not nucleosome positioning shifts, clearly discriminating different processes.

      • Figure 4: enrichment values and measure of statistical significance for the different motifs are missing. Also, have any other motifs been identified?

      This information is present in Supplemental Figure S3. Here we show the top 3 hits in each cluster. In the legend of Figure 4 we refer to Fig. S3:

      L1054 –1055:

      “Additional enriched motifs along with the significance of motif enrichment and the fraction of motifs at the respective nucleosome positions are shown in Figure S3”

      • The M&M would benefit from some more details, e.g. settings in the pipeline, or which fragment sizes were used to map the MNase-seq data?

      We included a link to the corresponding Zenodo repository (https://doi.org/10.5281/zenodo.16779899) in the Data and materials availability statement.

      The repository contains:

      Code (scripts.zip) and annotation of Plasmodium falciparum (Annotation.zip) to reproduce the nucDetective v1.1 (nucDetective-1.1.zip) analysis as done in the research manuscript entitled "Deciphering chromatin architecture and dynamics in Plasmodium falciparum using the nucDetective pipeline".

      The folder "output_nucDetective" contains the complete output of the nucDetective analysis pipeline as generated by the "01_nucDetective_profiler.sh" and "02_nucDetective_inspector.sh" scripts.

      Nucleosome coverage tracks, annotation of nucleosome positions and dynamic nucleosomes are additionally deposited in the folder "Pf_nucleosome_annotation_of_nucDetective".

      To make this clearer, we added the following text to the Material and Methods, in the “The nucDetective pipeline” section:

      Changes in the manuscript (L518-519):

      The code, software and annotations used to run the nucDetective pipeline along with the output have been deposited on Zenodo (https://doi.org/10.5281/zenodo.16779899).

      which fragment sizes were used to map the MNase-seq data?

      The default setting in nucDetective is to use fragment sizes of 140 – 200 bp, which corresponds to the main mono-nucleosome fraction in standard MNase-seq experiments. However, the correct selection of fragment sizes may vary depending on the organism and the variations in MNase-seq protocols. Therefore, the pipeline offers the option of changing the cutoff parameter (--minLen; --maxLen), accordingly. Kensche et al thoroughly tested the best selection of the fragment sizes for the data set, which is used in this manuscript. We agree with their selection and used the same cutoffs (75-175 bp).

      This is stated in line 535-536:

      The fragments are further filtered to mono-nucleosome sized fragments (here we used 75 – 175 bp)

      We changed the text:

      The fragments are further filtered to mono-nucleosome sized fragments (default setting 140-200 bp; changed in this study to 75 – 175 bp)
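      The length filter described above can be sketched as follows. This is a minimal illustration and not the nucDetective implementation: the function name, the (start, end) tuple representation, and the half-open coordinate convention are assumptions, as is the choice to treat both cutoffs as inclusive. Only the cutoff values (140-200 bp by default, 75-175 bp in this study) come from the response.

```python
def filter_fragments(fragments, min_len=140, max_len=200):
    """Keep fragments whose length lies within [min_len, max_len].

    Inclusive bounds and half-open (start, end) coordinates are assumptions.
    """
    kept = []
    for start, end in fragments:
        length = end - start
        if min_len <= length <= max_len:
            kept.append((start, end))
    return kept


# Toy fragments with lengths 80, 150, 210 and 60 bp.
fragments = [(100, 180), (500, 650), (900, 1110), (2000, 2060)]

default_kept = filter_fragments(fragments)                          # 140-200 bp
study_kept = filter_fragments(fragments, min_len=75, max_len=175)   # 75-175 bp
```

      With the toy input, the default window keeps only the 150 bp fragment, whereas the 75-175 bp window also retains the 80 bp fragment, which shows how the choice of cutoffs changes which fragments count as mono-nucleosomal.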

      We highlighted other parameters used in this study in the material and methods part.

      Reviewer #2 (Significance (Required)):

      Overall, the manuscript is well written and findings are clearly and elegantly presented. The manuscript describes a new pipeline to map and analyze MNase-seq data across different stages or conditions, though the broader applicability of the pipeline and advancements over existing tools could be better demonstrated. Importantly, the manuscript makes use of this pipeline to provide a refined and likely more accurate view on (the dynamics of) nucleosome positioning over the AT-rich genome of P. falciparum. While these observations make sense, they remain rather descriptive/associative and lack further experimental validation. Overall, this manuscript could be of interest to both researchers working on chromatin biology and Plasmodium gene regulation.

      We thank the reviewer for the assessment of our study and for recognizing that the results of our MNase-seq analysis pipeline nucDetective contribute to a better understanding of Pf chromatin biology.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The manuscript "Deciphering chromatin architecture and dynamics in Plasmodium falciparum using the nucDetective pipeline" describes computational analysis of previously published data of P. falciparum chromatin. This work corrects the prevailing view that this parasitic organism has an unusually disorganized chromatin organization, which had been attributed to its high genomic AT content, lack of histone H1, and ancient derivation. The authors show that instead P. falciparum has a very typical chromatin organization. Part of the refinement is due to aligning data on +1 nucleosome positions instead of TSSs, which have been poorly mapped. The computational tools corral some useful features for querying epigenomic structure that make visualization straightforward, especially for fuzzy nucleosomes.

      Reviewer #3 (Significance (Required)):

      As a computational package this is a nice presentation of fairly central questions. The assessment and display of fuzzy nucleosomes is a nice feature.

      We thank the reviewer for the assessment of our study and are pleased that the reviewer acknowledges the value and usability of our pipeline.

    1. Reviewer #2 (Public review):

      Summary:

      This paper considers the effects of cognitive load (using an n-back task related to font color), predictability, and age on reading times in two experiments. There were main effects of all predictors, but more interesting effects of load and age on predictability. The effect of load is very interesting, but the manipulation of age is problematic, because we don't know what is predictable for different participants (in relation to their age). There are some theoretical concerns about prediction and predictability, and a need to address literature (reading time, visual world, ERP studies).

      There is a major concern about the effects of age. See the results (155-190): this depends what is meant by word predictability. It's correct if it means the predictability in the corpus. But it may or may not be correct if it refers to how predictable a word is to an individual participant. The texts are unlikely to be equally predictable to different participants, and in particular to younger vs. older participants, because of their different experience. To put it informally, the newspaper articles may be more geared to the expectations of younger people. But there is also another problem: the LLM may have learned on the basis of language that has largely been produced by young people and so its predictions are based on what young people are likely to say. Both of these possibilities strike me as extremely likely. So it may be that older adults are affected more by words that they find surprising, but it is also possible that the texts are not what they expect, or the LLM predictions from the text are not the ones that they would make. In sum, I am not convinced that the authors can say anything about the effects of age unless they can determine what is predictable for different ages of participants. I suspect that this failure to control is an endemic problem in the literature on aging and language processing and needs to be systematically addressed.

      Overall, I think the paper makes enough of a contribution with respect to load to be useful to the literature. But for discussion of age, we would need something like evidence of how younger and older adults would complete these texts (on a word-by-word basis) and that they were equally predictable for different ages. I assume there are ways to get LLMs to emulate different participant groups, but I doubt if we could be confident about their accuracy without a lot of testing. But without something like this, I think making claims about age would be quite misleading.

      The authors respond to my summary comment by saying that prediction is individual and that they account for age-related effects in their models. But these aren't my concerns. Rather:

      (1) The texts (these edited newspaper articles) could be more predictable for younger than older adults. If so, effects with older adults could simply be because people are less likely to predict less than more predictable words.

      (2) The GPT-2 generated surprisal scores may correspond more closely to younger than older adult responses -- that is, its next word predictions may be more younger- than older-adult-like.

      In my view, the authors have two choices: they could remove the discussion of age-related effects, or they could try to address BOTH (1) and (2).

      As an aside, consider what we would conclude if we drew similar conclusions from a study in which children and adults read the same (children's) texts, but we didn't test what was predictable to each of them separately.

      The paper is really strong in other respects and if my concern is not addressed, the conclusions about age might be generally accepted.

    2. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This manuscript reports a dual-task experiment intended to test whether language prediction relies on executive resources, using surprisal-based measures of predictability and an n-back task to manipulate cognitive load. While the study addresses a question under debate, the current design and modeling framework fall short of supporting the central claims. Key components of cognitive load, such as task switching, word prediction vs integration, are not adequately modeled. Moreover, the weak consistency in replication undermines the robustness of the reported findings. Below unpacks each point. 

      Cognitive load is a broad term. In the present study, it can be at least decomposed into the following components: 

      (1)  Working memory (WM) load: news, color, and rank. 

      (2)  Task switching load: domain of attention (color vs semantics), sensorimotor rules (c/m vs space).

      (3)  Word comprehension load (hypothesized against): prediction, integration. 

      The components of task switching load should be directly included in the statistical models. Switching of sensorimotor rules may be captured by the "n-back reaction" (binary) predictor. However, the switching of attended domains and the interaction between domain switching and rule complexity (1-back or 2-back) were not included. The attention control experiment (1) avoided useful statistical variation from the Read Only task, and (2) did not address interactions. More fundamentally, task-switching components should be directly modeled in both performance and full RT models to minimize selection bias. This principle also applies to other confounding factors, such as education level. While missing these important predictors, the current models have an abundance of predictors that are not so well motivated (see later comments). In sum, with the current models, one cannot determine whether the reduced performance or prolonged RT was due to affecting word prediction load (if it exists) or merely affecting the task switching load. 

      The entropy and surprisal need to be more clearly interpreted and modeled in the context of the word comprehension process. The entropy concerns the "prediction" part of the word comprehension (before seeing the next word), whereas surprisal concerns the "integration" part as a posterior. This interpretation is similar to the authors writing in the Introduction that "Graded language predictions necessitate the active generation of hypotheses on upcoming words as well as the integration of prediction errors to inform future predictions [1,5]." However, the Results of this study largely ignored entropy (treating it as a fixed effect) and only focus on surprisal without clear justification. 

      In Table S3, with original and replicated model fitting results, the only consistent interaction is surprisal x age x cognitive load [2-back vs. Reading Only]. None of the two-way interactions can be replicated. This is puzzling and undermines the robustness of the main claims of this paper. 

      Reviewer #2 (Public review):

      Summary

      This paper considers the effects of cognitive load (using an n-back task related to font color), predictability, and age on reading times in two experiments. There were main effects of all predictors, but more interesting effects of load and age on predictability. The effect of load is very interesting, but the manipulation of age is problematic, because we don't know what is predictable for different participants (in relation to their age). There are some theoretical concerns about prediction and predictability, and a need to address literature (reading time, visual world, ERP studies). 

      Strengths/weaknesses 

      It is important to be clear that predictability is not the same as prediction. A predictable word is processed faster than an unpredictable word (something that has been known since the 1970/80s), e.g., Rayner, Schwanenfluegel, etc. But this could be due to ease of integration. I think this issue can probably be dealt with by careful writing (see point on line 18 below). To be clear, I do not believe that the effects reported here are due to integration alone (i.e., that nothing happens before the target word), but the evidence for this claim must come from actual demonstrations of prediction. 

      The effect of load on the effects of predictability is very interesting (and also, I note that the fairly novel way of assessing load is itself valuable). Assuming that the experiments do measure prediction, it suggests that they are not cost-free, as is sometimes assumed. I think the researchers need to look closely at the visual world literature, most particularly the work of Huettig. (There is an isolated reference to Ito et al., but this is one of a large and highly relevant set of papers.) 

      There is a major concern about the effects of age. See the Results (161-5): this depends on what is meant by word predictability. It's correct if it means the predictability in the corpus. But it may or may not be correct if it refers to how predictable a word is to an individual participant. The texts are unlikely to be equally predictable to different participants, and in particular to younger vs. older participants, because of their different experiences. To put it informally, the newspaper articles may be more geared to the expectations of younger people. But there is also another problem: the LLM may have learned on the basis of language that has largely been produced by young people, and so its predictions are based on what young people are likely to say. Both of these possibilities strike me as extremely likely. So it may be that older adults are affected more by words that they find surprising, but it is also possible that the texts are not what they expect, or the LLM predictions from the text are not the ones that they would make. In sum, I am not convinced that the authors can say anything about the effects of age unless they can determine what is predictable for different ages of participants. I suspect that this failure to control is an endemic problem in the literature on aging and language processing and needs to be systematically addressed. 

      Overall, I think the paper makes enough of a contribution with respect to load to be useful to the literature. But for discussion of age, we would need something like evidence of how younger and older adults would complete these texts (on a word-by-word basis) and that they were equally predictable for different ages. I assume there are ways to get LLMs to emulate different participant groups, but I doubt that we could be confident about their accuracy without a lot of testing. But without something like this, I think making claims about age would be quite misleading. 

      We thank both reviewers for their constructive feedback and for highlighting areas where our theoretical framing and analyses could be clarified and strengthened. We have carefully considered each of the points raised and made substantial additions and revisions.

      As a summary, we have directly addressed the concerns raised by the reviewers by incorporating task-switching predictors into the statistical models, paralleling our focus on surprisal with a full analysis and interpretation of entropy, clarifying the robustness (and limitations) of the replicated findings, and addressing potential limitations in our Discussion.

      We believe these revisions substantially strengthen the manuscript and improve the reading flow, while also clarifying the scope of our conclusions. We will now illustrate these changes in more detail:

      (1) Cognitive load and task-switching components.

      We agree that cognitive load is a multifaceted construct, particularly since our secondary task broadly targets executive functioning. In response to Reviewer 1, we therefore examined task-switching demands more closely by adding the interaction term n-back reaction × cognitive load to a model restricted to 1-back and 2-back Dual Task blocks (as there were no n-back reactions in the Reading Only condition). This analysis showed significantly longer reading times in the 2-back than in the 1-back condition, both for trials with and without an n-back reaction. Interestingly, the difference between reaction and no-reaction trials was smaller in the 2-back condition (β = -0.132, t(188066.09) = -34.269, p < 0.001), which may simply reflect the general increase in reading time for all trials, so that the effect of the button press time decreases in comparison to the 1-back condition. In that sense, these findings are not unexpected and largely mirror the main effect of cognitive load. Crucially, however, the three-way interaction of cognitive load, age, and surprisal remained robust (β = 0.00004, t(188198.86) = 3.540, p < 0.001), indicating that our effects cannot be explained by differences in task-switching costs across load conditions. To maintain a streamlined presentation, we opted not to include this supplementary analysis in the manuscript.

      (2) Entropy analyses.

      Reviewer 1 pointed out that our initial manuscript placed more emphasis on surprisal. In the revised manuscript, we now report a full set of entropy analyses in the supplementary material. In brief, these analyses show that participants generally benefit from lower entropy across cognitive load conditions, with one notable exception: young adults in the Reading Only condition, where higher entropy was associated with faster reading times. We have added these results to the manuscript to provide a more complete picture of the prediction versus integration distinction highlighted in the review (see sections “Control Analysis: Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Methods and “Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Results).

      (3) Replication consistency.

      Reviewer 1 noted that the results of the replication analysis were somewhat puzzling. We take this point seriously and agree that the original model was likely underpowered to detect the effect of interest. To address this, we excluded the higher-level three-way interaction of age, cognitive load, and surprisal, focusing instead on the primary effect examined in this paper: the modulatory influence of cognitive load on surprisal. Using this approach, we observed highly consistent results between the original online subsample and the online replication sample.

      (4) Potential age bias in GPT-2.  

      We thank Reviewer 2 for their thoughtful and constructive feedback and agree that a potential age bias in GPT-2’s next-token predictions warrants caution. We thus added a section in the Discussion explicitly considering this limitation, and explain why it should not affect the implications of our study.

      Reviewer #1 (Recommendations for the authors):

      The d-prime model operates at the block level. How many observations go into the fitting (about 175*8=1050)? How can the degrees of freedom of a certain variable go up to 188435? 

      We thank the reviewer for spotting this issue. Indeed, there was an error in our initial calculations, which we have now corrected in the manuscript. Importantly, the correction does not meaningfully affect the results for the analysis of d-primes or the conclusions of the study (see line 102).  

      “A linear mixed-effects model revealed n-back performance declined with cognitive load (β = -1.636, t(173.13) = -26.120, p < 0.001), with more pronounced effects with advancing age (β = -0.014, t(169.77) = -3.931, p < 0.001; Fig. 3b, Table S1)”.
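      For readers less familiar with the sensitivity index analysed here, a block-level d′ can be sketched as the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The response does not state how d′ was computed, so the function below is only an illustration: the log-linear correction (adding 0.5 to counts) and all trial counts are assumptions.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction keeps both rates away from 0 and 1; this
    correction is an assumption, not taken from the manuscript.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for one easier and one harder block.
easy_block = d_prime(hits=18, misses=2, false_alarms=1, correct_rejections=79)
hard_block = d_prime(hits=12, misses=8, false_alarms=6, correct_rejections=74)
```

      Because one d′ value is obtained per participant and block, a model of d′ has far fewer observations than a trial-level reading-time model, which is the point behind the reviewer's question about degrees of freedom.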

      Consider spelling out all the "simple coding schemes" explicitly. 

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we have now included the modelled contrasts in brackets after each predictor variable.

      Example from line 527: “In both models, we included recording location (online vs. lab), cognitive load (1-back and 2-back Dual Task vs. Reading Only as the reference level) and continuously measured age (centred) as well as the interaction of age and cognitive load as fixed effects”.

      The relationship between comprehension accuracy and strategies for color judgement is unclear or not intuitive. 

      We thank the reviewer for this helpful comment. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour - or sequence of colours - from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions. However, we agree that this distinction may not have been entirely clear, and we have now added a brief clarification in the Methods section to address this point (see line 534):  

      “Please note that we did not control for trial-level stimulus colour here. The n-back task, which required participants to judge colours, was administered at the single-trial level, with colours pseudorandomised to prevent any specific colour - or sequence of colours - from occurring more frequently than others. In contrast, comprehension questions were presented at the end of each block, meaning that trial-level stimulus colour was unrelated to accuracy on the block-level comprehension questions”.

      Could you explain why comprehension accuracy is not modeled in the same way as d-prime, i.e., with a similar set of predictors? 

      This is a very good point. After each block, participants answered three comprehension questions that were intentionally designed to be easy: they could all be answered correctly after having read the corresponding text, but not by common knowledge alone. The purpose of these questions was primarily to ensure participants paid attention to the texts and to allow exclusion of participants who failed to understand the material even under minimal cognitive load. As comprehension accuracy was modelled at the block level with 3 questions per block, participants could achieve only discrete scores of 0%, 33.3%, 66.7%, or 100%. Most participants showed uniformly high accuracy across blocks, as expected if the comprehension task fulfilled its purpose. However, this limited variance in performance caused convergence issues when fitting a comprehension-accuracy model at the same level of complexity as the d′ model. To model comprehension accuracy nonetheless, we therefore opted for a reduced model complexity in this analysis.

      RT of previous word: The motivations described in the Methods, such as post-error-slowing and sequential modulation effects, lack supporting evidence. The actual scope of what this variable may account for is unclear.  

      We are happy to elaborate further regarding the inclusion of this predictor. Reading times, like many sequential behavioral measures, exhibit strong autocorrelation (Schuckart et al., 2025, doi: 10.1101/2025.08.19.670092). That is, the reading time of a given word is partially predictable from the reading time of the previous word(s). Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the log reading time of the preceding trial as a covariate. This approach removes variance attributable to prior behavior, ensuring that the estimated effects reflect the influence of surprisal and cognitive load on the current word, rather than residual effects of preceding trials. We have now added this explanation to the manuscript (see line 553):

      “Additionally, it is important to consider that reading times, like many sequential behavioural measures, exhibit strong autocorrelation (Schuckart et al., 2025), meaning that the reading time of a given word is partially predictable from the reading time of the previous word. Such spillover effects can confound attempts to isolate trial-specific cognitive processes. As our primary goal was to model single-word prediction, we explicitly accounted for this autocorrelation by including the reading time of the preceding trial as a covariate”.  
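      The covariate described above can be sketched as a simple lag-1 transformation within a block. This is an illustrative sketch only: the function name and the example reading times are hypothetical, and dropping the first trial of a block (which has no predecessor) is an assumption about handling that the response does not specify.

```python
import math


def add_previous_log_rt(reading_times_ms):
    """Pair each trial's log reading time with that of the preceding trial.

    The first trial has no predecessor and is dropped (assumed handling).
    """
    log_rts = [math.log(rt) for rt in reading_times_ms]
    return [
        {"log_rt": current, "prev_log_rt": previous}
        for previous, current in zip(log_rts, log_rts[1:])
    ]


# Hypothetical reading times (ms) for four consecutive words in one block.
rows = add_previous_log_rt([320.0, 410.0, 295.0, 350.0])
```

      Each resulting row can then enter a regression in which `prev_log_rt` absorbs variance carried over from the previous word, leaving the remaining predictors to explain the current word's reading time.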

      Block-level d-prime: It was shown with the d-prime performance model that block-level d-prime is a function of many of the reading-related variables. Therefore, it is not justified to use them here as "a proxy of each participant's working memory capacity."

      We thank the reviewer for their comment. We would like to clarify that the d-prime performance model indeed included only dual-task d-primes (i.e., d-primes obtained while participants were simultaneously performing the reading task). In contrast, the predictor in question is based on single-task d-primes, which are derived from the n-back task performed in isolation. While dual- and single-task d-primes may be correlated, they capture different sources of variance, justifying the use of single-task d-primes here as a measure of each participant’s working memory capacity.

      Word frequency is entangled with entropy and surprisal. Suggest removal.

      We appreciate the reviewer’s comment. While word frequency is correlated with word surprisal, its inclusion does not affect the interpretation of the other predictors and does not introduce any bias. Moreover, it is a theoretically important control variable in reading research. Since we are interested in the effects of surprisal and entropy beyond potential biases through word length and frequency, we believe these are important control variables in our model. Moreover, checks for collinearity confirmed that word frequency was neither strongly correlated with surprisal nor entropy. In this sense, including it is largely pro forma: it neither harms the model nor materially changes the results, but it ensures that the analysis appropriately accounts for a well-established influence on word processing.

      Entropy reflects the cognitive load of word prediction. It should be investigated in parallel and with similar depth as surprisal (which reflects the load of integration).

      This is an excellent point that warrants further investigation, especially since the previous literature on the effects of entropy on reading time is scarce and somewhat contradictory. We have thus added additional analyses and now report the effects of cognitive load, entropy, and age on reading time (see sections “Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Results, “Control Analysis: Disentangling the Effect of Cognitive Load on Pre- and Post-Stimulus Predictive Processing” in the Methods as well as Fig. S7 and Table S6 in the Supplements for full results). In brief, we observe a significant three-way interaction among age, cognitive load, and entropy. Specifically, while all participants benefit from low entropy under high cognitive load, reflected by shorter reading times, in the baseline condition this benefit is observed only in older adults. Interestingly, in the baseline condition with minimal cognitive load, younger adults even show a benefit from high entropy. Thus, although the overall pattern for entropy partly mirrors that for surprisal – older adults showing increased reading times when word entropy is high and generally greater sensitivity to entropy variations – the effects differ in one important respect. Unlike for surprisal, the detrimental impact of increased word entropy is more pronounced under high cognitive load across all participants.

      Reviewer #2 (Recommendations for the authors):

      I agree in relation to prediction/load, but I am concerned (actually very concerned) that prediction needs to be assessed with respect to age. I suspect this is one reason why there is so much inconsistency in the effects of age in prediction and, indeed, comprehension more generally. I think the authors should either deal with it appropriately or drop it from the manuscript.

      Thank you for raising this important concern. It is true that prediction is a highly individual, complex process as it depends upon the experiences a person has made with language over their lifespan. As such, one-size-fits-all approaches are not sufficient to model predictive processing. In our study, we thus took particular care to ensure that our analyses captured both age-related and other interindividual variability in predictive processing.

      First, in our statistical models, we included age not only as a nuisance regressor, but also assessed age-related effects in the interplay of surprisal and cognitive load. By doing so, we explicitly model potential age-related differences in how individuals of different ages predict language under different levels of cognitive load.

      Second, we hypothesised that predictive processing might also be influenced by a range of interindividual factors beyond age, including language exposure, cognitive ability, and more transient states such as fatigue. To capture such variability, all models included by-subject random intercepts and slopes, ensuring that unmodelled individual differences were statistically accommodated.

      Together, these steps allow us to account for both systematic age-related differences and residual individual variability in predictive processing. We are therefore confident that our findings are not confounded by unmodelled age-related variability.

      Line 18, do not confuse prediction (or pre-activation) with predictability. Predictability effects can be due to integration difficulty. See Pickering and Gambi 2018 for discussion. The discussion then focuses on graded parallel predictions, but there is also a literature concerned with the prediction of one word, typically using the "visual world" paradigm (which is barely cited - Reference 60 is an exception). In the next paragraph, I would recommend discussing the N400 literature (particularly Federmeier). There are a number of reading time studies that investigate whether there is a cost to a disconfirmed prediction - often finding no cost (e.g., Frisson, 2017, JML), though there is some controversy and apparent differences between ERP and eye-tracking studies (e.g., Staub). This literature should be addressed. In general, I appreciate the value of a short introduction, but it does seem too focused on neuroscience rather than the very long tradition of behavioural work on prediction and predictability.

      We thank the reviewer for this suggestion. In the revised manuscript, we have clarified the relevant section of the introduction to avoid confusion between predictability and predictive processing, thereby improving conceptual clarity (see line 16).

      “Instead, linguistic features are thought to be pre-activated broadly rather than following an all-or-nothing principle, as there is evidence for predictive processing even for moderately- or low-restraint contexts (Boston et al., 2008; Roland et al., 2012; Schmitt et al., 2021; Smith & Levy, 2013)”.  

      We also appreciate the reviewer’s comment regarding the introduction. While our study is behavioural, we frame it in a neuroscience context because our findings have direct implications for understanding neural mechanisms of predictive processing and cognitive load. We believe that this framing is important for situating our results within the broader literature and highlighting their relevance for future neuroscience research.

      I don't think a two-word context is enough to get good indicators of predictability. Obviously, almost anything can follow "in the", but the larger context about parrots presumably gives a lot more information. This seems to me to be a serious concern - or am I misinterpreting what was done? 

      This is a very important point and we thank the reviewer for raising it. Our goal was to generate word surprisal scores that closely approximate human language predictions. In the manuscript, we report analyses using a 2-word context window, following recommendations by Kuribayashi et al. (2022).

      To evaluate the impact of context length, we also tested longer windows of up to 60 words (not reported). While previous work (Goldstein et al., 2022) shows that GPT-2 predictions can become more human-like with longer context windows, we found that in our stimuli – short newspaper articles of only 300 words – surprisal scores from longer contexts were highly correlated with the 2-word context, and the overall pattern of results remained unchanged. To illustrate, surprisal scores generated with a 10-word context window and surprisal scores generated with the 2-word context window we used in our analyses correlated with Spearman’s ρ = 0.976.

      Additionally, on a more technical note, using longer context windows reduces the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window (e.g., a 50-word context would exclude ~17% of the data).  
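      The ~17% figure follows directly from the text length: with a k-word window, the first k words of an N-word text have no full context, so a fraction k/N of trials is lost. A minimal sketch of this arithmetic, assuming texts of exactly 300 words as stated above (the function name is hypothetical):

```python
def excluded_fraction(text_length, context_window):
    """Fraction of words lacking a full k-word context at the start of a text."""
    return min(context_window, text_length) / text_length


frac_50 = excluded_fraction(300, 50)  # 50-word window on a 300-word text
frac_2 = excluded_fraction(300, 2)    # 2-word window used in the analyses
```

      Here `frac_50` is 50/300, about 0.167, whereas `frac_2` is 2/300, under 1% of trials.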

      Importantly, while a short 2-word context window may introduce additional noise in the surprisal estimates, this would only bias effects toward zero, making our analyses conservative rather than inflating them. Critically, the observed effects remain robust despite this conservative estimate, supporting the validity of our findings.

      However, we agree that this is a particularly important and sensitive point, and have now added a discussion of it to the manuscript (see line 476).

      “Entropy and surprisal scores were estimated using a two-word context window. While short contexts have been shown to enhance GPT-2’s psychometric alignment with human predictions, making next-word predictions more human-like (Kuribayashi et al., 2022), other work suggests that longer contexts can also increase model–human similarity (Goldstein et al., 2022). To reconcile these findings in our stimuli and guide the choice of context length, we tested longer windows and found surprisal scores were highly correlated with the 2-word context (e.g., 10-word vs. 2-word context: Spearman’s ρ = 0.976), with the overall pattern of results unchanged. Additionally, employing longer context windows would have also reduced the number of analysable trials, since surprisal cannot be computed for the first k words of a text with a k-word context window. Crucially, any additional noise introduced by the short context biases effect estimates toward zero, making our analyses conservative rather than inflating them”.

      Line 92, task performance, are there interactions? Interactions would fit with the experimental hypotheses. 

      Yes, we did include an interaction term of age and cognitive load and found significant effects on n-back task performance (d-primes; b = -0.014, t(169.8) = -3.913, p < 0.001), but not on comprehension question accuracy (see table S1 and Fig. S2 in the supplementary material).

      Line 149, what were these values?

      We found surprisal values ranged between 3.56 and 72.19. We added this information in the manuscript (see line 143).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The authors used fluorescence microscopy, image analysis, and mathematical modeling to study the effects of membrane affinity and diffusion rates of MinD monomer and dimer states on MinD gradient formation in B. subtilis. To test these effects, the authors experimentally examined MinD mutants that lock the protein in specific states, including Apo monomer (K16A), ATP-bound monomer (G12V), and ATP-bound dimer (D40A, hydrolysis defective), and compared them to wild-type MinD. Overall, the experimental results support the conclusion that reversible membrane binding of MinD is critical for the formation of the MinD gradient, but that the binding affinities between monomers and dimers are similar.

      The modeling part is a new attempt to use the Monte Carlo method to test the conditions for the formation of the MinD gradient in B. subtilis. The modeling results provide good support for the observations and find that the MinD gradient is sensitive to different diffusion rates between monomers and dimers. This simulation is based on several assumptions and predictions, which raises new questions that need to be addressed experimentally in the future. However, the current story is sufficient without testing these assumptions or predictions.

      Reviewer #2 (Public review): 

      Summary:  

      Bohorquez et al. investigate the molecular determinants of intracellular gradient formation in the B. subtilis Min system. To this end, they generate B. subtilis strains that express MinD mutants that are locked in the monomeric or dimeric states, and also MinD mutants with amphipathic helices of varying membrane affinity. They then assess the mutants' ability to bind to the membrane and form gradients using fluorescence microscopy in different genetic backgrounds. They find that, unlike in the E. coli Min system, the monomeric form of MinD is already capable of membrane binding. They also show that MinJ is not required for MinD membrane binding and only interacts with the dimeric form of MinD. Using kinetic Monte Carlo simulations, the authors then test different models for gradient formation, and find that a MinD gradient along the cell axis is only formed when the polarly localized protein MinJ stimulates dimerization of MinD, and when the diffusion rate of monomeric and dimeric MinD differs. They also show that differences in the membrane affinity of MinD monomers and dimers are not required for gradient formation.

      Strengths:  

      The paper offers a comprehensive collection of the subcellular localization and gradient formation of various MinD mutants in different genetic backgrounds. In particular, the comparison of the localization of these mutants in a delta MinC and MinJ background offers valuable additional insights. For example, they find that only dimeric MinD can interact with MinJ. They also provide evidence that MinD locked in a dimer state may co-polymerize with MinC, resulting in a speckled appearance.  

      The authors introduce and verify a useful measure of membrane affinity in vivo.  

      The modulation of the membrane affinity by using distinct amphipathic helices highlights the robustness of the B. subtilis MinD system, which can form gradients even when the membrane affinity of MinD is increased or decreased.  

      Weaknesses:  

      The main claim of the paper, that differences in the membrane affinity between MinD monomers and dimers are not required for gradient formation, does not seem to be supported by the data. The only measure of membrane affinity presented is extracted from the transverse fluorescence intensity profile of cells expressing the mGFP-tagged MinD mutants. The authors measure the valley-to-peak ratio of the profile, which is lower than 1 for proteins binding to the membrane and higher than 1 for cytosolic proteins. To verify this measure of membrane affinity, they use a membrane dye and a soluble GFP, which results in values of ~0.75 and ~1.25, respectively. They then show that all MinD mutants have a value - roughly in the range of 0.8-0.9 - and they use this to claim that there are no differences in membrane affinity between monomeric and dimeric versions.  

      While this way to measure membrane affinity is useful to distinguish between binders and non-binders, it is unclear how sensitive this assay is, and whether it can resolve more subtle differences in membrane affinity, beyond the classification into binders and non-binders. A dimer with two amphipathic helices should have a higher membrane affinity than a monomer with only one such copy. Thus, the data does not seem to support the claim that "the different monomeric mutants have the same membrane affinity as the wildtype MinD". The data only supports the claim that B. subtilis MinD monomers already have a measurable membrane affinity, which is indeed a difference from the E. coli Min system.  

      While their data does show that a stark difference between monomer and dimer membrane affinity may not be required for gradient formation in the B. subtilis case, it is also not prevented if the monomer is unable to bind to the membrane. They show this by replacing the native MinD amphipathic helix with the weak amphipathic helix NS4AB-AH. According to their membrane affinity assay, NS4AB-AH does not bind to the membrane as a monomer (Figure 4D), but when this helix is fused to MinD, MinD is still capable of forming a gradient (albeit a weaker one). Since the authors make a direct comparison to the E. coli MinDE systems, they could have used the E. coli MinD MTS instead or in addition to the NS4AB-AH amphipathic helix. The reviewer suspects that a fusion of the E. coli MinD MTS to B. subtilis MinD may also support gradient formation.  

      The paper contains insufficient data to support the many claims about cell filamentation and minicell formation. In many cases, statements like "did not result in cell filamentation" or "restored cell division" are only supported by a single fluorescence image instead of a quantitative analysis of cell length distribution and minicell frequency, as the one reported for a subset of the data in Figure 5.  

      The paper would also benefit from a quantitative measure of gradient formation of the distinct MinD mutants, instead of relying on individual fluorescent intensity profiles.  

      The authors compare their experimental results with the oscillating E. coli MinDE system and use it to define some of the rules of their Monte Carlo simulation. However, the description of the E. coli Min system is sometimes misleading or based on outdated findings.

      The Monte Carlo simulation of the gradient formation in B. subtilis could benefit from a more comprehensive approach:

      (1) While most of the initial rules underlying the simulation are well justified, the authors do not implement or test two key conditions:

      (a) Cooperative membrane binding, which is a key component of mathematical models for the oscillating E. coli Min system. This cooperative membrane binding has recently been attributed to MinD or MinCD oligomerization on the membrane and has been experimentally observed in various instances; in fact, the authors themselves show data supporting the formation of MinCD copolymers.  

      (b) Local stimulation of the ATPase activity of MinD, which triggers the dimer-to-monomer transition; E. coli MinD ATP hydrolysis is stimulated by the membrane and by MinE, so B. subtilis MinD may also be stimulated by the membrane and/or other components like MinJ. Instead, the authors claim that (a) would only increase differences in diffusion between the monomer and different oligomeric species, and that a 2-fold increase in dimerization on the membrane could not induce gradient formation in their simulation, in the absence of MinJ stimulating gradient formation. However, a 2-fold increase in dimerization is likely way too low to explain any cooperative membrane binding observed for the E. coli Min system. Regarding (b), they also claim that implementing stimulation of ATP hydrolysis on the membrane (dimer-to-monomer transition) would not change the outcome, but no simulation result for this condition is actually shown.

      (2) To generate any gradient formation, the authors claim that they would need to implement stimulation of dimer formation by MinJ, but they themselves acknowledge the lack of any experimental evidence for this assertion. They then test all other conditions (e.g., differences in membrane affinity, diffusion, etc.) in addition to the requirement that MinJ stimulates dimer formation. It is unclear whether the authors tested all other conditions independently of the "MinJ induces dimerization" condition, and whether either of those alone or in combination could also lead to gradient formation. This would be an important test to establish the validity of their claims.

      Reviewer #3 (Public review): 

      This important study by Bohorquez et al. examines the determinants necessary for concentrating the spatial modulator of cell division, MinD, at the future site of division and the cell poles. Proper localization of MinD is necessary to bring the division inhibitor, MinC, in proximity to the cell membrane and cell poles where it prevents aberrant assembly of the division machinery. In contrast to E. coli, in which MinD oscillates from pole to pole courtesy of a third protein MinE, how MinD localization is achieved in B. subtilis - which does not encode a MinE analog - has remained largely a mystery. The authors present compelling data indicating that MinD dimerization is dispensable for membrane localization but required for concentration at the cell poles. Dimerization is also important for interactions between MinD and MinC, leading to the formation of large protein complexes. Computational modeling, specifically a Monte Carlo simulation, supports a model in which differences in diffusion rates between MinD monomers and dimers lead to the concentration of MinD at cell poles. Once there, interaction with MinC increases the size of the complex, further reinforcing diffusion differences. Notably, interactions with MinJ, which has previously been implicated in MinCD localization, are dispensable for concentrating MinD at cell poles, although MinJ may help stabilize the MinCD complex at those locations.

      Reviewer #1 (Recommendations for the authors):  

      (1) The title could be modified to better reflect the emphasis on MinD monomer and dimer diffusion rather than the fact that membrane affinity is not important in MinD gradient formation. In addition, because membrane association requires affinity for the membrane, this title seems inconsistent with statements in the main text, such as Lines 246-247: a reversible membrane association is important for the formation of a MinD gradient along the cell axis.

      We agree with the reviewer that the title can be more accurate, and we have now changed it to “Membrane affinity difference between MinD monomer and dimer is not crucial to MinD gradient formation in Bacillus subtilis”

      (2) This paper reports that the difference in diffusion rates between MinD monomers and dimers is an important factor in the formation of Bs MinD gradients. However, one can argue for the importance of MinD monomers in the cellular context. Since the abundance of ATP in cells often far exceeds the abundance of MinD protein molecules under experimental conditions, MinD can easily form dimers in the cytoplasm. How does the author address this problem?  

      It is a good point that ATP concentration in the cell likely favours dimers in the cytoplasm. However, what is important in our model is that there is cycling between monomer and dimer, rather than where exactly this happens. In fact, the gradient forms essentially equally well if dimers can become monomers only whilst they are at the membrane, as we have mentioned in the manuscript (lines 324-326 in the original manuscript). However, in the original manuscript this simulation was not shown, and now we have included this in the new Fig. 8D & E.

      (3) The claim "This oscillating gradient requires cycling of MinD between a monomeric cytosolic and a dimeric membrane attached state." (Lines 46, 47) is not well supported by most current studies and needs to be revised since to my knowledge, most proposed models do not consider the monomer state. The basic reaction steps of Ec Min oscillations include ATP-bound MinD dimers attaching to the membrane that subsequently recruit more MinD dimers and MinE dimers to the membrane; MinE interactions stimulate ATP hydrolysis in MinD, leading to dissociation of ADP-bound MinD dimers from the membrane; nucleotide exchange occurs in the cytoplasm.

      Here the reviewer refers to a sentence in a short “Importance” abstract that we had added. In fact, such an abstract is not necessary, so we have removed it. Of note, the E. coli MinD oscillation, including the role of MinE, is described in detail in the Introduction.

      A recent reference is a paper by Heermann et al. (2020; doi: 10.1016/j.jmb.2020.03.012), which considers the MinD monomer state, which is not mentioned in this work. How do their observations compare to this work?  

      The Heermann paper mentions that MinD bound to the membrane displays an interface for multimerization, and that this contributes to the local self-enhancement of MinD at the membrane. In our Discussion, we do mention that E. coli MinD can form polymers in vitro and that any multimerization of MinD dimers will further increase the diffusion difference between monomer and dimer, and might contribute to the formation of a protein gradient (lines 459-467). We have now included a reference to the Heermann paper (line 461).

      (4) Throughout the manuscript, errors in citing references were found in several places.

      We have corrected this where suggested.

      (5) The introduction may be somewhat misleading due to mixed information from experimental cellular results, in vitro reconstructions, and theoretical models in cells or in vitro environments. Some models consider space constraints, while others do not. Modifications are recommended to clarify differences.  

      See below for responses 

      (6) The citation for MinD monomers:

      The paper by Hu and Lutkenhaus (2003, doi: 10.1046/j.1365-2958.2003.03321.x.) contains experimental evidence showing monomer-dimer transition using purified proteins. Another paper by the same laboratory (Park et al. 2012, doi: 10.1111/j.1365-2958.2012.08110.x.) explained how ATP-induced dimerization, but this paper is not cited.  

      The Park et al. 2012 paper focuses on the asymmetric activation of MinD ATPase by MinE, which goes beyond the scope of our work. However, we have cited several other papers from the Lutkenhaus lab, including the Wu et al. 2011 paper describing the structure of the MinD-ATP complex.

      Other evidence comes from structural studies of Archaea Pyrococcus furiosus (1G3R) and Pyrococcus horikoshii (1ION), and thermophilic Aquifex aeolicus (4V01, 4V02, 4V03). As they may function differently from Ec MinD, they are less relevant to this manuscript.

      We agree. 

      (7) Lines 65, 66: Using the term 'a reaction-diffusion couple' to describe the biochemical facts by citing references of Hu and Lutkenhaus (1999) and Raskin and de Boer (1999) is not appropriate. The idea that the Min system behaves as a reaction-diffusion system was started by Howard et al. (2001), Meinhardt and de Boer (2001), and Huang et al. (2003). In addition, references for MinE oscillation are missing.

      We have now corrected this (line 52).

      (8) Lines 77-79: Citations are incorrect.

      ATP-induced dimerization: Hu and Lutkenhaus (2003, DOI: 10.1046/j.1365-2958.2003.03321.x), Park et al. (2012). C-terminal amphipathic helix formation: Szeto et al. (2003), Hu and Lutkenhaus (2003, DOI: 10.1046/j.1365-2958.2003.03321.x).

      Citations have been corrected.

      (9) Line 78: The C-terminal amphipathic helix is not pre-formed and then exposed upon conformational change induced by ATP-binding. This alpha-helical structure is an induced fold upon interaction with membranes as experimentally demonstrated by Szeto et al. (2003).  

      We have adjusted the text to correct this (lines 64-66).

      (10) Line 102: 'cycles between membrane association and dissociation of MinD' also requires MinE in addition to ATP.

      We believe that in the context of this sentence and following paragraph it is not necessary to again mention MinE, since it is focused on parallels between the E. coli and B. subtilis MinD membrane binding cycles.

      (11) In the introduction, could the author briefly explain to a general audience the difference between Monte Carlo and reaction-diffusion methods? How do different algorithms affect the results?

      The main difference between the kinetic Monte Carlo and typical reaction-diffusion methods that is relevant to our work is that the first is particle-based, and naturally includes statistical fluctuations (noise), whereas the second is field-based, and in its normal implementation is deterministic, so does not include noise. Whilst one can in principle include noise in field-based reaction-diffusion methods, this is rarely done. Additionally, although we do not do this here, the kinetic Monte Carlo can also account, in principle, for particle shape (sphere versus rod), or for localized interactions (as sticky patches on the surface): therefore the kinetic Monte Carlo is more microscopic in nature. We have now briefly described the difference in lines 102-105.
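
      The particle-based versus field-based distinction can be illustrated with a toy one-dimensional diffusion example. This is a minimal sketch under stated assumptions, not the algorithm used in the manuscript: the lattice, the hop probability, the reflecting walls and the function names `particle_diffusion` and `field_diffusion` are all hypothetical choices made for illustration.

```python
import random


def particle_diffusion(n_particles, n_sites, n_steps, hop_prob, seed):
    """Particle-based (Monte Carlo style): each particle hops left or right
    with probability `hop_prob` per step; hops through a wall are rejected."""
    rng = random.Random(seed)
    positions = [n_sites // 2] * n_particles  # all start at the centre site
    for _ in range(n_steps):
        for i in range(n_particles):
            r = rng.random()
            if r < hop_prob:
                step = -1
            elif r < 2 * hop_prob:
                step = 1
            else:
                continue  # particle stays put this step
            new_pos = positions[i] + step
            if 0 <= new_pos < n_sites:
                positions[i] = new_pos
    counts = [0] * n_sites
    for pos in positions:
        counts[pos] += 1
    return counts


def field_diffusion(n_particles, n_sites, n_steps, hop_prob):
    """Field-based: deterministic update of a concentration field with
    no-flux boundaries (the mean behaviour of the particle model above)."""
    conc = [0.0] * n_sites
    conc[n_sites // 2] = float(n_particles)
    for _ in range(n_steps):
        new = conc[:]
        for i in range(n_sites):
            left = conc[i - 1] if i > 0 else conc[i]
            right = conc[i + 1] if i < n_sites - 1 else conc[i]
            new[i] = conc[i] + hop_prob * (left - 2 * conc[i] + right)
        conc = new
    return conc


run_a = particle_diffusion(200, 21, 50, 0.25, seed=1)
run_b = particle_diffusion(200, 21, 50, 0.25, seed=2)
field = field_diffusion(200, 21, 50, 0.25)
print("particle run A:", run_a)
print("particle run B:", run_b)
print("field (rounded):", [round(c, 1) for c in field])
```

      In this sketch the two particle runs scatter around the same smooth profile, whereas the field calculation returns that profile identically on every call, which is the noise difference described above.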

      (12)  Lines 126-128: The second part of the sentence uses the protein structure of Pyrococcus furiosus MinD (Ref 37) to support a protein sequence comparison between Ec and Bs MinD. However, the structure of the dimeric E. coli MinD-ATP complex (3Q9L) is available, which is Reference 38 that is more suited for direct comparison.

      To discuss monomeric MinD from P. furiosus, it will be useful to include it in the primary sequence alignment in Figure S1.

      We do not think that this detailed information is necessary to add to Figure S1, since the mutants have been described before (appropriate citations present in the text).

      (13) Lines 127, 166: Where Figure S1 is discussed, a structural model of MinD will be useful alongside with the primary sequence alignment.

      We do not think that this detailed information is necessary to understand the experiments since the mutants have been described before.

      (14) Lines 131-132: Reference is missing for the sentence of " the conserved..."; Reference 38.  In Reference 38, there is no experimental evidence on G12 but inferred from structure analysis. Reference 26 discusses ATP and MinE regulation on the interactions between MinD and phospholipid bilayers; not about MinD dimerization.

      We have corrected this and added the proper references. 

      For easy reading, the mutant MinD phenotypes can be indicated here instead of in the figure legends, including K16A (apo monomer), MinD G12V (ATP-bound monomer), and MinD D40A (ATP-bound dimer, ATP hydrolysis deficient).  

      We have added the suggested descriptions of the mutants in the main text.

      (15) Lines 150-151: Unlike Ec MinD, which forms a clear gradient in one half of the cell, Bs MinD (wild type) mainly accumulates at the hemispheric poles. What percentage of a cell (or cell length) can be covered by the Bs MinD gradient? How does the shaded area in the longitudinal FIP compare to the area of the bacterial hemispherical pole? If possible, it might be interesting to compare with the range of nucleoid occlusion mechanisms that occur.

      Part of the MinD gradient covers the nucleoid area, since the fluorescence signal is still visible along the cell length, yet there is no sudden drop in fluorescence, suggesting that nucleoid exclusion does not play a role.

      (16)  Line 160: In addition to summarizing the membrane-binding affinity, descriptions of the differences in the gradient distribution or formation will be useful.  

      We have done this in lines 155-156 of the original manuscript: “The monomeric ATP binding G12V variant shows the same absence of a protein gradient as the K16A variant”.

      (17) Line 262: 'distribution' is not shown.  

      We do not understand this remark. This information is shown in Fig. 5B (now Fig. 6B).

      (18)  Line 287: Wrong citation for reference 31.

      Reference has been corrected.

      (19)  Line 288 and lines 596 regarding the Monte Carlo simulation:

      (a)  An illustration showing the reaction steps for MinD gradient formation will help understand the rationale and assumptions behind this simulation.

      We have added an illustration depicting the different modelling steps in the new Fig. 8.

      (b)  Equations are missing.

      (c)   A table summarizing the parameters used in the simulation and their values.

      (d)  For general readers, it will be helpful to convert the simulation units to real units.

      (e)  Indicate real experimental data with a citation or the reason for any speculative value.

      The Methods section provides a discussion of all parameters used in the potentials on which our kinetic Monte-Carlo algorithm is based. We have now also provided a Table in the SI (Table S1) with typical parameter values in both simulation units and real units. The experimental data and reasoning behind the values chosen are discussed in the Methods section (see “Kinetic Monte Carlo simulation”).

      (20)  Lines 320-321: Reference missing.

      The interaction between MinJ and the dimer form of MinD is based on our findings shown in the original Fig. S4, and this information has not been published before. We have rephrased the sentence to make it more clear. Of note, Fig. S4 has been moved to the main manuscript, at the request of reviewer #2, and is now new Fig. 2. 

      (21)  Lines 355-359: Is the statement specifically made for the Bs Min system? Is there any reference for the statement? Isn't the differences in diffusion rates between molecules 'at different locations' in the system more important than reducing their diffusion rates alone? It is unclear about the meaning of the statement "the Min system uses attachment to the membrane to slow down diffusion". Is this an assumption in the simulation?

      The statement is generic; however, the reviewer has a good point, and we have made this statement clearer by changing “considerably reduced diffusion rate” to “locally reduced diffusion rate” (line 359).

      (22) Line 403: Citation format.

      We have corrected the text and citation.

      (23) Lines 442-444: The parameters are not defined anywhere in the manuscript.

      Discussed in the M&M and in the new Table S1.

      (24) Lines 464-465: Regarding the final sentence, what does 'this prediction' refer to? Hasn't the author started with experimental observations, predicted possible factors of membrane affinity and diffusion rates, and used the simulation approach to disapprove or support the prediction?

      We have changed “prediction” to “suggestion”, to make it clear that it is related to the suggestion in the previous sentence that  “our modelling suggests that stimulation of MinD-dimerization at cell poles and cell division sites is needed.” (line 471).

      (25) Materials and Methods: Statistical methods for data analyses are missing.

      Added to “Microscopy” section.

      (26) References: References 34, 40, 51 are incomplete.

      References 34 and 40 have been corrected. Reference 51 is a book.

      (27)  Figures: The legends (Figures 1-7) can be shortened by removing redundant details in Material and Methods. Make sure statistical information is provided. The specific mutant MinD states, including Apo monomer, ATP-bound dimer, ATP hydrolysis deficient, and non-membrane binding etc can be specified in the main text. They are repeated in the legends of Figures 1 and 2.

      We have removed redundant details from the legends and provided statistical information.

      (28)  Supporting information:

      Table S1: Content of the acknowledgment statement may be moved to materials and methods and the acknowledgment section. Make sure statistical information is provided in the supporting figure legends.

      We are not sure what the reviewer means with the content acknowledgement in Table S1 (now Table S2). Statistical information has been added.

      Figure S1. Adding a MinD structure model will be useful.

      We do not think that a structural model will clarify our results, since our work is not focused on structural mutagenesis. The mutants that we use have been described in other papers that we have cited.

      Reviewer #2 (Recommendations for the authors):  

      The authors should cite and relate their data to the preprint by Feddersen & Bramkamp, BioRxiv 2024. ATPase activity of B. subtilis MinD is activated solely by membrane binding.

      We have now discussed this paper in relation to our data in lines 407-409. 

      I am not convinced the authors are able to make the statement in lines 160-161 based on their assay: "This confirmed that the different monomeric mutants have the same membrane affinity as wild-type MinD". It is unclear if measuring valley-to-peak ratios in their longitudinal profiles can resolve small differences in membrane affinity. Wildtype MinD should at least be dimeric, or (as the authors also note elsewhere) may even be present in higher-order structures and as such have a higher membrane affinity than a monomeric MinD mutant. The authors should rephrase the corresponding sections in the manuscript to state that the MinD monomer already has detectable membrane affinity, instead of stating that the monomer and dimer membrane affinity are the same.

      We agree that “the same affinity” is too strongly worded, and we have now rephrased this by saying that the different monomeric mutants have a membrane affinity comparable to that of wild-type MinD (line 152).

      According to the authors' analysis, MinD-NS4B would not bind to the membrane as it has a valley-to-peak ratio higher than 1, similar to the soluble GFP. However, the protein is clearly forming a gradient, and as such probably binding to the membrane. The authors should discuss this as a limitation of their membrane binding measure.

      The ratio value of 1 is not a cutoff for membrane binding. As shown in Fig. 1F, GFP has a valley-to-peak ratio close to 1.25, whereas the FM5-95 membrane dye has a ratio close to 0.75. In Fig. 3C (now Fig. 4C) we have shown that GFP fused with the NS4B membrane anchor has a lower ratio than free GFP, and we have shown the same in Fig. 4D (now Fig. 5D) for GFP-MinD-NS4B. The differences are small but clear, and the ratios are not similar to that of GFP.
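
      For readers unfamiliar with the measure, a valley-to-peak ratio of this kind can be sketched as follows. The exact definition used in the manuscript is not reproduced in this response, so the sketch assumes one plausible reading: the intensity at the midpoint of the transverse profile divided by the mean intensity at the two membrane positions. The function name and the two synthetic profiles are hypothetical and are not measured data.

```python
def valley_to_peak_ratio(profile, membrane_left, membrane_right):
    """Ratio of mid-cell intensity to the mean intensity at the two membrane
    positions of a transverse profile (assumed definition, see lead-in)."""
    if not 0 <= membrane_left < membrane_right < len(profile):
        raise ValueError("membrane indices must lie inside the profile")
    centre = (membrane_left + membrane_right) // 2
    valley = profile[centre]
    peak = (profile[membrane_left] + profile[membrane_right]) / 2
    return valley / peak


# Synthetic profiles across the cell width (arbitrary units).
membrane_like = [1.0, 4.0, 3.0, 4.0, 1.0]   # brightest at the two edges
cytosol_like = [1.0, 4.0, 5.0, 4.0, 1.0]    # brightest at mid-cell

print(valley_to_peak_ratio(membrane_like, 1, 3))  # below 1
print(valley_to_peak_ratio(cytosol_like, 1, 3))   # above 1
```

      With this definition a membrane-enriched signal gives a ratio below 1 and a cytosolic signal a ratio above 1, so intermediate values would indicate partial membrane association rather than a binary cutoff.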

      The observation that MinD dimers are localized by MinJ is interesting and key to the rule of the Monte Carlo simulation that dimers attach to MinJ. However, the data is hidden in the supplementary information and is not analysed as comprehensively, e.g., it lacks the analysis of the membrane binding. The paper would benefit from moving the fluorescence images and accompanying analysis into the main text.  

      We have moved this figure to the main text and added an analysis of the fluorescence intensities (new Fig. 2).

      The authors should show the data for cell length and minicell formation, not only for the MinD-amphipathic helix versions (Fig. 5), but also for the GFP-MinD, and all the MinD mutants. They do refer to some of this data in lines 145-148 but do not show it anywhere. They also refer to "did not result in cell filamentation" in line 213 and to "resulted in highly filamentous cells" and "Introduction of a minC deletion restored cell division" in lines 167-160 without showing the cell length and minicell data, but instead refer to the fluorescence image of the respective strain. I would suggest the authors include this data either in a subpanel in the respective figure or in the supplementary information.

      The effect of uncontrolled MinC activity is very apparent and leads to long filamentous cells. The occurrence of minicells is also apparent. The cell length distribution of wild-type cells is shown in Fig. 6B, and minicell formation is negligibly small in wild-type cells.

      The transverse fluorescence intensity profiles used as a measure for membrane binding are an average profile from ~30 cells. In the case of the longitudinal profiles that display the gradient, only individual profiles are displayed. I understand that because cell lengths differ, the longitudinal profiles cannot simply be averaged. However, it is possible to project the profiles onto a unit length for averaging (see for example the projection of profiles in McNamara et al., BioRxiv (2023)). It would be more convincing to average these profiles, which would allow the authors to also quantify the gradients in more detail. If that is impossible, the authors may at least quantify individual valley-to-peak ratios of the longitudinal fluorescence profiles as a measure of the gradient.

      We agree that in future work it would be better to average the profiles as suggested. However, due to limited time and resources, we cannot do this for the current manuscript.
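
      For reference, the unit-length projection suggested by the reviewer could be sketched along the following lines. This is a generic linear-interpolation approach written for illustration; it is not taken from the cited preprint or from our analysis pipeline, and the function names and the two toy profiles are hypothetical.

```python
def resample_to_unit_length(profile, n_points=50):
    """Linearly interpolate a longitudinal profile onto `n_points` positions
    evenly spaced from 0 (one pole) to 1 (the other pole)."""
    n = len(profile)
    if n < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    resampled = []
    for j in range(n_points):
        x = j / (n_points - 1) * (n - 1)  # fractional index into the profile
        i = min(int(x), n - 2)            # left neighbour, clamped at the end
        frac = x - i
        resampled.append(profile[i] * (1 - frac) + profile[i + 1] * frac)
    return resampled


def average_profiles(profiles, n_points=50):
    """Average profiles from cells of different lengths after projecting
    each one onto the same unit-length axis."""
    resampled = [resample_to_unit_length(p, n_points) for p in profiles]
    return [sum(column) / len(resampled) for column in zip(*resampled)]


# Two toy cells of different length with the same underlying linear gradient.
short_cell = [0.0, 1.0, 2.0]
long_cell = [0.0, 0.5, 1.0, 1.5, 2.0]
print(average_profiles([short_cell, long_cell], n_points=5))
```

      Once profiles share a common axis, a gradient measure such as a pole-to-mid-cell ratio could be computed on the averaged curve rather than on individual cells.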

      Regarding the rules and parameters used for the Monte Carlo simulation (see also the corresponding sections in the public review):

      (1) The authors mention that they have not included multimerization of MinD in their simulation but argue in the discussion that it would only strengthen the differences in the diffusion between monomers and multimers. This is correct, but it may also change the membrane residence time and membrane affinity drastically.

      Simulation of multimerization is difficult, but we have now included a simulation whereby MinD dimers can also form tetramers (lines 341-348), shown in the new Fig. 8K. This did not alter the MinD gradient much. 

      (2) The authors implement a dimer-to-monomer transition rate that they equate with the stochastic ATP hydrolysis rate occurring with a half-life of approximately 1/s (line 305). They claim that this rate is based on information from E. coli and cite Huang and Wingreen. However, the Huang paper only mentions the nucleotide exchange rate from ADP to ATP at 1/s. Later that paper cites their use of an ATP hydrolysis rate of 0.7/s to match the E. coli MinDE oscillation rate of 40s. From the authors' statement, it is unclear to me whether they refer to the actual ATP hydrolysis rate in Huang and Wingreen or something else. For E. coli MinD, both the membrane and MinE stimulate ATPase activity. Even if B. subtilis lacks MinE, ATP hydrolysis may still be stimulated by the membrane, which has also been reported in another preprint (Feddersen & Bramkamp, BioRxiv 2024). It may also be stimulated by other components of the Min system like MinJ. The authors should include in the manuscript the Monte Carlo simulation implementing dimer to monomer transition on the membrane only, which is currently referred to only as "(data not shown)". 

      The exact value of the ATP hydrolysis rate is not so important here, so 1/s only gives the order of magnitude (in line with 0.7/s above), which we have now clarified in lines 631-632. We have now also added the “(data not shown)” results to Fig. 8, i.e. simulations where dimer to monomer transitions (i.e. ATPase activity) occur only at the membrane (Fig. 8D & E, and lines 319-322).

      (3) How long did the authors simulate for? How many steps? What timesteps does the average pictured in Figure 7 correspond to?

      We simulated 10^7 time steps (corresponding to 100 s in real time). We have checked that the simulation steps for which we average are in steady state. Typical snapshots are recorded after 10^6-10^7 time steps, when the system is in steady state. We have added this information in lines 299-300.
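
      The numbers quoted in this exchange (10^7 time steps for 100 s, and a dimer-to-monomer rate of order 1/s) fix the time step and the approximate per-step transition probability. The sketch below works this out under the assumption that the transition is a first-order process with exponentially distributed waiting times; how the authors' simulation actually converts the rate into a per-step probability is not stated here, and the names are hypothetical.

```python
import math

TOTAL_STEPS = 10**7     # number of simulated time steps (from the response)
TOTAL_TIME_S = 100.0    # corresponding real time in seconds (from the response)
RATE_PER_S = 1.0        # order-of-magnitude dimer-to-monomer rate (from the response)


def per_step_probability(rate_per_s, dt_s):
    """Probability that a first-order event with the given rate occurs within one step."""
    # expm1 avoids loss of precision when rate * dt is very small.
    return -math.expm1(-rate_per_s * dt_s)


dt = TOTAL_TIME_S / TOTAL_STEPS                  # seconds per time step
p_step = per_step_probability(RATE_PER_S, dt)    # close to rate * dt for small dt
half_life_s = math.log(2) / RATE_PER_S           # half-life implied by the assumed rate
```

      Because the rate times the time step is very small, the exact expression and the linear approximation (rate times time step) are practically indistinguishable, so the order of magnitude of the rate is what matters.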

      There are several misconceptions about the (oscillating E. coli) Min system in the main text:

      (1) Lines 77-78: "In case of the E. coli MinD, ATP binding leads to dimerization of MinD, which induces a conformational change in the C-terminal region, thereby exposing an amphipathic helix that functions as a membrane binding domain" and "This shows a clear difference with the E. coli situation, where dimerization of MinD causes a conformational change of the C-terminal region enabling the amphipathic helix to insert into the lipid bilayer" in lines 400-403 are incorrect. There is no evidence that the amphipathic helix at the C-terminus of MinD changes conformation upon ATP binding; several studies have shown instead that a single copy of the amphipathic helix is too weak to confer efficient membrane binding but that the dimerization confers increased membrane binding as now two amphipathic helices are present leading to an avidity effect in membrane binding. Please refer to the following papers (Szeto et al., JBC (2003); Wu et al., Mol Microbiol (2011); Park et al., Cell (2011); Heermann et al., JMB (2020); Loose et al., Nat Struct Mol Biol (2011); Kretschmer et al., ACS Syn Biol (2021); Ramm et al., Nat Commun (2018)) or, for a better overview, the following reviews on the topic of the E. coli Min system: Wettmann and Kruse, Philos Trans R Soc B Biol (2018); Ramm et al., Cell and Mol Life Sci (2019); Halatek et al., Philos Trans R Soc B Biol Sci (2018).

      This is indeed incorrectly formulated, and we have now amended this in lines 64-66 and lines 403-406. Key papers are cited in the text.

      (2) The authors mention that E. coli MinD may multimerize, citing a study where purified MinD was found to polymerize, and then suggest that this is unlikely to be the case in B. subtilis as FRAP recovery of MinD is quick. However, cooperativity in membrane binding is essential to the mathematical models reproducing E. coli Min oscillations, and there is more recent experimental evidence that E. coli MinD forms smaller oligomers that differ in their membrane residence time and diffusion (e.g., Heermann et al., Nat Methods (2023); Heermann et al., JMB (2020)). I would suggest the authors revise the corresponding text sections and test the multimerization in their simulation (see above).

      As mentioned above, simulating oligomerization is difficult, but in order to approximate related cooperative effects, we have simulated a situation whereby MinD dimers can form tetramers. This simulation did not show a large change in MinD gradient formation. We have added the result of this simulation to Fig. 8 (Fig. 8K), and discuss this further in lines 341-348 and 459-467.

      (3) Lines 75-76 and lines 79-80: The sentences "MinC ... and needs to bind to the Walker A-type ATPase MinD for its activity" and "The MinD dimer recruits MinC ... and stimulates its activity" are misleading. MinC is localized by MinD, but MinD does not alter MinC activity, as MinC mislocalization or overexpression also prevents FtsZ ring formation leading to minicell or filamentous cells, as also later described by the authors (line 98). There is also no biochemical evidence that the presence of MinD somehow alters MinC activity towards FtsZ other than a local enrichment on the membrane. I would rephrase the sentence to emphasize that MinD is only localizing MinC but does not alter its activity.   

      We have rephrased this sentence to prevent misinterpretation (lines 66-67).

      Minor points:  

      (1)  I am not quite sure what the experiment with the CCCP shows. The authors explain that MinD binding via the amphipathic helix requires the presence of membrane potential and that the addition of CCCP disturbs binding. They then show that the MinD with two amphipathic helices is not affected by CCCP but the wildtype MinD is. What is the conclusion of this experiment? Would that mean that the MinD with two amphipathic helices binds more strongly, very differently, perhaps non-physiologically?  

      This experiment was “To confirm that the tandem amphipathic helix increased the membrane affinity of MinD”, as mentioned in the beginning of the paragraph (line 224).  

      (2) Lines 456-457: Please cite the FRAP experiment that shows a quick recovery rate of MinD.

      Reference has been added. 

      (3) Figure 4D: It is unclear to me to which condition the p-value brackets point.

      This is related to a statistical t-test. We have added this information to the legend of the figure.

      (4) Line 111, "in the membrane affinity of the MinD". I think that the "the" before MinD should be removed.  

      Corrected

      (5) Typo in line 199 "indicting" instead of indicating.

      Corrected

      (6) Typo in line 220 "reversable" instead of reversible.

      Corrected

      (7) Lines 279, 284, 905: "Monte-Carlo" should read Monte Carlo.

      Corrected

      Reviewer #3 (Recommendations for the authors):  

      Introduction: As written, the introduction does not provide sufficient background for the uninitiated reader to understand the function of the MinCD complex in the context of assembly and activation of cell division in B. subtilis. The introduction is also quite long and would benefit from condensing the description of the Min oscillation mechanism in E. coli to one or two sentences. While highlighting the role of MinE in this system is important for understanding how it works, it is only needed as a counterpoint to the situation in B. subtilis.

      Since the Min system of E. coli is by far the best understood Min system, we feel that it is important to provide detailed information on this system. However, we have added an introductory sentence to explain the key function of the Min system (lines 46-48).

      Line 248: Increasing MinD membrane affinity increases the frequency of minicells - however it is unclear if cells are dividing too much or if it is just a Min mutant (i.e. occasionally dividing at the cell pole vs the middle)? Cell length measurements should be included to clarify this point (Figures 4 and 5).

      This information is presented in Fig. 5B (Cell length distribution), which is now Fig. 6B, indicating that the average cell length increases in the tandem alpha helix mutant, a phenotype that is comparable to a MinD knockout. 

      Figure 5: I am a bit confused as to whether increasing MinD affinity doesn't lead to a general block in division by MinCD rather than phenocopying a minD null mutant.

      Although the tandem alpha helix mutant has a cell length distribution comparable to a minD knockout, the tandem mutant produces far fewer minicells than the minD knockout, indicating that there is still some cell division regulation.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an excellent study by a superb investigator who discovered and is championing the field of migrasomes. This study contains a hidden "gem" - the induction of migrasomes by hypotonicity and how that happens. In summary, an outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.

      Strengths:

      Innovative approach at several levels. Migrasomes - discovered by Dr Yu's group - are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.

      Weaknesses:

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      We sincerely thank the reviewer for the encouraging and insightful comments. We fully agree that the fundamental aspects of migrasome biology are of great importance and deserve deeper exploration.

      In line with the reviewer’s suggestion, we have expanded our discussion on the basic biology of engineered migrasomes (eMigs). A recent study by the Okochi group at the Tokyo Institute of Technology demonstrated that hypoosmotic stress induces the formation of migrasome-like vesicles, involving cytoplasmic influx and requiring cholesterol for their formation (DOI: 10.1002/1873-3468.14816, February 2024). Building on this, our study provides a detailed characterization of hypoosmotic stress-induced eMig formation, and further compares the biophysical properties of natural migrasomes and eMigs. Notably, the inherent stability of eMigs makes them particularly promising as a vaccine platform.

      Finally, we would like to note that our laboratory continues to investigate multiple aspects of migrasome biology. In collaboration with our colleagues, we recently completed a study elucidating the mechanical forces involved in migrasome formation (DOI: 10.1016/j.bpj.2024.12.029), which further complements the findings presented here.

      Reviewer #2 (Public review):

      Summary:

      The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle in using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that either expressed a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a 2nd vaccination and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.

      Strengths:

      The authors present very robust studies detailing the biology behind migrasome formation and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done including thermal stability and characterization of the particle size (important characterizations for a good vaccine).

      Weaknesses:

      With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome-based vaccine could elicit responses comparable to a proven vaccine technology.

      We thank the reviewer for the thoughtful evaluation and constructive suggestions, which have helped us strengthen the manuscript. 

      Comparison with proven vaccine technologies:

      In response to the reviewer’s comment, we now include a direct comparison of the antibody responses elicited by eMig-Spike and a conventional recombinant S1 protein vaccine formulated with Alum. As shown in the revised manuscript (Author response image 1), the levels of S1-specific IgG induced by the eMig-based platform were comparable to those induced by the S1+Alum formulation. This comparison supports the potential of eMigs as a competitive alternative to established vaccine platforms. 

      Author response image 1.

      eMigrasome-based vaccination showed similar efficacy compared with adjuvanted recombinant spike protein. The amount of S1-specific IgG in mouse serum was quantified by ELISA on day 14 after immunization. Mice were either intraperitoneally (i.p.) immunized with recombinant Alum/S1 or intravenously (i.v.) immunized with eM-NC, eM-S or recombinant S1. The administered doses were 20 µg/mouse for eMigrasomes, 10 µg/mouse (i.v.) or 50 µg/mouse (i.p.) for recombinant S1 and 50 µl/mouse for Aluminium adjuvant.

      Assessment of antigen integrity on migrasomes:

      To address the reviewer’s suggestion regarding antigen integrity, we performed immunoblotting using antibodies against both S1 and mCherry. Two distinct bands were observed: one at the expected molecular weight of the S-mCherry fusion protein, and a higher molecular weight band that may represent oligomerized or higher-order forms of the Spike protein (Figure 5b in the revised manuscript).

      Furthermore, we performed confocal microscopy using a monoclonal antibody against Spike (anti-S). Co-localization analysis revealed strong overlap between the mCherry fluorescence and anti-Spike staining, confirming the proper presentation and surface localization of intact S-mCherry fusion protein on eMigs (Figure 5c in the revised manuscript). These results confirm the structural integrity and antigenic fidelity of the Spike protein expressed on eMigs.

      Recommendations for the authors

      Reviewer #1 (Recommendations For The Authors):

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      I know that the reviewers always ask for more, and this is not the case here. Can the abstract and title be changed to emphasize the science behind migrasome formation, and possibly add a few more fundamental aspects on how hypotonic shock induces migrasomes?

      Alternatively, if the authors desire to maintain the emphasis on vaccines, can immunological mechanisms be somewhat expanded in order to - at least to some extent - explain why migrasomes are a better vaccine vehicle?

      One way or another, this reviewer is highly supportive of this study and it is really up to the authors and the editor to decide whether my comments are of use or not.

      My recommendation is to go ahead with publishing after some adjustments as per above.

      We’d like to thank the reviewer for the suggestion. We have changed the title of the manuscript and modified the abstract, emphasizing the fundamental science behind the development of eMigrasomes. To gain some immunological information on eMig-elicited antibody responses, we characterized the type of IgG induced by eM-OVA in mice, and compared it to that induced by Alum/OVA. The IgG response to Alum/OVA was dominated by IgG1. In contrast, eM-OVA induced an even distribution of IgG subtypes, including IgG1, IgG2b, IgG2c, and IgG3 (Figure 4i in the revised manuscript). The ratio between IgG1 and IgG2a/c indicates a Th1 or Th2 type humoral immune response. Thus, eM-OVA immunization induces a balance of Th1/Th2 immune responses.

      Reviewer #2 (Recommendations For The Authors):

      The study is a very nice exploration of a new vaccine platform. This reviewer believes that a more head-to-head comparison to the current SARS-CoV-2 vaccine platform would improve the manuscript. This comparison is done with OVA antigen, but this model antigen is not as exciting as a functional head-to-head with a SARS-CoV-2 vaccine.

      I think that two other discussion points should be included in the manuscript. First, was the host-cell protein evaluated? If not, I would include that point on how issues of host cell contamination of the migrasome could play a role in the responses and safety of a vaccine. Second, I would discuss antigen incorporation and localization into the platform. For example, the full-length spike being expressed has a native signal peptide and transmembrane domain. The authors point out that a transmembrane domain can be added to display an antigen that does not have one natively expressed, however, without a signal peptide this would not be secreted and localized properly. I would suggest adding a discussion of how a non-native signal peptide would be necessary in addition to a transmembrane domain.

      We thank the reviewer for these thoughtful suggestions and fully agree that the points raised are important for the translational development of eMig-based vaccines.

      (1) Host cell proteins and potential immunogenicity:

      We appreciate the reviewer’s suggestion to consider host cell protein contamination. Considering potential clinical application of eMigrasomes in the future, we will use human cells with low immunogenicity such as HEK-293 or embryonic stem cells (ESCs) to generate eMigrasomes. Also, we will follow a QC that meets the standard of validated EV-based vaccination techniques. 

      (2) Antigen incorporation and localization—signal peptide and transmembrane domain:

      We also agree with the reviewer’s point that proper surface display of antigens on eMigs requires both a transmembrane domain and a signal peptide for correct trafficking and membrane anchoring. For instance, in the case of full-length Spike protein, the native signal peptide and transmembrane domain ensure proper localization to the plasma membrane and subsequent incorporation into eMigs. In the case of OVA, a secretory protein that contains a native signal peptide yet lacks a transmembrane domain, an engineered transmembrane domain is required. For antigens that do not naturally contain these features, both a non-native signal peptide and an artificial transmembrane domain are necessary. We have clarified this point in the revised discussion and explicitly noted the requirement for a signal peptide when engineering antigens for surface display on migrasomes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary

      This work performed Raman spectral microscopy at the single-cell level for 15 different culture conditions in E. coli. The Raman signature is systematically analyzed and compared with the proteome dataset of the same culture conditions. With a linear model, the authors revealed correspondence between Raman pattern and proteome expression stoichiometry indicating that spectrometry could be used for inferring proteome composition in the future. With both Raman spectra and proteome datasets, the authors categorized co-expressed genes and illustrated how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase-dependent genes. Overall, the authors demonstrate a strong and solid data analysis scheme for the joint analysis of Raman and proteome datasets.

      Strengths and major contributions

      (1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions.

      (2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.

      Weaknesses

      The experimental measurements of Raman microscopy were conducted at the single-cell level; however, the analysis was performed by averaging across the cells. The authors did not discuss whether Raman microscopy can be used to detect cell-to-cell variability under the same condition.

      We thank the reviewer for raising this important point. Though this topic is beyond the scope of our study, some of our authors have addressed the application of single-cell Raman spectroscopy to characterizing phenotypic heterogeneity in individual Staphylococcus aureus cells in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718). Additionally, one of our authors demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, detecting cell-to-cell variability under the same conditions has been shown to be feasible. Whether averaging single-cell Raman spectra is necessary depends on the type of analysis and the available dataset. We will discuss this in more detail in our response to Comment (1) by Reviewer #1 (Recommendation for the authors).

      Discussion and impact on the field

      Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. It has the advantage that single-cell level data could be acquired and both in vivo and in vitro data can be compared. This work is a strong initiative for introducing the powerful technique to systems biology and providing a rigorous pipeline for future data analysis.

      Reviewer #2 (Public review):

      Summary and strengths:

      Kamei et al. observe the Raman spectra of a population of single E. coli cells in diverse growth conditions. Using LDA, Raman spectra for the different growth conditions are separated. Using previously available protein abundance data for these conditions, a linear mapping from Raman spectra in LDA space to protein abundance is derived. Notably, this linear map is condition-independent and is consequently shown to be predictive for held-out growth conditions. This is a significant result and in my understanding extends the earlier Raman to RNA connection that has been reported earlier.

      They further show that this linear map reveals something akin to bacterial growth laws (à la Scott/Hwa): a certain collection of proteins shows stoichiometric conservation, i.e. the group (called SCG - stoichiometrically conserved group) maintains its stoichiometry across conditions while the overall scale depends on the conditions. Analyzing the changes in protein mass and Raman spectra under these conditions, the abundance ratios of information processing proteins (one of the large groups where many proteins belong to "information and storage" - ISP that is also identified as a cluster of orthologous proteins) remain constant. The mass of these proteins, deemed the homeostatic core, increases linearly with growth rate. Other SCGs and other proteins are condition-specific.

      Notably, beyond the ISP COG the other SCGs were identified directly using the proteome data. Taking the analysis further, they then show how the centrality of a protein - roughly measured as how many proteins it is stoichiometric with - relates to function and evolutionary conservation. Again significant results, but I am not sure if these ideas have been reported earlier, for example from the community that built protein-protein interaction maps.

      As pointed out, past studies have revealed that the function, essentiality, and evolutionary conservation of genes are linked to the topology of gene networks, including protein-protein interaction networks. However, to the best of our knowledge, their linkage to stoichiometry conservation centrality of each gene has not yet been established.

      Previously analyzed networks, such as protein-protein interaction networks, depend on known interactions. Therefore, as our understanding of the molecular interactions evolves with new findings, the conclusions may change. Furthermore, analysis of a particular interaction network cannot account for effects from different types of interactions or multilayered regulations affecting each protein species.

      In contrast, the stoichiometry conservation network in this study focuses solely on expression patterns as the net result of interactions and regulations among all types of molecules in cells. Consequently, the stoichiometry conservation networks are not affected by the detailed knowledge of molecular interactions and naturally reflect the global effects of multilayered interactions. Additionally, stoichiometry conservation networks can easily be obtained for non-model organisms, for which detailed molecular interaction information is usually unavailable. Therefore, analysis with the stoichiometry conservation network has several advantages over existing methods from both biological and technical perspectives.

      We added a paragraph explaining this important point to the Discussion section, along with additional literature.

      Finally, the paper built a lot of "machinery" to connect Ω<sub>LE</sub>, built directly from proteome, and Ω<sub>B</sub>, built from Raman, spaces. I am unsure how that helps and have not been able to digest the 50 or so pages devoted to this.

      The mathematical analyses in the supplementary materials form the basis of the argument in the main text. Without the rigorous mathematical discussions, Fig. 6E — one of the main conclusions of this study — and Fig. 7 could never be obtained. Therefore, we believe the analyses are essential to this study. However, we clarified why each analysis is necessary and significant in the corresponding sections of the Results to improve the manuscript's readability.

      Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).

      Strengths:

      The rigorous analysis of the data is the real strength of the paper. Alongside this, the discovery of SCGs that are condition-independent and that are condition-dependent provides a great framework.

      Weaknesses:

      Overall, I think it is an exciting advance but some work is needed to present the work in a more accessible way.

      We edited the main text to make it more accessible to a broader audience. Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).

      Reviewer #1 (Recommendations for the authors):

      (1) The Raman spectral data is measured from single-cell imaging. In the current work, most of the conclusions are from averaged data. From my understanding, once the correspondence between LDA and proteome data is established (i.e. the matrix B) one could infer the single-cell proteome composition from B. This would provide valuable information on how proteome composition fluctuates at the single-cell level.

      We can calculate single-cell proteomes from single-cell Raman spectra in the manner suggested by the reviewer. However, we cannot evaluate the accuracy of their estimation without single-cell proteome data under the same environmental conditions. Likewise, we cannot verify variations of estimated proteomes of single cells. Since quantitatively accurate single-cell proteome data is unavailable, we concluded that addressing this issue was beyond the scope of this study.

      Nevertheless, we agree with the reviewer that investigating how proteome composition fluctuates at the single-cell level based on single-cell Raman spectra is an intriguing direction for future research. In this regard, some of our authors have studied the phenotypic heterogeneity of Staphylococcus aureus cells using single-cell Raman spectra in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718), and one of our authors has demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, it is highly plausible that single-cell Raman spectroscopy can also characterize proteomic fluctuations in single cells. We have added a paragraph to the Discussion section to highlight this important point.

      (2) The establishment of matrix B is quite confusing for readers who only read the main text. I suggest adding a flow chart in Figure 1 to explain the data analysis pipeline, as well as state explicitly what is the dimension of B, LDA matrix, and proteome matrix.

      We thank the reviewer for the suggestion. Following the reviewer's advice, we have explicitly stated the dimensions of the vectors and matrices in the main text. We have also added descriptions of the dimensions of the constructed spaces. Rather than adding another flow chart to Figure 1, we added a new table (Table 1) to explain the various symbols representing vectors and matrices, thereby improving the accessibility of the explanation.

      (3) One of the main contributions for this work is to demonstrate how proteome stoichiometry is regulated across different conditions. A total of m=15 conditions were tested in this study, and this limits the rank of LDA matrix as 14. Therefore, maximally 14 "modes" of differential composition in a proteome can be detected.

      As a general reader, I am wondering in the future if one increases or decreases the number of conditions (say m=5 or m=50) what information can be extracted? It is conceivable that increasing different conditions with distinct cellular physiology would be beneficial to "explore" different modes of regulation for cells. As proof of principle, I am wondering if the authors could test a lower number (by sub-sampling from m=15 conditions, e.g. picking five of the most distinct conditions) and see how this would affect the prediction of proteome stoichiometry inference.
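
      The rank bound mentioned in comment (3) can be made explicit. The following is a short derivation assuming the standard Fisher formulation of LDA, in which the discriminant axes come from the between-class scatter matrix; the symbols n<sub>k</sub> (number of spectra in condition k), μ<sub>k</sub> (mean spectrum of condition k), μ (overall mean) and S<sub>B</sub> are introduced here for illustration and are not taken from the manuscript.

```latex
S_B = \sum_{k=1}^{m} n_k \,(\mu_k - \mu)(\mu_k - \mu)^{\top},
\qquad
\mu = \frac{1}{N} \sum_{k=1}^{m} n_k \,\mu_k,
\qquad
N = \sum_{k=1}^{m} n_k .

\sum_{k=1}^{m} n_k \,(\mu_k - \mu) = 0
\;\;\Longrightarrow\;\;
\operatorname{rank}(S_B) \le m - 1 = 14 \quad \text{for } m = 15 .
```

      The m centered class means satisfy one linear constraint, so they span at most m - 1 directions, which is why subsampling to fewer conditions also reduces the number of available discriminant axes.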

      We thank the reviewer for bringing an important point to our attention. To address the issue raised, we conducted a new subsampling analysis (Fig. S14).

      As we described in the main text (Fig. 6E) and the supplementary materials, the m x m orthogonal matrix, Θ, represents to what extent the two spaces Ω<sub>LE</sub> and Ω<sub>B</sub> are similar (m is the number of conditions; in our main analysis, m = 15). Thus, the low-dimensional correspondence between the two spaces connected by an orthogonal transformation, such as an m-dimensional rotation, can be evaluated by examining the elements of the matrix Θ. Specifically, large off-diagonal elements of the matrix Θ mix higher dimensions and lower dimensions, making the two spaces spanned by the first few major axes appear dissimilar. Based on this property, we evaluated the vulnerability of the low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> to the reduced number of conditions by measuring how close Θ was to the identity matrix when the analysis was performed on the subsampled datasets.

      In the new figure (Fig. S14), we first created all possible smaller condition sets by subsampling the conditions. Next, to evaluate the closeness between the matrix Θ and the identity matrix for each smaller condition set, we generated 10,000 random orthogonal matrices of the same size as Θ. We then evaluated the probability of obtaining a higher level of low-dimensional correspondence than that of the experimental data by chance (see section 1.8 of the Supplementary Materials). This analysis was already performed in the original manuscript for the non-subsampled case (m = 15) in Fig. S9C; the new analysis systematically evaluates the correspondence for the subsampled datasets.

      The results clearly show that low-dimensional correspondence is more likely to be obtained with more conditions (Fig. S14). In particular, when the number of conditions used in the analysis exceeds five, the median of the probability that random orthogonal matrices were closer to the identity matrix than the matrix Θ calculated from subsampled experimental data became lower than 10<sup>-4</sup>. This analysis provides insight into the number of conditions required to find low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub>.
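      The chance-level comparison described above can be sketched in a few lines. This is a minimal illustration, not the procedure of section 1.8 of the Supplementary Materials: the closeness score (`low_dim_score`, the squared weight retained in the top-left k x k block of Θ), the choice of k, and all function names are assumptions introduced here for illustration only.

```python
import numpy as np

def random_orthogonal(m, rng):
    """Draw an m x m orthogonal matrix uniformly (Haar measure) via QR."""
    a = rng.standard_normal((m, m))
    q, r = np.linalg.qr(a)
    # Fix the column signs so the result does not depend on the QR sign convention.
    d = np.sign(np.diag(r))
    d[d == 0] = 1.0
    return q * d

def low_dim_score(theta, k):
    """Hypothetical closeness score: squared weight kept in the top-left k x k block.

    The score equals k when the first k axes are not mixed with the remaining axes
    and decreases as off-diagonal elements mix low and high dimensions.
    """
    return float(np.sum(theta[:k, :k] ** 2))

def chance_probability(theta, k, n_random=10_000, seed=0):
    """Fraction of random orthogonal matrices scoring at least as high as theta."""
    rng = np.random.default_rng(seed)
    m = theta.shape[0]
    observed = low_dim_score(theta, k)
    null = np.array(
        [low_dim_score(random_orthogonal(m, rng), k) for _ in range(n_random)]
    )
    return float(np.mean(null >= observed))
```

      Under this assumed score, a Θ equal to the identity matrix is essentially never matched by a random orthogonal matrix, whereas a Θ that sends the leading axes entirely into the trailing ones is matched by every random draw.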

      Which conditions are used in the analysis can change the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>. Therefore, it is important to clarify whether including more conditions in the analysis reduces the dependence of the low-dimensional structures on conditions. We leave this issue as a subject for future study. This issue relates to the effective dimensionality of omics profiles needed to establish the diverse physiological states of cells across conditions. Determining the minimum number of conditions to attain the condition-independent low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub> would provide insight into this fundamental problem. Furthermore, such an analysis would identify the range of applications of Raman spectra as a tool for capturing macroscopic properties of cells at the system level.

      We now discuss this point in the Discussion section, referring to this analysis result (Fig. S14). Please also see our reply to the comment (1) by Reviewer #2 (Recommendations for the authors).

      (4) In E. coli cells, total proteome is in mM concentration while the total metabolites are between 10 to 100 mM concentration. Since proteins are large molecules with more functional groups, they may contribute to more Raman signal (per molecules) than metabolites. Still, the meaningful quantity here is the "differential Raman signal" with different conditions, not the absolute signal. I am wondering how much percent of differential Raman signature are from proteome and how much are from metabolome.

      It is an important and interesting question to what extent changes in the proteome and metabolome contribute to changes in Raman spectra. Though we concluded that answering this question is beyond the scope of this study, we believe it is an important topic for future research.

      Raman spectral patterns convey the comprehensive molecular composition spanning the various omics layers of target cells. Changes in the composition of these layers can be highly correlated, and identifying their contributions to changes in Raman spectra would provide insight into the mutual correlation of different omics layers. Addressing the issue raised by the reviewer would expand the applications of Raman spectroscopy and highlight the advantage of cellular Raman spectra as a means of capturing comprehensive multi-omics information.

      We note that some studies have evaluated the contributions of proteins, lipids, nucleic acids, and glycogen to the Raman spectra of mammalian cells and how these contributions change in different states (e.g., Mourant et al., J Biomed Opt, 10(3), 031106, 2005). Additionally, numerous studies have imaged or quantified metabolites in various cell types (see, for example, Cutshaw et al., Chemical Reviews, 123(13), 8297–8346, 2023, for a comprehensive review). Extending these approaches to multiple omics layers in future studies would help resolve the issue raised by the reviewer.

      (5) It is known that E. coli cells in different conditions have different cell sizes, where cell width increases with carbon source quality and growth rate. Is this effect normalized when processing the Raman signal?

      Each spectrum was normalized by subtracting the average and dividing it by the standard deviation. This normalization minimizes the differences in signal intensities due to different cell sizes and densities. This information is shown in the Materials and Methods section of the Supplementary Materials.
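      As a minimal sketch of this per-spectrum normalization (the function name is hypothetical, and the use of the population standard deviation rather than the sample standard deviation is an assumption; the Materials and Methods section remains the authoritative description):

```python
import numpy as np

def standardize_spectrum(spectrum):
    """Subtract the mean of a single spectrum and divide by its standard deviation."""
    s = np.asarray(spectrum, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    return (s - s.mean()) / s.std()

# A spectrum and a rescaled, offset copy of it (e.g., a brighter signal from a larger cell).
raw = np.array([1.0, 2.0, 4.0, 3.0])
brighter = 10.0 * raw + 5.0
same_shape = np.allclose(standardize_spectrum(raw), standardize_spectrum(brighter))
```

      Because a constant offset and a positive overall gain are both removed, two spectra that differ only in overall intensity map to the same normalized spectrum, so `same_shape` is `True` here.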

      (6) I have a question about interpretation of the centrality index. A higher centrality indicates the protein expression pattern is more aligned with the "mainstream" of the other proteins in the proteome. However, it is possible that the proteome has multiple "mainstream modes" (with possibly different contributions in magnitudes), and the centrality seems to only capture the "primary mode". A small group of proteins could all have low centrality but have very consistent patterns with high conservation of stoichiometry. I am wondering if the authors could discuss and clarify this.

      We thank the reviewer for drawing our attention to the insufficient explanation in the original manuscript. First, we note that stoichiometry-conserving protein groups are not limited to those composed of proteins with high stoichiometry conservation centrality. SCGs 2–5 are composed of proteins that strongly conserve stoichiometry within each group but have low stoichiometry conservation centrality (Fig. 5A, 5K, 5L, and 7A). In other words, our results demonstrate the existence of the "primary mainstream mode" (SCG 1, i.e., the homeostatic core) and condition-specific "non-primary mainstream modes" (SCGs 2–5). These primary and non-primary modes are distinguishable by their position along the axis of stoichiometry conservation centrality (Fig. 5A, 5K, and 5L).

      However, a single one-dimensional axis (centrality) cannot capture all characteristics of stoichiometry-conserving architecture. In our case, the "non-primary mainstream modes" (SCGs 2–5) were distinguished from each other by multiple csLE axes.

      To clarify this point, we modified the first paragraph of the section where we first introduce csLE (Revealing global stoichiometry conservation architecture of the proteomes with csLE). We also added a paragraph to the Discussion section regarding the condition-specific SCGs 2–5.

      (7) Figures 3, 4, and 5A-I are analyses on proteome data and are not related to Raman spectral data. I am wondering if this part of the analysis can be re-organized and not disrupt the mainline of the manuscript.

      We agree that the structure of this manuscript is complicated. Before submitting this manuscript to eLife, we seriously considered reorganizing it. However, we concluded that this structure was most appropriate because our focus on stoichiometry conservation cannot be explained without analyzing the coefficients of the Raman-proteome correspondence using COG classification (see Fig. 3; note that Fig. 3A relates to Raman data). This analysis led us to examine the global stoichiometry conservation architecture of proteomes (Figs. 4 and 5) and discover the unexpected similarity between the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>.

      Therefore, we decided to keep the structure of the manuscript as it is. To partially resolve this issue, however, we added references to Fig. S1, the diagram of this paper’s mainline, to several places in the main text so that readers can more easily grasp the flow of the manuscript.

      (8) Supplementary Equation (2.6) could be wrong. From my understanding of the coordinate transformation definition here, it should be [w1 ... ws] X := RHS terms in big parenthesis.

      We checked the equation and confirmed that it is correct.

      Reviewer #2 (Recommendations for the authors):

      (1) The first main result or linear map between raman and proteome linked via B is intriguing in the sense that the map is condition-independent. A speculative question I have is if this relationship may become more complex or have more condition-dependent corrections as the number of conditions goes up. The 15 or so conditions are great but it is not clear if they are often quite restrictive. For example, they assume an abundance of most other nutrients. Now if you include a growth rate decrease due to nitrogen or other limitations, do you expect this to work?

      In our previous paper (Kobayashi-Kirschvink et al., Cell Systems 7(1): 104–117.e4, 2018), we statistically demonstrated a linear correspondence between cellular Raman spectra and transcriptomes for fission yeast under 10 environmental conditions. These conditions included nutrient-rich and nutrient-limited conditions, such as nitrogen limitation. Since the Raman-transcriptome correspondence was only statistically verified in that study, we analyzed the data from the standpoint of stoichiometry conservation in this study. The results (Fig. S11 and S12) revealed a correspondence in lower dimensions similar to that observed in our main results. In addition, similar correspondences were obtained even for different E. coli strains under common culture conditions (Fig. S11 and S12). Therefore, it is plausible that the stoichiometry-conservation low-dimensional correspondence between Raman and gene expression profiles holds for a wide range of external and internal perturbations.

      We agree with the reviewer that it is important to understand how Raman-omics correspondences change with the number of conditions. To address this issue, we examined how the correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> changes by subsampling the conditions used in the analysis. We focused on Θ, which was introduced in Fig. 5E, because the closeness of Θ to the identity matrix represents correspondence precision. We found a general trend that the low-dimensional correspondence becomes more precise as the number of conditions increases (Fig. S14). This suggests that increasing the number of conditions generally improves the correspondence rather than disrupting it.

      We added a paragraph to the Discussion section addressing this important point. Please also refer to our response to Comment (3) of Reviewer #1 (Recommendations for the authors).

      (2) A little more explanation in the text for 3C/D would help. I am imagining 3D is the control for 3C. Minor comment - 3B looks identical to S4F but the y-axis label is different.

      We thank the reviewer for pointing out the insufficient explanation of Fig. 3C and 3D in the main text. Following this advice, we added explanations of these plots to the main text. We also added labels ("ISP COG class" and "non-ISP COG class") to the top of these two figures.

      Fig. 3B and S4F are different. For simplicity, we used the Pearson correlation coefficient in Fig. 3B. However, cosine similarity is a more appropriate measure for evaluating the degree of conservation of abundance ratios. Thus, we presented the result using cosine similarity in a supplementary figure (Fig. S4F). Please note that each point in Fig. S4F is calculated between proteome vectors of two conditions. The dimension of each proteome vector is the number of genes in each COG class.
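      The distinction between the two measures can be illustrated with a small sketch (the vectors below are made-up toy values, not data from this study, and the function names are hypothetical). The Pearson correlation coefficient is the cosine similarity of the mean-centered vectors, so it is insensitive to a constant shift that does change abundance ratios, whereas cosine similarity equals 1 only when the two vectors are proportional.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two abundance vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_correlation(x, y):
    """Pearson correlation, written as the cosine similarity of centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return cosine_similarity(x - x.mean(), y - y.mean())

# Toy abundance vector for one hypothetical COG class in a reference condition.
reference = np.array([1.0, 2.0, 4.0])
scaled = 3.0 * reference     # same abundance ratios, higher total amount
shifted = reference + 10.0   # ratios changed by a constant offset

cos_scaled = cosine_similarity(reference, scaled)
cos_shifted = cosine_similarity(reference, shifted)
pearson_shifted = pearson_correlation(reference, shifted)
```

      In this toy case, `cos_scaled` and `pearson_shifted` both equal 1 up to floating-point error, while `cos_shifted` is below 1, so only cosine similarity registers that the shifted vector no longer conserves the abundance ratios.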

      (3) Can we see a log-log version of 4C to see how the low-abundant proteins are behaving? In fact, the same is in part true for Figure 3A.

      We added the semi-log version of the graph for SCG1 (the homeostatic core) in Fig. 4C to make low-abundant proteins more visible. Please note that the growth rates under the two stationary-phase conditions were zero; therefore, plotting this graph in log-log format is not possible.

      Fig. 3A cannot be shown as a log-log plot because many of the coefficients are negative. The insets in the graphs clarify the points near the origin.

      (4) In 5L, how should one interpret the other dots that are close to the center but not part of the SCG1? And this theme continues in 6ACD and 7A.

      The SCGs were obtained by setting a cosine similarity threshold. Therefore, proteins that are close to SCG 1 (the homeostatic core) but do not belong to it have a cosine similarity below the threshold with any protein in SCG 1. Fig. 7 illustrates the expression patterns of the proteins in question.

      (5) Finally, I do not fully appreciate the whole analysis of connecting Ω<sub>csLE</sub> and Ω<sub>B</sub> and plots in 6 and 7. This corresponds to a lot of linear algebra in the 50 or so pages in section 1.8 in the supplementary. If the authors feel this is crucial in some way it needs to be better motivated and explained. I philosophically appreciate developing more formalism to establish these connections but I did not understand how this (maybe even if in the future) could lead to a new interpretation or analysis or theory.

      The mathematical analyses included in the supplementary materials are important for readers who are interested in understanding the mathematics behind our conclusions. However, we also thought these arguments were too detailed for many readers when preparing the original submission and decided to show them in the supplemental materials.

      To better explain the motivation behind the mathematical analyses, we revised the section “Representing the proteomes using the Raman LDA axes”.

      Please also see our reply to the comment (6) by Reviewer #2 (Recommendations for the authors) below.

      (6) Along the lines of the previous point, there seems to be two separate points being made: a) there is a correspondence between Raman and proteins, and b) we can use the protein data to look at centrality, generality, SCGs, etc. And the two don't seem to be linked until the formalism of the Ωs?

      The reviewer is correct that we can calculate and analyze some of the quantities introduced in this study, such as stoichiometry conservation centrality and expression generality, without Raman data. However, it is difficult to justify introducing these quantities without analyzing the correspondence between the Raman and proteome profiles. Moreover, the definition of expression generality was derived from the analysis of Raman-proteome correspondence (see section 2.2 of the Supplementary Materials). Therefore, point b) cannot stand alone without point a) from its initial introduction.

      To partially improve the readability and resolve the issue of complicated structure of this manuscript, we added references to Fig. S1, which is a diagram of the paper’s mainline, to several places in the main text. Please also see our reply to the comment (7) by Reviewer #1 (Recommendations for the authors).

    1. Author response:

      We would like to thank the three Reviewers for their thoughtful comments and detailed feedback. We are pleased to hear that the Reviewers found our paper to be “providing more direct evidence for the role of signals in different frequency bands related to predictability and surprise” (R1), “well-suited to test evidence for predictive coding versus alternative hypotheses” (R2), and “timely and interesting” (R3).

      We perceive that the reviewers have an overall positive impression of the experiments and analyses, but find the text somewhat dense and would like to see additional statistical rigor, as well as in some cases additional analyses to be included in supplementary material. We therefore here provide a provisional letter addressing revisions we have already performed and outlining the revision we are planning point-by-point. We begin each enumerated point with the Reviewer’s quoted text and our responses to each point are made below.

      Reviewer 1:

      (1) Introduction:

      The authors write in their introduction: "H1 further suggests a role for θ oscillations in prediction error processing as well." Without being fleshed out further, it is unclear what role this would be, or why. Could the authors expand this statement?”

      We have edited the text to indicate that theta-band activity has been related to prediction error processing as an empirical observation, and must regrettably leave drawing inferences about its functional role to future work, with experiments designed specifically to draw out theta-band activity.

      (2) Limited propagation of gamma band signals:

      Some recent work (e.g. https://www.cell.com/cell-reports/fulltext/S2211-1247(23)00503-X) suggests that gamma-band signals reflect mainly entrainment of the fast-spiking interneurons, and don't propagate from V1 to downstream areas. Could the authors connect their findings to these emerging findings, suggesting no role in gamma-band activity in communication outside of the cortical column?”

      We have not specifically claimed that gamma propagates between columns/areas in our recordings, only that it synchronizes synaptic current flows between laminar layers within a column/area. We nonetheless suggest that gamma can locally synchronize a column, and potentially local columns within an area via entrainment of local recurrent spiking, to update an internal prediction/representation upon onset of a prediction error. We also point the Reviewer to our Discussion section, where we state that our results fit with a model “whereby θ oscillations synchronize distant areas, enabling them to exchange relevant signals during cognitive processing.” In our present work, we therefore remain agnostic about whether theta or gamma or both (or alternative mechanisms) are at play in terms of how prediction error signals are transmitted between areas.

      (3) Paradigm:

      While I agree that the paradigm tests whether a specific type of temporal prediction can be formed, it is not a type of prediction that one would easily observe in mice, or even humans. The regularity that must be learned, in order to be able to see a reflection of predictability, integrates over 4 stimuli, each shown for 500 ms with a 500 ms blank in between (and a 1000 ms interval separating the 4th stimulus from the 1st stimulus of the next sequence). In other words, the mouse must keep in working memory three stimuli, which partly occurred more than a second ago, in order to correctly predict the fourth stimulus (and signal a 1000 ms interval as evidence for starting a new sequence).

      A problem with this paradigm is that positive findings are easier to interpret than negative findings. If mice do not show a modulation to the global oddball, is it because "predictive coding" is the wrong hypothesis, or simply because the authors generated a design that operates outside of the boundary conditions of the theory? I think the latter is more plausible. Even in more complex animals, (eg monkeys or humans), I suspect that participants would have trouble picking up this regularity and sequence, unless it is directly task-relevant (which it is not, in the current setting). Previous experiments often used simple pairs (where transitional probability was varied, eg, Meyer and Olson, PNAS 2012) of stimuli that were presented within an intervening blank period. Clearly, these regularities would be a lot simpler to learn than the highly complex and temporally spread-out regularity used here, facilitating the interpretation of negative findings (especially in early cortical areas, which are known to have relatively small temporal receptive fields).

      I am, of course, not asking the authors to redesign their study. I would like to ask them to discuss this caveat more clearly, in the Introduction and Discussion, and situate their design in the broader literature. For example, Jeff Gavornik has used much more rapid stimulus designs and observed clear modulations of spiking activity in early visual regions. I realize that this caveat may be more relevant for the spiking paper (which does not show any spiking activity modulation in V1 by global predictability) than for the current paper, but I still think it is an important general caveat to point out.”

      We appreciate the Reviewer’s concern about working memory limitations in mice. Our paradigm and training followed on from previous paradigms such as Gavornik and Bear (2014), in which predictive effects were observed in mouse V1 with presentation times of 150ms and interstimulus intervals of 1500ms. In addition, we note that Jamali et al. (2024) recently utilized a similar global/local paradigm in the auditory domain with inter-sequence intervals as long as 28-30 seconds, and still observed effects of a predicted sequence (https://elifesciences.org/articles/102702). For the revised manuscript, we plan to expand on this in the Discussion section.

      That being said, as the Reviewer also pointed out, this would be a greater concern had we not found any positive findings in our study. However, even with the rather long sequence periods we used, we did find positive evidence for predictive effects, supporting the use of our current paradigm. We agree with the reviewer that these positive effects are easier to interpret than negative effects, and plan to expand upon this in the Discussion when we resubmit.

      (4) Reporting of results:

      I did not see any quantification of the strength of evidence of any of the results, beyond a general statement that all reported results pass significance at an alpha=0.01 threshold. It would be informative to know, for all reported results, what exactly the p-value of the significant cluster is; as well as for which performed tests there was no significant difference.”

      For the revised manuscript, we can include the p-values after cluster-based testing for each significant cluster, as well as show data that passes a more stringent threshold of p<0.001 (1/1000) or p<0.005 (1/200) rather than our present p<0.01 (1/100).

      (5) Cluster test:

      The authors use a three-dimensional cluster test, clustering across time, frequency, and location/channel. I am wondering how meaningful this analytical approach is. For example, there could be clusters that show an early difference at some location in low frequencies, and then a later difference in a different frequency band at another (adjacent) location. It seems a priori illogical to me to want to cluster across all these dimensions together, given that this kind of clustering does not appear neurophysiologically implausible/not meaningful. Can the authors motivate their choice of three-dimensional clustering, or better, facilitating interpretability, cluster eg at space and time within specific frequency bands (2d clustering)?”

      We are happy to include a 3D plot of a time-channel-frequency cluster in the revised manuscript to clarify our statistical approach for the reviewer. We consider our current three-dimensional cluster-testing an “unsupervised” way of uncovering significant contrasts with no theory-driven assumptions about which bounded frequency bands or layers do what.

      Reviewer 2:

      Sennesh and colleagues analyzed LFP data from 6 regions of rodents while they were habituated to a stimulus sequence containing a local oddball (xxxy) and later exposed to either the same (xxxY) or a deviant global oddball (xxxX). Subsequently, they were exposed to a controlled random sequence (XXXY) or a controlled deterministic sequence (xxxx or yyyy). From these, the authors looked for differences in spectral properties (both oscillatory and aperiodic) between three contrasts (only for the last stimulus of the sequence).

      (1) Deviance detection: unpredictable random (XXXY) versus predictable habituation (xxxy)

      (2) Global oddball: unpredictable global oddball (xxxX) versus predictable deterministic (xxxx), and

      (3) "Stimulus-specific adaptation:" locally unpredictable oddball (xxxY) versus predictable deterministic (yyyy).

      They found evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards the "predictive routing" scheme.

      While the dataset and analyses are well-suited to test evidence for predictive coding versus alternative hypotheses, I felt that the formulation was ambiguous, and the results were not very clear. My major concerns are as follows:”

      We appreciate the reviewer’s concerns and outline how we will address them below:

      (1) The authors set up three competing hypotheses, in which H1 and H2 make directly opposite predictions. However, it must be noted that H2 is proposed for spatial prediction, where the predictability is computed from the part of the image outside the RF. This is different from the temporal prediction that is tested here. Evidence in favor of H2 is readily observed when large gratings are presented, for which there is substantially more gamma than in small images. Actually, there are multiple features in the spectral domain that should not be conflated, namely (i) the transient broadband response, which includes all frequencies, (ii) contribution from the evoked response (ERP), which is often in frequencies below 30 Hz, (iii) narrow-band gamma oscillations which are produced by large and continuous stimuli (which happen to be highly predictive), and (iv) sustained low-frequency rhythms in theta and alpha/beta bands which are prominent before stimulus onset and reduce after ~200 ms of stimulus onset. The authors should be careful to incorporate these in their formulation of PC, and in particular should not conflate narrow-band and broadband gamma.”

      We have clarified in the manuscript that while the gamma-as-prediction hypothesis (our H2) was originally proposed in a spatial prediction domain, further work (specifically Singer (2021)) has extended the hypothesis to cover temporal-domain predictions as well.

      To address the reviewer’s point about multiple features in the spectral domain: Our analysis has specifically separated aperiodic components using FOOOF analysis (Supp. Fig. 1) and explicitly fit and tested aperiodic vs. periodic components (Supp. Figs 1&2). We did not find strong effects in the aperiodic components but did in the periodic components (Supp. Fig. 2), allowing us to be more confident in our conclusions in terms of genuine narrow-band oscillations. In the revised manuscript, we will include analysis of the pre-stimulus time window to address the reviewer’s point (iv) on sustained low frequency oscillations.

      (2) My understanding is that any aspect of predictive coding must be present before the onset of stimulus (expected or unexpected). So, I was surprised to see that the authors have shown the results only after stimulus onset. For all figures, the authors should show results from -500 ms to 500 ms instead of zero to 500 ms.

      In our revised manuscript we will include a pre-stimulus analysis and supplementary figures with time ranges from -500ms to 500ms. We have only refrained from doing so in the initial manuscript because our paradigm’s short interstimulus interval makes it difficult to interpret whether activity in the ISI reflects post-stimulus dynamics or pre-stimulus prediction. Nonetheless, we can easily show that in our paradigm, alpha/beta-band activity is elevated in the interstimulus interval after the offset of the previous stimulus, assuming that we baseline to the pre-trial period.

      (3) In many cases, some change is observed in the initial ~100 ms of stimulus onset, especially for the alpha/beta and theta ranges. However, the evoked response contributes substantially in the transient period in these frequencies, and this evoked response could be different for different conditions. The authors should show the evoked responses to confirm the same, and if the claim really is that predictions are carried by genuine "oscillatory" activity, show the results after removing the ERP (as they had done for the CSD analysis).

      We have included an extra sentence in our Materials and Methods section clarifying that the evoked potential/ERP was removed in our existing analyses, prior to performing the spectral decomposition of the LFP signal. We also note that the FOOOF analysis we applied separates aperiodic components of the spectral signal from the strictly oscillatory ones.

      In our revised manuscript we will include an analysis of the evoked responses as suggested by the reviewer.

      (4) I was surprised by the statistics used in the plots. Anything that is even slightly positive or negative is turning out to be significant. Perhaps the authors could use a more stringent criterion for multiple comparisons?

      As noted above to Reviewer 1 (point 4), we are happy to include supplemental figures in our resubmission showing the effects on our results of setting the statistical significance threshold with considerably greater stringency.

      (5) Since the design is blocked, there might be changes in global arousal levels. This is particularly important because the more predictive stimuli in the controlled deterministic stimuli were presented towards the end of the session, when the animal is likely less motivated. One idea to check for this is to do the analysis on the 3rd stimulus instead of the 4th? Any general effect of arousal/attention will be reflected in this stimulus.

      In order to check for the brain-wide effects of arousal, we plan to perform similar analyses to our existing ones on the 3rd stimulus in each block, rather than just the 4th “oddball” stimulus. Clusters that appear significantly contrasting in both the 3rd and 4th stimuli may be attributable to arousal. We will also analyze pupil size as an index of arousal to check for arousal differences between conditions in our contrasts, possibly stratifying our data before performing comparisons to equalize pupil size within contrasts. We plan to include these analyses in our resubmission.

      (6) The authors should also acknowledge/discuss that typical stimulus presentation/attention modulation involves both (i) an increase in broadband power early on and (ii) a reduction in low-frequency alpha/beta power. This could be just a sensory response, without having a role in sending prediction signals per se. So the predictive routing hypothesis should involve testing for signatures of prediction while ruling out other confounds related to stimulus/cognition. It is, of course, very difficult to do so, but at the same time, simply showing a reduction in low-frequency power coupled with an increase in high-frequency power is not sufficient to prove PR.

      Since different predictive coding and predictive processing hypotheses make very different claims about how predictions might be encoded in neurophysiological recordings, we have focused on prediction error encoding in this paper.

      For the hypothesis space we have considered (H1-H3), each hypothesis makes clearly distinguishable predictions about the spectral response during the time period in the task when prediction errors should be present. As noted by the reviewer, a transient increase in broadband frequencies would be a signature of H3. Changes to oscillatory power in the gamma band in distinct directions (e.g., increasing or decreasing with prediction error) would support either H1 or H2, depending on the direction of change. We believe our data, especially our use of FOOOF analysis and separation of periodic from aperiodic components, coupled to the three experimental contrasts, speak clearly in favor of the Predictive Routing model, but we do not claim we have “proved” it. This study provides just one datapoint, and we will acknowledge this in the Discussion of our resubmission.

      (7) The CSD results need to be explained better - you should explain on what basis they are being called feedforward/feedback. Was LFP taken from Layer 4 LFP (as was done by van Kerkoerle et al, 2014)? The nice ">" and "<" CSD patterns (Figure 3B and 3F of their paper) in that paper are barely observed in this case, especially for the alpha/beta range.

      We consider a feedforward pattern as flowing from L4 outwards to L2/3 and L5/6, and a feedback pattern as flowing in the opposite direction, from L1 and L6 to the middle layers. We will clarify this in the revised manuscript.

      Since gamma-band oscillations are strongest in L2/3, we re-epoched LFPs to the oscillation troughs in L2/3 in the initial manuscript. We can include in the revised manuscript equivalent plots after finding oscillation troughs in L4 instead, as well as calculating the difference in trough times within-band between layers to quantify the transmission delay and add additional rigor to our feedforward vs. feedback interpretation of the CSD data.
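      The trough-time comparison described above could be sketched as follows. This is a minimal illustration that assumes the signals from the two layers have already been band-pass filtered; the helper names `find_troughs` and `trough_delays_ms` are hypothetical and not taken from our analysis code.

```python
import numpy as np

def find_troughs(x):
    """Indices of strict local minima of a 1-D band-limited signal."""
    x = np.asarray(x, dtype=float)
    interior = (x[1:-1] < x[:-2]) & (x[1:-1] < x[2:])
    return np.flatnonzero(interior) + 1

def trough_delays_ms(ref, other, fs):
    """Delay (ms) from each trough in `ref` to the nearest trough in `other`.

    Positive values mean the trough in `other` occurs later than in `ref`.
    Assumes both signals contain at least one trough.
    """
    ref_t = find_troughs(ref)
    oth_t = find_troughs(other)
    # For every reference trough, pick the closest trough in the other layer.
    nearest = oth_t[np.abs(oth_t[None, :] - ref_t[:, None]).argmin(axis=1)]
    return (nearest - ref_t) / fs * 1000.0
```

      Summarizing the delays with a median keeps the estimate robust to edge effects, where the nearest trough in the other layer falls outside the recorded window.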

      (8) Figure 4a-c, I don't see a reduction in the broadband signal in a compared to b in the initial segment. Maybe change the clim to make this clearer?

      We are looking into the clim/colorbar and plot-generation code to figure out the visibility issue that the Reviewer has kindly pointed out to us.

      (9) Figure 5 - please show the same for all three frequency ranges, show all bars (including the non-significant ones), and indicate the significance (p-values or by *, **, ***, etc) as done usually for bar plots.

      We will add the requested bar-plots for all frequency ranges, though we note that the bars given here are the results of adding up the spectral power in the channel-time-frequency clusters that already passed significance tests and that adding secondary significance tests here may not prove informative.

      (10) Their claim of alpha/beta oscillations being suppressed for unpredictable conditions is not as evident. A figure akin to Figure 5 would be helpful to see if this assertion holds.

      As noted above, we will include the requested bar plot, and we will also examine alpha/beta in the pre-stimulus time-series rather than after the onset of the oddball stimulus.

      (11) To investigate the prediction and violation or confirmation of expectation, it would help to look at both the baseline and stimulus periods in the analyses.

      We will include for the Reviewer’s edification a supplementary figure showing the spectrograms for the baseline and full-trial periods to look at the difference between baseline and prestimulus expectation.

      Reviewer 3:

      Summary:

      In their manuscript entitled "Ubiquitous predictive processing in the spectral domain of sensory cortex", Sennesh and colleagues perform spectral analysis across multiple layers and areas in the visual system of mice. Their results are timely and interesting as they provide a complement to a study from the same lab focussed on firing rates, instead of oscillations. Together, the present study argues for a hypothesis called predictive routing, which argues that non-predictable stimuli are gated by Gamma oscillations, while alpha/beta oscillations are related to predictions.

      Strengths:

      (1) The study contains a clear introduction, which provides a clear contrast between a number of relevant theories in the field, including their hypotheses in relation to the present data set.

      (2) The study provides a systematic analysis across multiple areas and layers of the visual cortex.

      We thank the Reviewer for their kind comments.

      Weaknesses:

      (1) It is claimed in the abstract that the present study supports predictive routing over predictive coding; however, this claim is nowhere in the manuscript directly substantiated. Not even the differences are clearly laid out, much less tested explicitly. While this might be obvious to the authors, it remains completely opaque to the reader, e.g., as it is also not part of the different hypotheses addressed. I guess this result is meant in contrast to reference 17, by some of the same authors, which argues against predictive coding, while the present work finds differences in the results, which they relate to spectral vs firing rate analysis (although without direct comparison).

      We agree that in this manuscript we should restrict ourselves to the hypotheses that were directly tested. We have revised our abstract accordingly,  and softened our claim to note only that our LFP results are compatible with predictive routing.

      (2) Most of the claims about a direction of propagation of certain frequency-related activities (made in the context of Figures 2-4) are - to the eyes of the reviewer - not supported by actual analysis but glimpsed from the pictures, sometimes, with very little evidence/very small time differences to go on. To keep these claims, proper statistical testing should be performed.

      In our revised manuscript, we will either substantiate (with quantification of CSD delays between layers) or soften the claims about feedforward/feedback direction of flow within the cortical column.

      (3) Results from different areas are barely presented. While I can see that presenting them in the same format as Figures 2-4 would be quite lengthy, it might be a good idea to contrast the right columns (difference plots) across areas, rather than just the overall averages.

      In our revised manuscript we will gladly include a supplementary figure showing the right-column difference plots across areas, in order to make sure to include aspects of our dataset that span up and down the cortical hierarchy.

      (4) Statistical testing is treated very generally, which can help to improve the readability of the text; however, in the present case, this is a bit extreme, with even obvious tests not reported or not even performed (in particular in Figure 5).

      We appreciate the Reviewer’s concern for statistical rigor, and as noted to the other reviewers, we can add different levels of statistical description and describe the p-values associated with specific clusters. Regarding Figure 5, we must protest, as the bar heights were computed from clusters already subjected to statistical testing and found significant. We could add a supplementary figure which considers untested narrowband activity and tests it only in the “bar height” domain, if the Reviewer would like.

      (5) The description of the analysis in the methods is rather short and, to my eye, was missing one of the key descriptions, i.e., how the CSD plots were baselined (which was hinted at in the results, but, as far as I know, not clearly described in the analysis methods). Maybe the authors could section the methods more to point out where this is discussed.

      We have added some elaboration to our Materials and Methods section, especially to specify that CSD, having physical rather than arbitrary units, does not require baselining.
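      To make the point about physical units concrete, here is a minimal sketch of the standard second-spatial-derivative CSD estimate. The conductivity value of 0.3 S/m is a commonly assumed figure rather than a value taken from our Methods, and the function name `csd_second_derivative` is hypothetical.

```python
import numpy as np

def csd_second_derivative(lfp, spacing_m, conductivity=0.3):
    """Estimate CSD from laminar LFP by a second spatial difference.

    lfp: array (n_channels, n_times) in volts, ordered by cortical depth.
    spacing_m: distance between adjacent channels in metres.
    conductivity: assumed tissue conductivity in S/m.
    Returns an array (n_channels - 2, n_times) in A/m^3.
    """
    lfp = np.asarray(lfp, dtype=float)
    # Discrete second derivative along the depth axis.
    d2 = lfp[2:] - 2.0 * lfp[1:-1] + lfp[:-2]
    return -conductivity * d2 / spacing_m ** 2
```

      Because the estimate depends only on potential differences between neighbouring channels, adding a constant offset to every channel leaves it unchanged.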

      (6) While I appreciate the efforts of the authors to formulate their hypotheses and test them clearly, the text is quite dense at times. Partly this is due to the compared conditions in this paradigm; however, it would help a lot to show a visualization of what is being compared in Figures 2-4, rather than just showing the results.

      In the revised manuscript we will add a visual aid for the three contrasts we consider.

      We are happy to inform the editors that we have implemented, for the Reviewed Preprint, the direct textual Recommendations for the Authors given by Reviewers 2 and 3. We will implement the suggested Figure changes in our revised manuscript. We thank them for their feedback in strengthening our manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study develops and validates a neural subspace similarity analysis for testing whether neural representations of graph structures generalize across graph size and stimulus sets. The authors show the method works in rat grid and place cell data, finding that grid but not place cells generalize across different environments, as expected. The authors then perform additional analyses and simulations to show that this method should also work on fMRI data. Finally, the authors test their method on fMRI responses from the entorhinal cortex (EC) in a task that involves graphs that vary in size (and stimulus set) and statistical structure (hexagonal and community). They find neural representations of stimulus sets in lateral occipital complex (LOC) generalize across statistical structure and that EC activity generalizes across stimulus sets/graph size, but only for the hexagonal structures.

      Strengths:

      (1) The overall topic is very interesting and timely and the manuscript is well-written.

      (2) The method is clever and powerful. It could be important for future research testing whether neural representations are aligned across problems with different state manifestations.

      (3) The findings provide new insights into generalizable neural representations of abstract task states in the entorhinal cortex.

      We thank the reviewer for their kind comments and clear summary of the paper and its strengths.

      Weaknesses:

      (1) The manuscript would benefit from improving the figures. Moreover, the clarity could be strengthened by including conceptual/schematic figures illustrating the logic and steps of the method early in the paper. This could be combined with an illustration of the remapping properties of grid and place cells and how the method captures these properties.

      We agree with the reviewer and have added a schematic figure of the method (figure 1a).

      (2) Hexagonal and community structures appear to be confounded by training order. All subjects learned the hexagonal graph always before the community graph. As such, any differences between the two graphs could thus be explained (in theory) by order effects (although this is practically unlikely). However, given community and hexagonal structures shared the same stimuli, it is possible that subjects had to find ways to represent the community structures separately from the hexagonal structures. This could potentially explain why the authors did not find generalizations across graph sizes for community structures.

      We thank the reviewer for their comments. We agree that the null result regarding the community structures does not mean that EC doesn’t generalise over these structures, and that the training order could in theory contribute to the lack of an effect. The decision to keep the asymmetry of the training order was deliberate: we chose this order based on our previous study (Mark et al. 2020), where we show that learning a community structure first changes the learning strategy of subsequent graphs. We could have perhaps overcome this by increasing the training periods, but 1) the training period is already very long; 2) there will still be asymmetry because the group that first learns the community structure will struggle in learning the hexagonal graph more than vice versa, as shown in Mark et al. 2020.

      We have added the following sentences on this decision to the Methods section:

      “We chose to first teach hexagonal graphs for all participants and not randomize the order because of previous results showing that first learning community structure changes participants’ learning strategy (Mark et al. 2020).”

      (3) The authors include the results from a searchlight analysis to show the specificity of the effects of EC. A better way to show specificity would be to test for a double dissociation between the visual and structural contrast in two independently defined regions (e.g., anatomical ROIs of LOC and EC).

      Thanks for this suggestion. We indeed tried to run the analysis in a whole-ROI approach, but this did not result in a significant effect in EC. Importantly, we disagree with the reviewer that this is a “better way to show specificity” than the searchlight approach. In our view, the two analyses differ with respect to the spatial extent of the representation they test for. The searchlight approach is testing for a highly localised representation on the scale of small spheres with only 100 voxels. The signal of such a localised representation is likely to be drowned in the noise in an analysis that includes thousands of voxels which mostly don’t show the effect - as would be the case in the whole-ROI approach.

      (4) Subjects had more experience with the hexagonal and community structures before and during fMRI scanning. This is another confound, and possible reason why there was no generalization across stimulus sets for the community structure.

      See our response to comment (2).

      Reviewer #2 (Public review):

      Summary:

      Mark and colleagues test the hypothesis that entorhinal cortical representations may contain abstract structural information that facilitates generalization across structurally similar contexts. To do so, they use a method called "subspace generalization" designed to measure abstraction of representations across different settings. The authors validate the method using hippocampal place cells and entorhinal grid cells recorded in a spatial task, then perform simulations that support that it might be useful in aggregated responses such as those measured with fMRI. Then the method is applied to fMRI data that required participants to learn relationships between images in one of two structural motifs (hexagonal grids versus community structure). They show that the BOLD signal within an entorhinal ROI shows increased measures of subspace generalization across different tasks with the same hexagonal structure (as compared to tasks with different structures) but that there was no evidence for the complementary result (ie. increased generalization across tasks that share community structure, as compared to those with different structures). Taken together, this manuscript describes and validates a method for identifying fMRI representations that generalize across conditions and applies it to reveal entorhinal representations that emerge across specific shared structural conditions.

      Strengths:

      I found this paper interesting both in terms of its methods and its motivating questions. The question asked is novel and the methods employed are new - and I believe this is the first time that they have been applied to fMRI data. I also found the iterative validation of the methodology to be interesting and important - showing persuasively that the method could detect a target representation - even in the face of a random combination of tuning and with the addition of noise, both being major hurdles to investigating representations using fMRI.

      We thank the reviewer for their kind comments and the clear summary of our paper.

      Weaknesses:

      In part because of the thorough validation procedures, the paper came across to me as a bit of a hybrid between a methods paper and an empirical one. However, I have some concerns, both on the methods development/validation side, and on the empirical application side, which I believe limit what one can take away from the studies performed.

      We thank the reviewer for the comment. We agree that the paper comes across as a bit of a methods-empirical hybrid. We chose to do this because we believe (as the reviewer also points out) that there is value in both aspects of the paper.

      Regarding the methods side, while I can appreciate that the authors show how the subspace generalization method "could" identify representations of theoretical interest, I felt like there was a noticeable lack of characterization of the specificity of the method. Based on the main equation in the results section of the paper, it seems like the primary measure used here would be sensitive to overall firing rates/voxel activations, variance within specific neurons/voxels, and overall levels of correlation among neurons/voxels. While I believe that reasonable pre-processing strategies could deal with the first two potential issues, the third seems a bit more problematic - as obligate correlations among neurons/voxels surely exist in the brain and persist across context boundaries that are not achieving any sort of generalization (for example neurons that receive common input, or voxels that share spatial noise). The comparative approach (ie. computing difference in the measure across different comparison conditions) helps to mitigate this concern to some degree - but not completely - since if one of the conditions pushes activity into strongly spatially correlated dimensions, as would be expected if univariate activations were responsive to the conditions, then you'd expect generalization (driven by shared univariate activation of many voxels) to be specific to that set of conditions.

      We thank the reviewer for their comments. We would like to point out that we demean each voxel within all states/piles (three-picture sequences) in a given graph/task (what the reviewer is calling “a condition”). Hence there is no shared univariate activation of many voxels in response to a graph going into the computation, and no sensitivity to the overall firing rate/voxel activation. Our calculation captures the variance across states within a task (here a graph), over and above the univariate effect of graph activity. In addition, we spatially pre-whiten the data within each searchlight, meaning that noisy voxels with high noise variance will be downweighted and noise correlations between voxels are removed prior to applying our method.
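      The two preprocessing steps described above can be sketched as follows. This is an illustrative approximation, not the RSA toolbox implementation: the noise covariance is estimated from hypothetical GLM residuals, and the fixed `shrinkage` toward the diagonal is an assumption made to keep the estimate well conditioned.

```python
import numpy as np

def demean_and_prewhiten(B, residuals, shrinkage=0.1):
    """Demean voxels across states, then spatially pre-whiten.

    B: (n_voxels, n_states) activation matrix for one graph.
    residuals: (n_timepoints, n_voxels) GLM residuals (assumed input).
    shrinkage: assumed fixed shrinkage of the noise covariance
               toward its diagonal.
    """
    # Remove each voxel's mean across states (the univariate effect).
    Bc = B - B.mean(axis=1, keepdims=True)
    # Voxel-by-voxel noise covariance from the residuals.
    sigma = np.cov(residuals, rowvar=False)
    sigma = (1 - shrinkage) * sigma + shrinkage * np.diag(np.diag(sigma))
    # Inverse matrix square root via the eigendecomposition.
    evals, evecs = np.linalg.eigh(sigma)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return whitener @ Bc
```

      Voxels with large noise variance receive small weights in `whitener`, and its off-diagonal terms remove noise correlations between voxels.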

      A second issue in terms of the method is that there is no comparison to simpler available methods. For example, given the aims of the paper, and the introduction of the method, I would have expected the authors to take the Neuron-by-Neuron correlation matrices for two conditions of interest, and examine how similar they are to one another, for example by correlating their lower triangle elements. Presumably, this method would pick up on most of the same things - although it would notably avoid interpreting high overall correlations as "generalization" - and perhaps paint a clearer picture of exactly what aspects of correlation structure are shared. Would this method pick up on the same things shown here? Is there a reason to use one method over the other?

      We thank the reviewer for this important and interesting point. We agree that calculating the correlation between the upper triangular elements of the covariance or correlation matrices picks up similar, but not identical, aspects of the data (see the mathematical explanation that was added to the supplementary information). When we repeated the searchlight analysis and calculated the correlation between the upper triangular entries of the Pearson correlation matrices, we obtained an effect in the EC, though weaker than with our subspace generalization method (t=3.9, the effect did not survive multiple comparisons). Similar results were obtained with the correlation between the upper triangular elements of the covariance matrices (t=3.8, the effect did not survive multiple comparisons).

      The difference between the two methods is twofold: 1) Our method is based on the covariance matrix and not the correlation matrix - i.e. a difference in normalisation. We realised that in the main text of the original paper we mistakenly wrote “correlation matrix” rather than “covariance matrix” (though our equations did correctly show the covariance matrix). We have corrected this mistake in the revised manuscript. 2) The weighting of the variance explained in the direction of each eigenvector is different between the methods, with some benefits of our method for identifying low-dimensional representations and for robustness to strong spatial correlations.  We have added a section “Subspace Generalisation vs correlating the Neuron-by-Neuron correlation matrices” to the supplementary information with a mathematical explanation of these differences.
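      For reference, the simpler comparison suggested by the reviewer can be sketched in a few lines. The function name `upper_triangle_correlation` is hypothetical, and the sketch assumes both activation matrices share the same voxels.

```python
import numpy as np

def upper_triangle_correlation(B1, B2):
    """Correlate the off-diagonal entries of two voxel-by-voxel
    Pearson correlation matrices.

    B1, B2: (n_voxels, n_states) activation matrices for two tasks.
    """
    R1 = np.corrcoef(B1)  # rows are treated as variables (voxels)
    R2 = np.corrcoef(B2)
    iu = np.triu_indices_from(R1, k=1)  # strictly upper triangle
    return np.corrcoef(R1[iu], R2[iu])[0, 1]
```

      Because each voxel is normalised by its own variance here, this measure discards the variance weighting that the covariance-based approach retains.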

      Regarding the fMRI empirical results, I have several concerns, some of which relate to concerns with the method itself described above. First, the spatial correlation patterns in fMRI data tend to be broad and will differ across conditions depending on variability in univariate responses (ie. if a condition contains some trials that evoke large univariate activations and others that evoke small univariate activations in the region). Are the eigenvectors that are shared across conditions capturing spatial patterns in voxel activations? Or, related to another concern with the method, are they capturing changing correlations across the entire set of voxels going into the analysis? As you might expect if the dynamic range of activations in the region is larger in one condition than the other?

      This is a searchlight analysis, therefore it captures the activity patterns within nearby voxels. Indeed, as we show in our simulation, areas with high activity and therefore high signal to noise will have better signal in our method as well. Note that this is true of most measures.

      My second concern is, beyond the specificity of the results, they provide only modest evidence for the key claims in the paper. The authors show a statistically significant result in the Entorhinal Cortex in one out of two conditions that they hypothesized they would see it. However, the effect is not particularly large. There is currently no examination of what the actual eigenvectors that transfer are doing/look like/are representing, nor how the degree of subspace generalization in EC may relate to individual differences in behavior, making it hard to assess the functional role of the relationship. So, at the end of the day, while the methods developed are interesting and potentially useful, I found the contributions to our understanding of EC representations to be somewhat limited.

      We agree with this point, yet believe that the results still shed light on EC functionality. Unfortunately, we could not find a correlation between behavioral measures and the fMRI effect.

      Reviewer #3 (Public review):

      Summary:

      The article explores the brain's ability to generalize information, with a specific focus on the entorhinal cortex (EC) and its role in learning and representing structural regularities that define relationships between entities in networks. The research provides empirical support for the longstanding theoretical and computational neuroscience hypothesis that the EC is crucial for structure generalization. It demonstrates that EC codes can generalize across non-spatial tasks that share common structural regularities, regardless of the similarity of sensory stimuli and network size.

      Strengths:

      (1) Empirical Support: The study provides strong empirical evidence for the theoretical and computational neuroscience argument about the EC's role in structure generalization.

      (2) Novel Approach: The research uses an innovative methodology and applies the same methods to three independent data sets, enhancing the robustness and reliability of the findings.

      (3) Controlled Analysis: The results are robust against well-controlled data and/or permutations.

      (4) Generalizability: By integrating data from different sources, the study offers a comprehensive understanding of the EC's role, strengthening the overall evidence supporting structural generalization across different task environments.

      Weaknesses:

      A potential criticism might arise from the fact that the authors applied innovative methods originally used in animal electrophysiology data (Samborska et al., 2022) to noisy fMRI signals. While this is a valid point, it is noteworthy that the authors provide robust simulations suggesting that the generalization properties in EC representations can be detected even in low-resolution, noisy data under biologically plausible assumptions. I believe this is actually an advantage of the study, as it demonstrates the extent to which we can explore how the brain generalizes structural knowledge across different task environments in humans using fMRI. This is crucial for addressing the brain's ability in non-spatial abstract tasks, which are difficult to test in animal models.

      While focusing on the role of the EC, this study does not extensively address whether other brain areas known to contain grid cells, such as the mPFC and PCC, also exhibit generalizable properties. Additionally, it remains unclear whether the EC encodes unique properties that differ from those of other systems. As the authors noted in the discussion, I believe this is an important question for future research.

      We thank the reviewer for their comments. We agree with the reviewer that this is a very interesting question. We looked for effects in the mPFC but did not obtain results strong enough to report in the main manuscript; we do, however, report a small effect in the supplementary.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I wonder how important the PCA on B1 (voxel-by-state matrix from environment 1) and the computation of the AUC (from the projection on B2 [voxel-by-state matrix from environment 2]) is for the analysis to work. Would you not get the same result if you correlated the voxel-by-voxel correlation matrix based on B1 (C1) with the voxel-by-voxel correlation matrix based on B2 (C2)? I understand that you would not have the subspace-by-subspace resolution that comes from the individual eigenvectors, but would the AUC not strongly correlate with the correlation between C1 and C2?

      We agree with the reviewer’s comments - see our response to Reviewer 2’s second issue above.

      (2) There is a subtle difference between how the method is described for the neural recording and fMRI data. Line 695 states that principal components of the neuron x neuron intercorrelation matrix are computed, whereas line 888 implies that principal components of the data matrix B are computed. Of note, B is a voxel x pile rather than a pile x voxel matrix. Wouldn't this result in U being pile x pile rather than voxel x voxel?

      The PCs are calculated on the neuron x neuron (or voxel x voxel) covariance matrix of the activation matrix. We’ve added the following clarification to the relevant part of the Methods:

      “We calculated noise normalized GLM betas within each searchlight using the RSA toolbox. For each searchlight and each graph, we had a nVoxels (100) by nPiles (10) activation matrix (B) that describes the activation of a voxel as a result of a particular pile (three pictures’ sequence). We exploited the (voxel x voxel) covariance matrix of this matrix to quantify the manifold alignment within each searchlight.”
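      The computation described in this clarification can be sketched as below. It is a minimal illustration under stated assumptions, not the code from our repository: the function name `subspace_generalization_auc` is hypothetical, and B1 and B2 stand for the nVoxels-by-nPiles activation matrices of two graphs within one searchlight.

```python
import numpy as np

def subspace_generalization_auc(B1, B2):
    """Variance of task 2 explained by the principal axes of task 1.

    B1, B2: (n_voxels, n_states) activation matrices.
    Returns (cumulative variance explained, area under that curve).
    """
    # Demean each voxel across states.
    B1c = B1 - B1.mean(axis=1, keepdims=True)
    B2c = B2 - B2.mean(axis=1, keepdims=True)
    # Voxel-by-voxel covariance matrix of task 1.
    C1 = B1c @ B1c.T / B1c.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C1)
    U = eigvecs[:, np.argsort(eigvals)[::-1]]  # descending variance
    # Variance of task 2 along each task-1 eigenvector.
    proj_var = np.sum((U.T @ B2c) ** 2, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(proj_var) / np.sum(B2c ** 2)])
    frac = np.linspace(0.0, 1.0, cum.size)  # fraction of eigenvectors
    # Trapezoidal area under the cumulative-variance curve.
    auc = np.sum((cum[1:] + cum[:-1]) / 2.0 * np.diff(frac))
    return cum, auc
```

      Note that U has one column per voxel, so the eigenvectors live in voxel space even though B itself is voxel-by-pile.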

      (3) It would be very helpful to the field if the authors would make the code and data publicly available. Please consider depositing the code for data analysis and simulations, as well as the preprocessed/extracted data for the key results (rat data/fMRI ROI data) into a publicly accessible repository.

      The code is publicly available in git (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).

      (4) Line 219: "Kolmogorov Simonov test" should be "Kolmogorov Smirnov test".

      Thanks!

      (5) Please put plots in Figure 3F on the same y-axis.

      (6) Were large and small graphs of a given statistical structure learned on the same days, and if so, sequentially or simultaneously? This could be clarified.

      The graphs are learned on the same day.  We clarified this in the Methods section.

      Reviewer #2 (Recommendations for the authors):

      Perhaps the advantage of the method described here is that you could narrow things down to the specific eigenvector that is doing the heavy lifting in terms of generalization... and then you could look at that eigenvector to see what aspect of the covariance structure persists across conditions of interest. For example, is it just the highest eigenvalue eigenvector that is likely picking up on correlations across the entire neural population? Or is there something more specific going on? One could start to get at this by looking at Figures 1A and 1C - for example, the primary difference for within/between condition generalization in 1C seems to emerge with the first component, and not much changes after that, perhaps suggesting that in this case, the analysis may be picking up on something like the overall level of correlations within different conditions, rather than a more specific pattern of correlations.

      The nature of the analysis means that the eigenvectors are ordered by their contribution to the variance; the first eigenvector therefore accounts for more variance than the others. We did not rigorously check whether the remaining variance is then split equally among the remaining eigenvectors, but this does not seem to be the case.

      Why is variance explained above zero for fraction EVs = 0 for figure 1C (but not 1A) ? Is there some plotting convention that I'm missing here?

      There was a small bug in this plot and it was corrected - thank you very much!

      The authors say:

      "Interestingly, the difference in AUCs was also significantly smaller than chance for place cells (Figure 1a, compare dotted and solid green lines, p<0.05 using permutation tests, see statistics and further examples in supplementary material Figure S2), consistent with recent models predicting hippocampal remapping that is not fully random (Whittington et al. 2020)."

      But my read of the Whittington model is that it would predict slight positive relationships here, rather than the observed negative ones, akin to what one would expect if hippocampal neurons reflect a nonlinear summation of a broad swath of entorhinal inputs.

      Smaller differences than chance imply that the remapping of place cells is not completely random.

      Figure 2:

      I didn't see any description of where noise amplitude values came from - or any justification at all in that section. Clearly, the amount of noise will be critical for putting limits on what can and cannot be detected with the method - I think this is worthy of characterization and explanation. In general, more information about the simulations is necessary to understand what was done in the pseudovoxel simulations. I get the gist of what was done, but these methods should be clear enough that someone could repeat them, and they currently are not.

      Thanks, we added noise amplitude to the figure legend and Methods.

      What does flexible mean in the title? The analysis only worked for the hexagonal grid - doesn't that suggest that whatever representations are uncovered here are not flexible in the sense of being able to encode different things?

      Flexible here means flexible over stimulus characteristics that are not related to the structural form, such as the stimuli themselves, the size of the graph, etc.

      Reviewer #3 (Recommendations for the authors):

      I have noticed that the authors have updated the previous preprint version to include extensive simulations. I believe this addition helps address potential criticisms regarding the signal-to-noise ratio. If the authors could share the code for the fMRI data and the simulations in an open repository, it would enhance the study's impact by reaching a broader readership across various research fields. Except for that, I have nothing to ask for revision.

      Thanks, the code will be publicly available: (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC-2025-03160

      Corresponding author(s) Padinjat, Raghu

      [The “revision plan” should delineate the revisions that authors intend to carry out in response to the points raised by the referees. It also provides the authors with the opportunity to explain their view of the paper and of the referee reports.

      The document is important for the editors of affiliate journals when they make a first decision on the transferred manuscript. It will also be useful to readers of the reprint and help them to obtain a balanced view of the paper.

      If you wish to submit a full revision, please use our "Full Revision" template. It is important to use the appropriate template to clearly inform the editors of your intentions.]

      1. General Statements [optional]

      We thank all three reviewers for appreciating the novelty of our analysis of CERT function in a physiological context in vivo. While many studies have been published on the biochemistry and function of CERT in cultured cells, there are few studies, if any, relating CERT function at the biochemical level to its role in a physiological process, in our case the electrical response to light.

      We also thank all reviewers for commenting on the importance of our rescue of dcert mutants with hCERT and the scientific insights raised by this experiment. All reviewers have also noted the importance of strengthening our observation that hCERT, in these cells, is localized at ER-PM MCS rather than the more widely reported localization at the Golgi. We highlight that many excellent studies that have localized CERT at the Golgi were performed in cultured, immortalized, mammalian cells. There are limited studies on the localization of this protein in primary cells, neurons or polarized cells. With the additional experiments we have proposed in the revision for this aspect of the manuscript, we believe the findings will be of great novelty and widespread interest.

      We believe we can address almost all points raised by reviewers thereby strengthening this exciting manuscript.

      2. Description of the planned revisions

      Insert here a point-by-point reply that explains what revisions, additional experimentations and analyses are planned to address the points raised by the referees.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This manuscript dissects the physiological function of ceramide transfer protein (CERT) by studying the phenotype of CERT null Drosophila.

      dCERT null animals have a reduced electrical response to light in their photoreceptors, reduced baseline PIP2 accumulation in the cells and delayed re-synthesis of PIP2 and its precursor, PI4P, after light stimulation. There are also reduced ER:PM contact sites at the rhabdomere and a corresponding reduction in the localization of the PI/PA exchange protein, RDGB, at this site. Therefore, the animals seem to have an impaired ability for sustaining phototransduction, which is nonetheless milder than that seen after loss of RDGB, for example. In terms of biochemical function, there is no overall change in ceramides, with some minor increases in specific short chain pools. There is however a large decrease in PE-ceramide species, again selective for a few molecular species. Curiously, decreasing ceramides with a mutant in ceramide synthesis is able to partially rescue both the electrical response and RDGB localization in dCERT flies, implying the increased ceramide species contribute to the phenotype. In addition, a mutation in PE-ceramide synthase largely phenocopies the dCERT null, exhibiting both increased ceramides and decreased PE-ceramide.

      In addition, dCERT flies were shown to have reduced localization of some plasma membrane proteins to detergent-resistant membrane fractions, as well as up regulation of the IRE1 and PERK stress-response pathways. Finally, dCERT nulls could be rescued with the human CERT protein, demonstrating conservation of core physiological function between these animals. Surprisingly, CERT is reported to localize to the ER:PM junctions at rhabdomeres, as opposed to the expected ER:Golgi contact sites. Specific areas where the manuscript could be strengthened include:

      Figure 2 studies the phototransduction system. Although clear changes in PI4P and PIP2 are seen, it would be interesting to see if changed PA accumulation occur in the dCERT animals, since RDGB localization is disrupted: this is expected to cause PM PA accumulation along with reduced PIP2 synthesis.

      The reviewer raises an important question about PA levels. In the present study we have noticed that localization of RDGB at the base of the rhabdomere in dcert1 is reduced but not completely removed. Consequently, one may consider the situation in dcert1 as a partial loss of function of RDGB and, consistent with this, the delay in PI4P and PI(4,5)P2 resynthesis is not as severe as in rdgB9, which is a strong hypomorph (PMID: 26203165).

      rdgB9 mutants also show an elevation in PA levels, and the reviewer is right that one might expect changes in PA levels too, as RDGB is a PI/PA transfer protein. We expect that, if measured, there will be a modest elevation in PA levels. However, previous work has shown that elevation of PA levels at or close to the rhabdomere leads to retinal degeneration. Specifically, elevated PA levels caused by dPLD overexpression disrupt rhabdomere biogenesis and lead to retinal degeneration (PMID: 19349583). Similarly, loss of the lipid transfer protein RDGB leads to photoreceptor degeneration (PMID: 26203165). In this study, we report that retinal degeneration is not a phenotype of dcert1. Thus measurements of PA levels, though interesting, may not be that informative in the context of the present study. However, if necessary, we can measure PA levels in dcert1.

      Lines 228-230 state: "These findings suggest an important contribution for reduced PE - Cer levels in the eye phenotypes of dcert". Does it not also suggest a contribution of the elevated ceramide species, since these are also observed in the CPES animals?

      We agree with the reviewer that not only reduced PE-Ceramide but also elevated ceramide levels in GMR>CPESi could contribute to the eye phenotype. This statement will be revised to reflect this conclusion.

      Figure 6D is a key finding that human CERT localized to the rhabdomere at ER:PM contact sites, though the reviewer was not convinced by these images. Is the protein truly localized to the contact sites, or simply have a pool of over-expressed protein localized to the surrounding cytoplasm? It also does not rule out localization (and therefore function) at ER:PM contact sites.

      Since hCERT completely rescued eye phenotype of dcert1 the localization we observe for hCERT must be at least partly relevant. We will perform additional IHC experiments to

      • Co-localize hCERT with an ER-PM MCS marker, e.g. RDGB, in wild type flies.
      • Co-localize hCERT with VAP-A, which is enriched at the ER-PM MCS. This should help to determine if there are MCS and non-MCS pools of hCERT in these cells.
      • Test if there is a pool of hCERT in these cells that also localizes (or not) with the Golgi marker Golgin 84.

      These will be included in the revision to strengthen this important point.

      Statistics: There are a large number of t-tests employed that do not correct for multiple comparisons, for example in figures 3B, 3D, 3H, 4C, 6C, S2A, S2B, S3B and S3C.

      We will apply multiple-comparison corrections to the data mentioned and incorporate the results in the revised manuscript.
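
      As an illustration of what such a correction involves, the sketch below applies the Holm-Bonferroni step-down adjustment to a set of p-values. This is only an assumed example: the response does not state which correction will be used, and the function name `holm_bonferroni` and the p-values are hypothetical, not taken from the manuscript.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction (illustrative sketch).

    Returns (adjusted_p_values, reject_flags), both in the input order.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted p-values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


if __name__ == "__main__":
    # Hypothetical raw p-values from four pairwise t-tests.
    raw = [0.01, 0.04, 0.03, 0.005]
    adjusted, reject = holm_bonferroni(raw)
    for p, q, r in zip(raw, adjusted, reject):
        print(f"raw={p:.3f} adjusted={q:.3f} significant={r}")
```

      Holm's procedure is shown only because it controls the family-wise error rate without distributional assumptions; a false-discovery-rate method or an ANOVA with post hoc tests may be more appropriate depending on the experimental design.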

      There are two Western blotting sections in the methods.

      The first Western blotting section describes the general blots in the paper. The second Western blotting section relates to the samples from detergent resistant membrane (DRM) fractions. We will clearly explain this in the methods section of the manuscript.

      Reviewer #1 (Significance (Required)):

      Overall, the manuscript is clearly and succinctly written, with the data well presented and mostly convincing. The paper demonstrates clear phenotypes associated with loss of dCERT function, with surprising consequences for the function of a signaling system localized to ER:PM contact sites. To this reviewer, there seem to be three cogent observations of the paper: (i) loss of dCERT leads to accumulation of ceramides and loss of PE-ceramide, which together drive the phenotype. (ii) this ceramide alteration disrupts ER:PM contact sites and thus impairs phototransduction and (iii) rescue by human CERT and its apparent localization to ER:PM contact sites implies a potential novel site of action. Although surprising and novel, the significance of these observations is a little unclear: there is no obvious mechanism by which the elevated ceramide species and decreased PE-ceramide cause the specific failure in phototransduction, and the evidence for a novel site of action of CERT at the ER:PM contact sites is not compelling. Therefore, although an interesting and novel set of observations, the manuscript does not reveal a clear mechanistic basis for CERT physiological function.

      We thank the reviewer for appreciating the quality of our manuscript while also highlighting points through which its impact can be enhanced. To our knowledge this is one of the first studies to tackle the challenging problem of a role for CERT in physiological function. We would like to highlight two points raised:

      • We do understand that the localisation of hCERT at ER-PM MCS is unusual compared to the traditionally reported localization to ER-Golgi sites. This is important for the overall interpretation of the results in the paper on how dCERT regulates phototransduction. As indicated in response to an earlier comment by the reviewer, we will perform additional experiments to strengthen our conclusion on the localization of hCERT.
      • With regard to how loss of dCERT affects phototransduction, we feel two likely mechanisms contribute. If the localization of hCERT to ER-PM MCS is verified through additional experiments (see proposal above), then it is important to note that ER-PM MCS in these cells includes the SMC (smooth endoplasmic reticulum), the major site of lipid synthesis. It is possible that loss of dCERT leads to ceramide accumulation in the smooth ER and disruption of ER-PM contacts. That may explain why reducing the levels of ceramide at this site partially rescues the eye phenotype.

      The multi-protein INAD-TRP-NORPA complex, central to phototransduction, has previously been shown to localise to DRMs in photoreceptors. PE-Ceramides are important contributors to the formation of plasma membrane DRMs, and we have presented biochemical evidence that the formation of these DRMs is reduced in dcert1. This may be a mechanism contributing to reduced phototransduction. This latter mechanism has been proposed as a physiological function of DRMs, but we think our data may be the first to show it in a physiological model.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary Non-vesicular lipid transfer by lipid transfer proteins regulates organelle lipid compositions and functions. CERT transfers ceramide from the ER to Golgi to produce sphingomyelin, although CERT function in animal development and physiology is less clear. Using dcert1 (a protein-null allele), this paper shows a disruption of the sole Drosophila CERT gene causes reduced ERG amplitude in photoreceptors. While the level and localization of phototransduction machinery appears unaffected, the level of PIP2 and the localization of RDGB are perturbed. Collectively, these observations establish a novel link between CERT and phospholipase signaling in phototransduction. To understand the molecular mechanism further, the authors performed lipid chromatography and mass spec to characterize ceramide species in dcert1. This analysis reveals that whereas the total ceramide remains unaffected, most PE-ceramide species are reduced. The authors use lace mutant (serine palmitoyl transferase) and CPES (ceramide phosphoethanolamine synthase) RNAi to distinguish whether it is the accumulation of ceramide in the ER or the reduction of sphingolipid derivatives in the Golgi that is the cause for the reduced ERG amplitude. Mutating one copy of lace reduces ceramide level by 50% and partially rescues the ERG defect, suggesting that the accumulation of ceramide in the ER is a cause. CPES RNAi phenocopies the reduced ERG amplitude, suggesting the production of certain sphingolipid is also relevant.

      Major comments: 1. By showing the reduced PIP2 level, the decreased SMC sites at the base of rhabdomeres, and the diffused RDGB localization in dcert1, the authors favor the model, in which the disruption of ceramide metabolism affects PIP transport. However, it is unclear if the reduced PIP2 level (i.e., reduced PH-PLCd::GFP staining) is specific to the rhabdomeres. It should be possible to compare PH-PLCd::GFP signals in different plasma membranes between wildtype and dcert1. If PH-PLCd::GFP signal is specifically reduced at the rhabdomeres, this conclusion will be greatly strengthened. In addition, the photoreceptor apical plasma membrane includes rhabdomere and stalk membrane. Is the PH-PLCd::GFP signal at the stalk membrane also affected?

      Due to the physical organization of optics in the fly eye, the pseudopupil imaging method used in this study collects the signal for the PIP2 probe (PH-PLCd::GFP) mainly from the apical rhabdomere membrane of photoreceptors in live imaging experimental mode. Therefore, the PIP2 signal from these experiments cannot be used to interpret the level of PIP2 either at the stalk membrane or indeed the basolateral membrane.

      The point raised by the reviewer, i.e whether CERT selectively controls PIP2 levels at the rhabdomere membrane or not, is an interesting one. To do this, we will need to fix fly photoreceptors and determine the PH-PLCd::GFP signal using single slice confocal imaging. When combined with a stalk marker such as CRUMBS, it should be possible to address the question of which are the membrane domains at which dCERT controls PIP2 levels. If the sole mechanism of action of dCERT is via disruption of ER-PM MCS then only the apical rhabdomere membrane PIP2 should be affected leaving the stalk membrane and basolateral membrane unaffected.

      Thank you very much for raising this specific point.

      The analysis of RDGB localization should be done in mosaic dcert1 retinas, which will be more convincing with internal control for each comparison. In addition, the phalloidin staining in Figure 2J shows distinct patterns of adherens junctions, indicating that the wildtype and dcert1 were imaged at different focal planes.

      We understand that mosaics offer an alternative and elegant way to perform a side-by-side analysis of control and mutant. However, this would require a significant investment of time and effort. In addition, a mosaic analysis would compromise our ERG analysis, since the ERG is an extracellular recording. We feel that this is beyond the scope of this study and may not be necessary as such (see below).

      In the revision we will present equivalent sections of control and dcert1 taken from the nuclear plane of the photoreceptor. This should resolve the reviewer’s concerns.

      The significance of ceramide species levels in dcert1 and GMR>CPESRNAi needs to be explained better. Do certain alterations represent accumulation of ceramides in the ER?

      Species level analysis of changes in ceramides reveals that elevations in dcert1 are seen mainly in the short chain ceramides (14 and 16 carbon chains). These most likely represent the short chain ceramides synthesised in the ER and accumulating due to the block in further metabolism to PE-Cer due to depletion in CPES.

      Species level analysis of changes in ceramides reveals that in dcert1 there is a ceramide transport related defect leading to elevation, primarily, in the short chain ceramides (14 and 16 carbon chains), and this selective supply defect leads to a reduction in PE-Cer levels, with a maximum change in the ratio of short-chain Cer:PE-Cer (Figure 3A-D). Though there is no apparent change in the total ceramide level, the species-specific elevation in the ceramides disturbs the fine balance between the short-chain ceramides and the long and very-long chain ceramides. As long and very-long chain ceramides are implicated in dendrite development and neuronal morphology (doi: 10.1371/journal.pgen.1011880), this alteration in the fine balance between different ceramide species probably impacts the integrity and fluidity of the membrane environment. On the other hand, it raises the possibility of a defined function of the short-chain ceramides in electrical responses to light signalling in the eye, especially with respect to the PE-ceramides that are reduced by around 50%.

      In contrast, GMR>CPESRNAi leads to substrate accumulation, with an increase in ceramides (14, 16, 18, 20 carbon chains) and a decrease in PE-Cer levels (Figure 4D, E). In this case Cer accumulation is due to the block in further metabolism to PE-Cer arising from depletion in CPES.

      We will include this in the discussion of a revised version.

      The suppression by lace is interpreted as evidence that the reduced ERG amplitude in dcert1 is caused by ceramide accumulation in the ER. This interpretation seems preliminary as lace may interact with dcert genetically by other mechanisms.

      The dcert1 mutant exhibits increased levels of short-chain ceramides (Fig 3B), whereas the lace heterozygous mutant (laceK05305/+) displays reduced short-chain ceramide levels (Supp Fig 2B). In the laceK05305/+; dcert1 double mutant, ceramide levels are lower than those observed in the dcert1 mutant alone (Supp Fig 2B), indicating a partial genetic rescue of the elevated ceramide phenotype.

      Furthermore, through multiple independent genetic manipulations that modulate ceramide metabolism (alterations of dcert, cpes and lace), we consistently observe that increased ceramide levels correlate with a reduction in ERG amplitude, suggesting that ceramide accumulation negatively impacts photoreceptor function. Taken together, these observations indicate that the reduction in ceramide levels in the laceK05305/+; dcert1 double mutant likely contributes to the suppression of the ERG defect observed in the dcert1 mutant.

      The authors show that ERG amplitude is reduced in GMR>CPESRNAi. While this phenocopying is consistent with the reduced ERG amplitude in dcert1 being caused by reduced production of PE-ceramide, GMR>CPESRNAi also shows an increase in total ceramide level. Could this support the hypothesis that reduced ERG amplitude is caused by an accumulation of ceramide elsewhere? In addition, is the ERG amplitude reduction in GMR>CPESRNAi sensitive to lace?

      We agree that in addition to reduced PE-Ceramide, the elevated ceramide levels in GMR>CPESi could contribute to the eye phenotype. We will introduce a lace heterozygous mutant in the GMR>CPESi background to test the contribution of elevated ceramide levels in the GMR>CPESi background and incorporate the data in the revision. Thank you for this suggestion.

      Along the same line, while the total ceramide level is significantly reduced in lace heterozygotes, is the PE-ceramide level also reduced? If yes, wouldn't this be contradictory to PE-ceramide production being important for ERG amplitude?

      Mass spec measurements show that levels of PE-Cer were not reduced in laceK05305/+ compared to wild type. This data will be included in the revised manuscript. In addition, the ERG amplitude was not reduced in these flies, nor in those with lace depletion using two independent RNAi lines.

      What is the explanation and significance for the age-dependent deterioration of ERG amplitude in dcert1? Likewise, the significance of no retinal degeneration is not clearly presented.

      There could be multiple reasons for the age-dependent deterioration of the ERG amplitude in the absence of retinal degeneration. These may include instability of the DRMs due to reduced PE-Cer, lower ATP levels due to mitochondrial dysfunction, and perhaps others. The Drosophila phototransduction cascade depends heavily on ATP production, so an age-dependent reduction in ATP synthesis could lead to deterioration in the ERG amplitude. A previous study has shown that ATP production is highly reduced, accompanied by oxidative stress and metabolic dysfunction, in dcert1 flies aged to 10 days and beyond (PMID: 17592126). The same study also found no neuronal degeneration in dcert1, consistent with the absence of photoreceptor degeneration in the present study. We will attempt a few experiments to rule these possibilities in or out and revise the discussion accordingly.

      The rescue of dcert1 phenotype by the expression of human CERT is a nice result. In addition to demonstrating a functional conservation, it allows a determination of CERT protein localization. However, the quality of images in Figure 6D should be improved. The phalloidin staining was rather poor, and the CNX99A in the lower panel was over-exposed, generating bleed-through signals at the rhabdomeres. In addition, the localization of hCERT should be explored further. For instance, does hCERT colocalize with RDGB? Is the hCERT localization altered in lace or GMR>CPESRNAi background?

      As indicated in response to reviewer 1:

      We will perform additional IHC experiments to

      • Co-localize hCERT with an ER-PM MCS marker, e.g. RDGB, in wild type flies.
      • Co-localize hCERT with VAP-A, which is enriched at the ER-PM MCS. This should help to determine if there are MCS and non-MCS pools of hCERT in these cells.
      • Test if there is a pool of hCERT in these cells that also localizes (or not) with the Golgi marker Golgin 84.

      These will be included in the revision to strengthen this important point.

      We will also attempt to perform hCERT localization in lace or GMR>CPESRNAi background

      Minor comments: 1. In Line 128, Df(732) should be Df(3L)BSC732.

      Changes will be incorporated in the main manuscript.

      GMR-SMSrRNAi shows an increase in ERG peak amplitude. Is there an explanation for this?

      GMR-SMSrRNAi did show a slight increase in ERG peak amplitude, but this was not statistically significant.

      Reviewer #2 (Significance (Required)):

      Significance As CERT mutations are implicated in human learning disability, a better understanding of CERT function in neuronal cells is certainly of interest. While the link between ceramide transport and phospholipase signaling is novel and interesting, this paper does not clearly explain the mechanism. In addition, as the ERG were measured long after the retinal cells were deficient in CERT or CPES, it is difficult to assess whether the observed phenotype is a primary defect. Furthermore, the quality of some images needs to be improved. Thus, I feel the manuscript in its current form is too preliminary.

      We thank the reviewer for highlighting the importance and significance of our work in the light of recent studies of CERT function in ID. As with all genetic studies, it is difficult to completely disentangle the role of a gene during development from a role only in the adult. However, we will attempt to use the GAL80ts system to uncouple these two potential components of CERT function in photoreceptors. The goal will be to determine if CERT has a specific role only in adult photoreceptors or if this is coupled to a developmental role. Since ID is a neurodevelopmental disorder, a developmental role for CERT would be equally interesting.

      As previously indicated images will be improved bearing in mind the reviewer comments.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary: Lipid transfer proteins (LTPs) shuttle lipids between organelle membranes at membrane contact sites (MCSs). While extensive biochemical and cell culture studies have elucidated many aspects of LTP function, their in vivo physiological roles are only beginning to be understood. In this manuscript, the authors investigate the physiological role of the ceramide transfer protein (CERT) in Drosophila adult photoreceptors-a model previously employed by this group to study LTP function at ER-PM contact sites under physiological conditions. Using a combination of genetic, biochemical, and physiological approaches, they analyze a protein-null mutant of dcert. They show that loss of dcert causes a reduction in electrical response to light with progressive decrease in electroretinogram (ERG) amplitude with age but no retinal degeneration. Lipidomic analysis shows that while the total levels of ceramides are not changed in dcert mutants, they do observe significant change in certain species of ceramides and depletion of downstream metabolite phosphoethanolamine ceramide (PE-Cer). Using fluorescent biosensors, the authors demonstrate reduced PIP2 levels at the plasma membrane, unchanged basal PI4P levels and slower resynthesis kinetics of both lipids following depletion. Electron microscopy and immunolabeling further reveal a reduced density of ER-PM MCSs and mislocalization of the MCS-resident lipid transfer protein RDGB. Genetic interaction studies with lace and RNAi-mediated knockdown of CPES support the conclusion that both ER ceramide accumulation and PM PE-Cer depletion contribute to the observed defects in dcert mutants. In addition, detergent-resistant membrane fractionation indicates altered plasma membrane organization in the absence of dcert. The study also reports upregulation of unfolded protein response transcripts, including IRE1 and PERK, suggesting increased ER stress. 
      Finally, expression of human CERT rescues the reduced electrical response, demonstrating functional conservation across species. Overall, the manuscript is well written, builds on established work, and the experiments are technically rigorous. The results are clearly presented and provide valuable insights into the physiological role of CERT.

      Major comments: 1.The reduced ERG amplitude appears to be the central phenotype associated with the loss of dcert, and most of the experiments in this manuscript effectively build a mechanistic framework to explain this observation. However, the experiments addressing detergent-resistant membrane domains (DRMs) and the unfolded protein response (UPR) seem somewhat disconnected from the main focus of the study. The DRM and UPR data feel peripheral and could benefit from few experiments for functional linkage to the ERG defect or should be moved to supplementary.

      We agree with the reviewer that further experiments are needed to link the DRM data to the ERG defects. That would need specific biochemical alteration at the PM to modulate PE-Cer species and their effect on scaffolding proteins required for phototransduction (that is beyond the scope of the present study). We will consider moving these to the supplementary section as suggested by the reviewer.

      2.The changes in ceramide species and reduction in PE-Cer are key findings of the study. These results should be further validated by performing a genetic rescue using the BAC or hCERT fly line to confirm that the lipidomic changes are specifically due to loss of CERT function.

      Thank you for this comment. We will include this in the revised manuscript.

      3.Figure 2B-C and 2E-F: Representative images corresponding to the quantified data should be included to illustrate the changes in PIP2 and PI4P reporters. Given that the fluorescence intensity of the PIP2 reporter at the PM is reduced in the dcert mutant relative to control, the authors should also verify that the reporter is expressed at comparable levels across genotypes.

      • As suggested by the reviewer, we will include representative images alongside our quantified data for both the basal measurements and the kinetic study.
      • Western blot of reporters (PH-PLCd::GFP and P4M::GFP) across genotypes will be added to the revised manuscript.

      4.Figure 2J-K: The partial mislocalization of RDGB represents an important observation that could mechanistically explain the reduced resynthesis of PI4P and PIP2 and consequently, the decreased ERG amplitude in dcert mutants. However, this result requires further validation. First, the authors should confirm whether this mislocalization is specific to RDGB by performing co-staining with another ER-PM MCS marker, such as VAP-A, to assess whether overall MCS organization is disrupted. Second, the quantification of RDGB enrichment at ER-PM MCSs should be refined. From the representative images, RDGB appears redistributed toward the photoreceptor cell body, but the presented quantification does not clearly reflect this shift. The authors should therefore include an analysis comparing RDGB levels in the cell body versus the submicrovillar region across genotypes. This analysis should be repeated for similar experiments across the study. Additionally, the total RDGB protein level should be quantified and reported. Finally, since RDGB mislocalization could directly contribute to the decreased ERG amplitude, it would be valuable to test whether overexpression of RDGB in dcert mutants can rescue the ERG phenotype.

      • In our ultrastructural studies (Fig. 2H, 2I and Sup. Fig. 1A, 1B) we did see reduction in PM-SMC MCS that was corroborated with RDGB staining.

      • Comparative ratio analysis of RDGB localisation at ER-PM MCS vs cell body will be included in the manuscript for all RDGB staining.
      • We have done western analysis for total RDGB protein level in ROR and dcert1. This data will be included in the revised manuscript.
      • This is a very interesting suggestion and we will test if RDGB overexpression can rescue ERG phenotype in dcert1.
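
      The comparative ratio analysis mentioned above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the background-subtraction step, the genotype labels and all intensity values are hypothetical placeholders.

```python
from statistics import mean


def enrichment_ratio(mcs_intensity, cell_body_intensity, background=0.0):
    """Ratio of background-subtracted signal at the ER-PM MCS to the cell body."""
    mcs = mcs_intensity - background
    body = cell_body_intensity - background
    if body <= 0:
        raise ValueError("cell body intensity must exceed background")
    return mcs / body


def summarize(measurements, background=0.0):
    """Mean enrichment ratio per genotype.

    measurements maps a genotype label to a list of
    (mcs_intensity, cell_body_intensity) pairs, one pair per photoreceptor.
    """
    summary = {}
    for genotype, pairs in measurements.items():
        ratios = [enrichment_ratio(m, b, background) for m, b in pairs]
        summary[genotype] = mean(ratios)
    return summary


if __name__ == "__main__":
    # Hypothetical mean pixel intensities; not data from the study.
    data = {
        "control": [(40.0, 10.0), (20.0, 10.0)],
        "mutant": [(15.0, 10.0), (5.0, 10.0)],
    }
    for genotype, ratio in summarize(data).items():
        print(f"{genotype}: mean MCS/cell-body ratio = {ratio:.2f}")
```

      A ratio close to 1 would indicate no enrichment at the contact site relative to the cell body, so a shift of RDGB toward the cell body would appear as a lower ratio.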

      5.Figure 3F and I-J: Inclusion of appropriate WT and laceK05305/+ controls is necessary to allow proper interpretation of the results. These controls would strengthen the conclusions regarding the functional relationship between dcert and lace.

      Changes will be incorporated as per the suggestion.

      6.Figure 5C: The representative images shown here appear to contradict the findings described in Figure 2A. In Figure 5C, Rhodopsin 1 levels seem markedly reduced in the dcert mutants, whereas the text states that Rh1 levels are comparable between control and mutant photoreceptors. The authors should replace or reverify the representative images to ensure that they accurately reflect the conclusions presented in the text.

      We will reverify the representative images and changes will be accordingly incorporated.

      7.Figure 6D: The reported localization of hCERT to ER-PM MCSs is a key and potentially insightful observation, as it suggests the subcellular site of dcert activity in photoreceptors. However, the representative images provided are not sufficiently conclusive to support this claim. The authors should validate hCERT localization by co-staining with established markers, such as RDGB for ER-PM MCSs, CNX99A for the ER, and a Golgi marker, since mammalian CERT is classically localized to ER-Golgi interfaces. Optionally, the authors could also quantify the relative distribution of hCERT among these compartments to provide a clearer assessment of its subcellular localization.

      As indicated in response to reviewer 1:

      We will perform additional IHC experiments to

      • Co-localize hCERT with an ER-PM MCS marker, e.g., RDGB, in wild-type flies.
      • Co-localize hCERT with VAP-A, which is enriched at the ER-PM MCS. This should help to determine whether there are MCS and non-MCS pools of hCERT in these cells.
      • Test whether there is a pool of hCERT in these cells that also localizes (or not) with the Golgi marker Golgin 84. These will be included in the revision to strengthen this important point.

      Minor comments: 1. In the first paragraph of the introduction, the authors should consider citing a few key papers from the MCS literature.

      Additional literature will be included as per the suggestion.

      2.Line 132: data not shown is not acceptable. Authors should consider presenting the findings in the supplemental figure.

      Data will be added in supplement as per the suggestion.

      3.The authors should include a comprehensive table or Excel sheet summarizing all statistical analyses. This should include the sample size, type of statistical test used and exact p-values. Providing this information will improve the transparency, reproducibility and overall rigor of the study.

      We will provide all the statistical analyses in the requested format as per the suggestion.

      4.The materials and methods section can be reorganized to include citation for flystocks which do not have stock number or RRIDs if the stocks were previously described but are not available from public repositories. They should expand on the details of various quantification methods used in the study. Finally including a section of Statistical analyses would further enhance transparency and reproducibility

      • Stock details will be added wherever missing as per the suggestion.
      • A statistical analyses section will be included in the Materials and Methods.

      **Referee cross-commenting**

      1.I concur with Reviewer 1 regarding the need for more detailed reporting of statistical analyses.

      We will perform multiple-comparison tests on the data mentioned and incorporate the results in the revised manuscript.

      2.I also agree with Reviewer 3 that the discussion should be expanded to address the age-dependent deterioration of ERG amplitude observed in the dcert mutants. This progressive decline could provide valuable insight into the long-term requirement of CERT function and signaling capacity at the photoreceptor membrane.

      An expanded discussion of the age-dependent decline in ERG amplitude will be incorporated as per the suggestion.

      Reviewer #3 (Significance (Required)):

      This study explores the physiological function of CERT, an LTP localized at MCSs in Drosophila photoreceptors, and uncovers a novel role in regulating plasma membrane PE-Cer levels and GPCR-mediated signaling. These findings significantly advance our understanding of how CERT-mediated lipid transport regulates G-protein coupled phospholipase C signaling in vivo. This work also highlights Drosophila photoreceptors as a powerful system to analyze the physiological significance of lipid-dependent signaling processes. This work will be of interest to researchers in the neuronal cell biology, membrane dynamics and lipid signaling communities. This review is based on my expertise in neuronal cell biology.

      We thank the reviewer for appreciating the significance of our work from a neuroscience perspective.


      3. Description of the revisions that have already been incorporated in the transferred manuscript

      Please insert a point-by-point reply describing the revisions that were already carried out and included in the transferred manuscript. If no revisions have been carried out yet, please leave this section empty.


      4. Description of analyses that authors prefer not to carry out

      Please include a point-by-point response explaining why some of the requested data or additional analyses might not be necessary or cannot be provided within the scope of a revision. This can be due to time or resource limitations or in case of disagreement about the necessity of such additional data given the scope of the study. Please leave empty if not applicable.


      We can address all reviewer points in the revision. However, we will not be able to perform a mosaic analysis of the impact of dcert1 mutant in the retina. We feel this is beyond the scope of this revision. In our response, we have highlighted how controls included in the revision offset the need for a mosaic analysis at this stage.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.

      Thanks for this nice summary of our paper.

      The following points could be addressed in a revision:

      (1) The authors conclude that much of the person-to-person and strain-to-strain variation seems idiosyncratic to individual sera rather than age groups. This point is not yet fully convincing. While the mean titer of an individual may be idiosyncratic to the individual sera, the strain-to-strain variation still reveals some patterns that are consistent across individuals (the authors note the effects of substitutions at sites 145 and 275/276). A more detailed analysis, removing the individual-specific mean titer, could still show shared patterns in groups of individuals that are not necessarily defined by the birth cohort.

      As the reviewer suggests, we normalized the titers for all sera to the geometric mean titer for each individual in the US-based pre-vaccination adults and children. This is only for the 2023-circulating viral strains. We then faceted these normalized titers by the same age groups we used in Figure 6, and the resulting plot is shown. Although there are differences among virus strains (some are better neutralized than others), there are not obvious age group-specific patterns (eg, the trends in the two facets are similar). This observation suggests that at least for these relatively closely related recent H3N2 strains, the strain-to-strain variation does not obviously segregate by age group. Obviously, it is possible (we think likely) that there would be more obvious age-group specific trends if we looked at a larger swath of viral strains covering a longer time range (eg, over decades of influenza evolution). We have added the new plots shown as a Supplemental Figure 6 in the revised manuscript.
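
      The normalization described above can be sketched as follows. This is only an illustration, not the authors' analysis code: the serum names, strain names, and titer values are hypothetical, and the helper `normalize_by_geometric_mean` is introduced here purely for the example. It divides each titer by the geometric mean of that individual's titers, so that two sera with the same strain-to-strain pattern but different overall magnitudes end up with identical normalized profiles.

```python
import math


def normalize_by_geometric_mean(titers):
    """Divide each titer by the geometric mean of that serum's titers."""
    log_mean = sum(math.log(t) for t in titers.values()) / len(titers)
    gmt = math.exp(log_mean)
    return {strain: t / gmt for strain, t in titers.items()}


# Hypothetical sera: same strain-to-strain pattern, 10-fold different magnitude.
sera = {
    "serum_A": {"strain_1": 100.0, "strain_2": 400.0, "strain_3": 200.0},
    "serum_B": {"strain_1": 1000.0, "strain_2": 4000.0, "strain_3": 2000.0},
}

normalized = {name: normalize_by_geometric_mean(t) for name, t in sera.items()}

for name, profile in normalized.items():
    print(name, {strain: round(v, 3) for strain, v in profile.items()})
```

      After this step, any remaining differences between sera reflect strain-to-strain variation rather than each individual's overall titer level.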

      (2) The authors show that the fraction of sera with a titer below 138 correlates strongly with the inferred growth rate using MLR. However, the authors also note that there exists a strong correlation between the MLR growth rate and the number of HA1 mutations. This analysis does not yet show that the titers provide substantially more information about the evolutionary success. The actual relation between the measured titers and fitness is certainly more subtle than suggested by the correlation plot in Figure 5. For example, the clades A/Massachusetts and A/Sydney both have a positive fitness at the beginning of 2023, but A/Massachusetts has substantially higher relative fitness than A/Sydney. The growth inference in Figure 5b does not appear to map that difference, and the antigenic data would give the opposite ranking. Similarly, the clades A/Massachusetts and A/Ontario have both positive relative fitness, as correctly identified by the antigenic ranking, but at quite different times (i.e., in different contexts of competing clades). Other clades, like A/St. Petersburg are assigned high growth and high escape but remain at low frequency throughout. Some mention of these effects not mapped by the analysis may be appropriate.

      Thanks for the nice summary of our findings in Figure 5. However, the reviewer is misreading the growth charts when they say that A/Massachusetts/18/2022 has a substantially higher fitness than A/Sydney/332/2023. Figure 5a (reprinted at left panel) shows the frequency trajectory of different variants over time. While A/Massachusetts/18/2022 reaches a higher frequency than A/Sydney/332/2023, the trajectory is similar and the reason that A/Massachusetts/18/2022 reached a higher max frequency is that it started at a higher frequency at the beginning of 2023. The MLR growth rate estimates differ from the maximum absolute frequency reached: instead, they reflect how rapidly each strain grows relative to others. In fact, A/Massachusetts/18/2022 and A/Sydney/332/2023 have similar growth rates, as shown in Supplemental Figure 6b (reprinted at right). Similarly, A/Saint-Petersburg/RII-166/2023 starts at a low initial frequency but then grows even as A/Massachusetts/18/2022 and A/Sydney/332/2023 are declining, and so has a higher growth rate than both of those. 

      In the revised manuscript, we have clarified how viral growth rates are estimated from frequency trajectories, and how growth rate differs from max frequency in the text below:

      “To estimate the evolutionary success of different human H3N2 influenza strains during 2023, we used multinomial logistic regression, which analyzes strain frequencies over time to calculate strain-specific relative growth rates [51–53]. There were sufficient sequencing counts to reliably estimate growth rates in 2023 for 12 of the HAs for which we measured titers using our sequencing-based neutralization assay libraries (Figure 5a,b and Supplemental Figure 9a,b). Note that these growth rates estimate how rapidly each strain grows relative to the other strains, rather than the absolute highest frequency reached by each strain.”
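
      The distinction between relative growth rate and maximum frequency can be illustrated with a toy calculation. The sketch below is not the evofr model fit; it simply assumes that each strain's frequency is proportional to its starting frequency times exp(rate × time), which is the functional form underlying multinomial logistic growth. The strain labels, starting frequencies, and rates are all hypothetical.

```python
import math


def frequencies(initial, rates, t):
    """Strain frequencies at time t when each strain grows as f0 * exp(rate * t)."""
    weights = {s: initial[s] * math.exp(rates[s] * t) for s in initial}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical strains: A and B share the same relative growth rate,
# but A starts at a 10-fold higher frequency than B.
initial = {"A": 0.20, "B": 0.02, "other": 0.78}
rates = {"A": 0.02, "B": 0.02, "other": 0.0}  # per day, relative to "other"

trajectory = [frequencies(initial, rates, t) for t in range(0, 181)]
max_freq = {s: max(f[s] for f in trajectory) for s in initial}

print({s: round(v, 3) for s, v in max_freq.items()})
```

      In this toy setting strain A reaches a much higher maximum frequency than strain B even though both have identical growth rates, because A simply started from a higher frequency.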

      (3) For the protection profile against the vaccine strains, the authors find for the adult cohort that the highest titer is always against the oldest vaccine strain tested, which is A/Texas/50/2012. However, the adult sera do not show an increase in titer towards older strains, but only a peak at A/Texas. Therefore, it could be that this is a virus-specific effect, rather than a property of the protection profile. Could the authors test with one older vaccine virus (A/Perth/16/2009?) whether this really can be a general property?

      We are interested in studying immune imprinting more thoroughly using sequencing-based neutralization assays, but we note that the adults in the cohorts we studied would have been imprinted with much older strains than included in this library. As this paper focuses on the relative fitness of contemporary strains with minor secondary points regarding imprinting, these experiments are beyond the scope of this study. We’re excited for future work (from our group or others) to explore these points by making a new virus library with strains from multiple decades of influenza evolution. 

      Reviewer #2 (Public review):

      This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, which will be relevant across pathogens (assuming the assay can be appropriately adapted). I only have a few comments, focused on maximising the information provided by the sera.

      Thanks very much!

      Firstly, one of the major findings is that there is wide heterogeneity in responses across individuals. However, we could expect that individuals' responses should be at least correlated across the viruses considered, especially when individuals are of a similar age. It would be interesting to quantify the correlation in responses as a function of the difference in ages between pairs of individuals. I am also left wondering what the potential drivers of the differences in responses are, with age being presumably key. It would be interesting to explore individual factors associated with responses to specific viruses (beyond simply comparing adults versus children).

      We thank the reviewer for this interesting idea. We performed this analysis (and the related analyses described) and added this as a new Supplemental Figure 7, which is pasted after the response to the next related comment by the reviewer. 

      For 2023-circulating strains, we observed basically no correlation between the strength of correlation between pairs of sera and the difference in age between those pairs of sera (Supplemental Figure 7), which was unsurprising given the high degree of heterogeneity between individual sera (Figure 3, Supplemental Figure 6, and Supplemental Figure 8). For vaccine strains, there is a moderate negative correlation only in the children, but not in the adults or the combined group of adults and children. This could be because the children are younger with limited and potentially more similar vaccine and exposure histories than the adults. It could also be because the children are overall closer in age than the adults.
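
      One way such a pairwise analysis could be set up is sketched below. This is an assumption-laden illustration rather than the analysis behind Supplemental Figure 7: the ages and titers are invented, the use of Pearson correlation on log2 titers is a choice made for the example, and the `pearson` helper is defined here only so that the block is self-contained.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Hypothetical sera: (age in years, titers against the same four strains).
sera = {
    "s1": (3, [120, 340, 80, 510]),
    "s2": (7, [200, 150, 95, 400]),
    "s3": (35, [60, 420, 300, 310]),
    "s4": (52, [500, 130, 220, 90]),
}

pairs = []
for (_, (age_a, t_a)), (_, (age_b, t_b)) in combinations(sera.items(), 2):
    r = pearson([math.log2(t) for t in t_a], [math.log2(t) for t in t_b])
    pairs.append((abs(age_a - age_b), r))

age_gaps = [gap for gap, _ in pairs]
titer_corrs = [r for _, r in pairs]

# Correlation between "how similar two sera are" and "how far apart in age they are".
overall = pearson(age_gaps, titer_corrs)
print(round(overall, 3))
```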

      Relatedly, is the phylogenetic distance between pairs of viruses associated with similarity in responses?

      For 2023-circulating strains, across sera cohorts we observed a weak-to-moderate correlation between the strength of correlation between the neutralizing titers across all sera to pairs of viruses and the Hamming distances between virus pairs. For the same comparison with vaccine strains, we observed moderate correlations, but this must be caveated with the slightly larger range of Hamming distances between vaccine strains. Notably, many of the points on the negative correlation slope are a mix of egg- and cell-produced vaccine strains from similar years, but there are some strain comparisons where the same year’s egg- and cell-produced vaccine strains correlate poorly.

      Figure 5C is also a really interesting result. To be able to predict growth rates based on titers in the sera is fascinating. As touched upon in the discussion, I suspect it is really dependent on the representativeness of the sera of the population (so, e.g., if only elderly individuals provided sera, it would be a different result than if only children provided samples). It may be interesting to compare different hypotheses - so e.g., see if a population-weighted titer is even better correlated with fitness - so the contribution from each individual's titer is linked to a number of individuals of that age in the population. Alternatively, maybe only the titers in younger individuals are most relevant to fitness, etc.

      We’re very interested in these analyses, but suggest they may be better explored in subsequent works that could sample more children, teenagers and adults across age groups. Our sera set, as the reviewer suggests, may be under-powered to perform the proposed analysis on subsetted age groups of our larger age cohorts. 

      In Figure 6, the authors lump together individuals within 10-year age categories - however, this is potentially throwing away the nuances of what is happening at individual ages, especially for the children, where the measured viruses cross different groups. I realise the numbers are small and the viruses only come from a small numbers of years, however, it may be preferable to order all the individuals by age (y-axis) and the viral responses in ascending order (x-axis) and plot the response as a heatmap. As currently plotted, it is difficult to compare across panels

      This is a good suggestion. In the revised manuscript we have included a heatmap of the children and pre-vaccination adults, ordered by the year of birth of each individual, as Supplemental Figure 8. That new figure is also pasted in this response.

      Reviewer #3 (Public review):

      The authors use high-throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. However, there are some areas where I thought the work could be more strongly motivated and linked together. In particular, how the vaccine responses in US and Australia in Figures 6-7 relate to the earlier analysis around growth rates, and what we would expect the relationship between growth rate and population immunity to be based on epidemic theory.

      Thank you for this nice summary. This reviewer also notes that the text related to Figures 6 and 7 is secondary to the main story presented in Figures 3-5. The main motivation for including Figures 6 and 7 was to demonstrate the wide-ranging applications of sequencing-based neutralization data. We have tried to clarify this with the following minor text revisions, which do not add new content but, we hope, smooth the transition between results sections.

      While the preceding analyses demonstrated the utility of sequencing-based neutralization assays for measuring titers of currently circulating strains, our library also included viruses with HAs from each of the H3N2 influenza Northern Hemisphere vaccine strains from the last decade (2014 to 2024, see Supplemental Table 1). These historical vaccine strains cover a much wider span of evolutionary diversity than the 2023-circulating strains analyzed in the preceding sections (Figure 2a,b and Supplemental Figure 2b-e). For this analysis, we focused on the cell-passaged strains for each vaccine, as these are more antigenically similar to their contemporary circulating strains than the egg-passaged vaccine strains since they lack the mutations that arise during growth of viruses in eggs [55–57] (Supplemental Table 1).

      Our sequencing-based assay could also be used to assess the impact of vaccination on neutralization titers against the full set of strains in our H3N2 library. To do this, we analyzed matched 28-day post-vaccination samples for each of the above-described 39 pre-vaccination samples from the cohort of adults based in the USA (Table 1). We also analyzed a smaller set of matched pre- and post-vaccination sera samples from a cohort of eight adults based in Australia (Table 1). Note that there are several differences between these cohorts: the USA-based cohort received the 2023-2024 Northern Hemisphere egg-grown vaccine whereas the Australia-based cohort received the 2024 Southern Hemisphere cell-grown vaccine, and most individuals in the USA-based cohort had also been vaccinated in the prior season whereas most individuals in the Australia-based cohort had not. Therefore, multiple factors could contribute to observed differences in vaccine response between the cohorts.

      Reviewer #3 (Recommendations for the authors):

      Main comments:

      (1) The authors compare titres of the pooled sera with the median titres across individual sera, finding a weak correlation (Figure 4). I was therefore interested in the finding that geometric mean titre and median across a study population are well correlated with growth rate (Supplemental Figure 6c). It would be useful to have some more discussion on why estimates from a pool are so much worse than pooled estimates.

      We thank this reviewer for this point. We would clarify that pooling sera is the equivalent of taking the arithmetic mean of the individual sera, rather than the geometric mean or median, which tends to bias the measurements of the pool to the outliers within the pool. To address this reviewer’s point, we’ve added the following text to the manuscript:

      “To confirm that sera pools are not reflective of the full heterogeneity of their constituent sera, we created equal volume pools of the children and adult sera and measured the titers of these pools using the sequencing-based neutralization assay. As expected, neutralization titers of the pooled sera were always higher than the median across the individual constituent sera, and the pool titers against different viral strains were only modestly correlated with the median titers across individual sera (Figure 4). The differences in titers across strains were also compressed in the serum pools relative to the median across individual sera (Figure 4). The failure of the serum pools to capture the median titers of all the individual sera is especially dramatic for the children sera (Figure 4) because these sera are so heterogeneous in their individual titers (Figure 3b). Taken together, these results show that serum pools do not fully represent individual-level heterogeneity, and are similar to taking the arithmetic mean of the titers for a pool of individuals, which tends to be biased by the highest titer sera”.
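
      The point about averaging can be made concrete with a small numerical example. Taking as given the statement above that a pool behaves like the arithmetic mean of its constituent sera, the sketch below compares the arithmetic mean, geometric mean, and median for a hypothetical set of titers containing one high-titer serum. The titer values are invented for illustration.

```python
import statistics

# Hypothetical titers for five sera against one strain; one serum is a high outlier.
titers = [40, 55, 60, 80, 2500]

arithmetic_mean = statistics.mean(titers)
geometric_mean = statistics.geometric_mean(titers)
median = statistics.median(titers)

print("arithmetic mean:", arithmetic_mean)
print("geometric mean: ", round(geometric_mean, 1))
print("median:         ", median)
```

      With these numbers the arithmetic mean is several-fold higher than both the geometric mean and the median, because it is dominated by the single high-titer serum.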

      (2) Perhaps I missed it, but are growth rates weekly growth rates? (I assume so?)

      The growth rates are relative exponential growth rates calculated assuming a serial interval of 3.6 days. We also added clarifying language and a citation for the serial interval to the methods section:

      The analysis performing H3 HA strain growth rate estimates using the evofr[51] package is at https://github.com/jbloomlab/flu_H3_2023_seqneut_vs_growth. Briefly, we sought to make growth rate estimates for the strains in 2023 since this was the same timeframe when the sera were collected. To achieve this, we downloaded all publicly-available H3N2 sequences from the GISAID[88] EpiFlu database, filtering to only those sequences that closely matched a library HA1 sequence (within one HA1 amino-acid mutation) and were collected between January 2023 and December 2023. If a sequence was within one HA1 amino-acid mutation of multiple library HA1 proteins then it was assigned to the closest one; if there were multiple equally close matches then it was assigned fractionally to each match. We only made growth rate estimates for library strains with at least 80 sequencing counts (Supplemental Figure 9a), and ignored counts for sequences that did not match a library strain (equivalent results were obtained if we instead fit a growth rate for these sequences as an “other” category). We then fit multinomial logistic regression models using the evofr[51] package assuming a serial interval of 3.6 days[101]  to the strain counts. For the plot in Figure 5a the frequencies are averaged over a 14-day sliding window for visual clarity, but the fits were to the raw sequencing counts. For most of the analyses in this paper we used models based on requiring 80 sequencing counts to make an estimate for strain growth rates, and counting a sequence as a match if it was within one amino-acid mutation; see https://jbloomlab.github.io/flu_H3_2023_seqneut_vs_growth/ for comparable analyses using different reasonable sequence count cutoffs (e.g., 60, 50, 40 and 30, as depicted in Supplemental Figure 9).  
      Across sequence cutoffs, we found that the fraction of individuals with low neutralization titers and number of HA1 mutations correlated strongly with these MLR-estimated strain growth rates.
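
      The sequence-matching rule described above (assign to the closest library strain within one HA1 amino-acid mutation, split equally among ties, ignore sequences with no match) can be sketched as follows. This is a simplified illustration and not the code in the linked repository: it assumes pre-aligned, equal-length amino-acid strings, and the function names, strain names, and four-residue toy sequences are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_counts(sequences, library, max_dist=1):
    """Assign each sequence to its closest library strain(s) within max_dist."""
    counts = defaultdict(float)
    for seq in sequences:
        dists = {name: hamming(seq, ref) for name, ref in library.items()}
        best = min(dists.values())
        if best > max_dist:
            continue  # no library strain within the cutoff: ignore this sequence
        closest = [name for name, d in dists.items() if d == best]
        for name in closest:
            counts[name] += 1.0 / len(closest)  # split ties fractionally
    return dict(counts)


# Toy library and observed sequences (hypothetical, four residues each).
library = {"strain_X": "KTNG", "strain_Y": "KTDG", "strain_Z": "RSDA"}
observed = ["KTNG", "KTEG", "RSDA", "QQQQ"]

counts = assign_counts(observed, library)
print(counts)
```

      Here the second observed sequence is one mutation away from both strain_X and strain_Y, so it contributes half a count to each, and the last sequence matches nothing and is dropped.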

      (3)  I found Figure 3 useful in that it presents phylogenetic structure alongside titres, to make it clearer why certain clusters of strains have a lower response. In contrast, I found it harder to meaningfully interpret Figure 7a beyond the conclusion that vaccines lead to a fairly uniform rise in titre. Do the 275 or 276 mutations that seem important for adults in Figure 3 have any impact?

      We are certainly interested in the questions this reviewer raises, and in trying to understand how well a seasonal vaccine protects against the most successful influenza variants that season. However, these post-vaccination sera were taken when neutralizing titers peak ~30 days after vaccination. Because of this, in the larger cohort of US-based post-vaccination adults, the median titers across sera to most strains appear uniformly high. In the Australian-based post-vaccination adults, there was some strain-to-strain variation in median titers across sera, but of course this must be caveated with the much smaller sample size. It might be more relevant to answer this question with longitudinally sampled sera, when titers begin to wane in the following months.

      (4)  It could be useful to define a mechanistic relationship about how you would expect susceptibility (e.g. fraction with titre < X, where X is a good correlate) to relate to growth via the reproduction number: R = R0 x S. For example, under the assumption the generation interval G is the same for all, we have R = exp(r*G), which would make it possible to make a prediction about how much we would expect the growth rate to change between S = 0.45 and 0.6, as in Fig 5c. This sort of brief calculation (or at least some discussion) could add some more theoretical underpinning to the analysis, and help others build on the work in settings with different fractions with low titres. It would also provide some intuition into whether we would expect relationships to be linear.

      This is an interesting idea for future work! However, the scope of our current study is to provide these experimental data and show a correlation with growth; we hope this can be used to build more mechanistic models in future.
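
      For readers who want to see the size of the effect implied by the reviewer's back-of-envelope relationship, a minimal sketch follows. It is not part of the study's analysis. It assumes R = R0 × S and R = exp(r × G) exactly as written in the comment, takes G equal to the 3.6-day serial interval quoted elsewhere in this response (an assumption), and uses a hypothetical R0 that cancels out of the difference.

```python
import math


def growth_rate(r0, susceptible_fraction, generation_interval):
    """Exponential growth rate r implied by R = R0 * S and R = exp(r * G)."""
    return math.log(r0 * susceptible_fraction) / generation_interval


G = 3.6   # days; assumed equal to the serial interval quoted above
R0 = 2.0  # hypothetical value; it cancels in the difference below

# Expected change in growth rate when the susceptible fraction rises from 0.45 to 0.60.
delta_r = growth_rate(R0, 0.60, G) - growth_rate(R0, 0.45, G)
print(round(delta_r, 4), "per day")
```

      Under these assumptions the difference depends only on ln(S2/S1)/G, so the predicted relationship between growth rate and susceptible fraction is logarithmic rather than linear.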

      (5) A key conclusion from the analysis is that the fraction above a threshold of ~140 is particularly informative for growth rate prediction, so would it be worth including this in Figure 6-7 to give a clearer indication of how much vaccination reduces contribution to strain growth among those who are vaccinated? This could also help link these figures more clearly with the main analysis and question.

      Although our data do find ~140 to be the threshold that gives max correlation with growth rate, we are not comfortable strongly concluding 140 is a correlate of protection, as titers could influence viral fitness without completely protecting against infection. In addition, inspection of Figure 5d shows that while ~140 does give the maximal correlation, a good correlation is observed for most cutoffs in the range from ~40 to 200, so we are not sure how robustly we can be sure that ~140 is the optimal threshold.

      (6)  In Figure 5, the caption doesn't seem to include a description for (e).

      Thank you to the reviewer for catching this – this is fixed now.

      (7)  The US vs Australia comparison could have benefited from more motivation. The authors conclude, "Due to the multiple differences between cohorts we are unable to confidently ascribe a cause to these differences in magnitude of vaccine response" - given the small sample sizes, what hypotheses could have been tested with these data? The comparison isn't covered in the Discussion, so it seems a bit tangential currently.

      Thank you to the reviewer for this comment, but we should clarify our aim was not to directly compare US and Australian adults. We are interested in regional comparisons between serum cohorts, but did not have the numbers to adequately address those questions here. This section (and the preceding question) were indeed both intended to be tangential to the main finding, and hopefully this will be clarified with our text additions in response to Reviewer #3’s public reviews.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This study presents an exploration of PPGL tumour bulk transcriptomics and identifies three clusters of samples (labeled as subtypes C1-C3). Each subtype is then investigated for the presence of somatic mutations, metabolism-associated pathways and inflammation correlates, and disease progression. The proposed subtype descriptions are presented as an exploratory study. The proposed potential biomarkers from this subtype are suitably caveated and will require further validation in PPGL cohorts together with a mechanistic study.  

      The first section uses WGCNA (a method to identify clusters of samples based on gene expression correlations) to discover three transcriptome-based clusters of PPGL tumours. The second section inspects a previously published snRNAseq dataset, and labels some of the published cells as subtypes C1, C2, C3 (Methods could be clarified here), among other cells labelled as immune cell types. Further details about how the previously reported single-nuclei were assigned to the newly described subtypes C1-C3 require clarification.

      Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).
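
      The pseudo-bulk steps named above (sum counts per sample, normalize for library size, log1p-transform, z-scale across samples) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the manuscript: the sample and gene names are hypothetical, the scaling factor of 10,000 is an assumed convention, and the subsequent gene-set scoring and subtype assignment are not shown.

```python
import math
from collections import defaultdict


def pseudobulk(nuclei):
    """Sum per-nucleus gene counts into one count profile per sample."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample, counts in nuclei:
        for gene, n in counts.items():
            totals[sample][gene] += n
    return totals


def normalize(totals, genes, scale=1e4):
    """Library-size normalize, log1p-transform, then z-scale each gene across samples."""
    logged = {}
    for sample, counts in totals.items():
        library_size = sum(counts.values())
        logged[sample] = {
            g: math.log1p(counts.get(g, 0) / library_size * scale) for g in genes
        }
    zscores = {sample: {} for sample in logged}
    for g in genes:
        values = [logged[s][g] for s in logged]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        for s in logged:
            zscores[s][g] = (logged[s][g] - mean) / sd if sd > 0 else 0.0
    return zscores


# Hypothetical nuclei: (sample ID, gene counts for one nucleus).
nuclei = [
    ("T1", {"GENE_A": 5, "GENE_B": 1}),
    ("T1", {"GENE_A": 3, "GENE_B": 0}),
    ("T2", {"GENE_A": 1, "GENE_B": 6}),
    ("T3", {"GENE_A": 2, "GENE_B": 2}),
]

totals = pseudobulk(nuclei)
zscores = normalize(totals, ["GENE_A", "GENE_B"])
print({s: {g: round(v, 2) for g, v in z.items()} for s, z in zscores.items()})
```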

      The tumour samples are obtained from multiple locations in the body (Figure 1A). It will be important to see further investigation of how the sample origin is distributed among the C1-C3 clusters, and whether there is a sample-origin association with mutational drivers and disease progression.

      Thank you for your valuable suggestion. In the revised manuscript (lines 74-79), Figure 1A, Table S1 and Supplementary Figure 1A, we harmonized anatomic site annotations from our PPGL cohort and the TCGA cohort and analyzed the distribution of tumor origin (adrenal vs extra-adrenal) across subtypes. The site composition is essentially uniform across C1-C3, at approximately 75% pheochromocytoma (PC) and 25% paraganglioma (PG), with only minimal variation. Notably, the proportion of extra-adrenal (paraganglioma) origin is slightly higher in the C1 subtype (see Supplementary Figure 1A), consistent with the typically more aggressive behavior of tumors from this anatomical site.

      Reviewer #2 (Public Review):

      A study that furthers the molecular definition of PPGL (where prognosis is variable) and provides a wide range of sub-experiments to back up the findings. One of the key premises of the study is that identification of driver mutations in PPGL is incomplete and that compromises characterisation for prognostic purposes. This is a reasonable starting point on which to base some characterisation based on different methods. The cohort is a reasonable size, and a useful validation cohort in the form of TCGA is used. Whilst it would be resource-intensive (though plausible given the rarity of the tumour type) to perform RNA-seq on all PPGL samples in clinical practice, some potential proxies are proposed.

      We sincerely thank the reviewer for their positive assessment of our study’s rationale. We fully agree that RNA sequencing for all PPGL samples remains resource-intensive in current clinical practice, and its widespread application still faces feasibility challenges. It is precisely for this reason that, after defining transcriptional subtypes, we further focused on identifying and validating practical molecular markers and exploring their detectability at the protein level.

      In this study, we validated key markers such as ANGPT2, PCSK1N, and GPX3 using immunohistochemistry (IHC), demonstrating their ability to effectively distinguish among molecular subtypes (see Figure. 5). This provides a potential tool for the clinical translation of transcriptional subtyping, similar to the transcription factor-based subtyping in small cell lung cancer where IHC enables low-cost and rapid molecular classification.

      It should be noted that the subtyping performance of these markers has so far been preliminarily validated only in our internal cohort of 87 PPGL samples. We agree with the reviewer that larger-scale, multi-center prospective studies are needed in the future to further establish the reliability and prognostic value of these markers in clinical practice.

      The performance of some of the proxy markers for transcriptional subtype is not presented.

      We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping. In our study, we have in fact taken this point into full consideration. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure. 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.

      Ultimately, we identified ANGPT2, PCSK1N, and GPX3—each significantly overexpressed in subtypes C1, C2, and C3, respectively, and exhibiting the most pronounced β values—as robust marker genes for these subtypes (Figure. 5A and Supplementary Figure. 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.
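
      As an illustration of how a per-gene effect value (β) can be obtained from a linear regression, the sketch below regresses expression on a one-vs-rest subtype indicator. The one-vs-rest coding, the function name, and the toy data are assumptions; the response does not specify the exact model design. With a single binary indicator, β equals the difference in mean expression between the subtype and all other samples.

```python
import numpy as np

def subtype_beta(expr, is_subtype):
    """expr: samples x genes expression; is_subtype: 0/1 indicator per sample.

    Returns one slope (beta) per gene from an ordinary least-squares fit
    of expression ~ intercept + indicator. Hypothetical helper.
    """
    x = np.asarray(is_subtype, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, np.asarray(expr, dtype=float), rcond=None)
    return coef[1]  # row 0 is the intercept, row 1 the subtype effect
```

      Ranking genes by this β (together with its p-value, not computed here) is one way to shortlist candidate markers per subtype.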

      There is limited prognostic information available.

      Thank you for your valuable suggestion. In this exploratory revision, we present the available prognostic signal in Figure. 5C. Given the current event numbers and follow-up time, we intentionally limited inference. We are continuing longitudinal follow-up of the PPGL cohort and will periodically update and report mature time-to-event analyses in subsequent work.

      Reviewer #1 (Recommendations for the authors):

      There is no deposition reference for the RNAseq transcriptomics data. Have the data been deposited in a suitable data repository?

      Thank you for your valuable suggestion. We have updated the Data availability section (lines 508–511) to clarify that the bulk-tissue RNA-seq datasets generated in this study are available from the corresponding author upon reasonable request.

      In the snRNAseq analysis of existing published data, clarify how cells were labelled as "C1", "C2", "C3", alongside cells labelled by cell type (the latter is described briefly in the Methods).

      Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure. 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure. 4A).
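
      One simple way to turn subtype scores into C1/C2/C3 labels is sketched below. Both the scoring rule (mean z-score over a gene set) and the decision rule (highest score wins) are assumptions made for illustration, since the response does not state them; the gene names in the usage example are placeholders.

```python
import numpy as np

def assign_subtype(z, gene_names, gene_sets):
    """z: units x genes z-scored matrix (samples or nuclei).

    gene_sets: dict mapping subtype label -> list of gene names.
    Hypothetical helper; each gene set must share at least one gene with z.
    """
    index = {g: i for i, g in enumerate(gene_names)}
    labels = sorted(gene_sets)
    # Score = mean z-score over the gene-set members present in the matrix.
    scores = np.column_stack([
        z[:, [index[g] for g in gene_sets[k] if g in index]].mean(axis=1)
        for k in labels
    ])
    # Assumed decision rule: assign the subtype with the highest score.
    return [labels[j] for j in scores.argmax(axis=1)], scores
```

      For example, `assign_subtype(z, ["g1", "g2", "g3"], {"C1": ["g1"], "C2": ["g2"], "C3": ["g3"]})` labels each row of `z` by its dominant gene set.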

      Package versions should be included (e.g., CellChat, monocle2).

      We greatly appreciate your comments and have now added a dedicated “Software and versions” subsection in Methods. Specifically, we report Seurat (v4.4.0), sctransform (v0.4.2), CellChat (v2.2.0), monocle (v2.36.0; monocle2), pheatmap (v1.0.13), clusterProfiler (v4.16.0), survival (v3.8.3), and ggplot2 (v3.5.2) (lines 514-516). We also corrected a typographical error (“mafools” → “maftools”) (line 463).

      Reviewer #2 (Recommendations for the authors):

      It would be helpful to provide a little more detail on the clinical composition of the cohort (e.g., phaeo vs paraganglioma, age, etc.) in the text, acknowledging that this is done in Figure 1.

      Thank you for your valuable suggestion. In the revision, we added Table S1 that provides a detailed summary of the clinical composition of the PPGL cohort. Specifically, we report the numbers and proportions (Supplementary Figure. 1A) of pheochromocytoma (PC) versus paraganglioma (PG), further subclassifying PG into head and neck (HN-PG), retroperitoneal (RPPG), and bladder (BC-PG).

      How many of each transcriptional subtype had driver mutations (germline or somatic)? This is included in the figures but would be worth mentioning in the text. Presumably, some of these may be present but not detected (e.g., non-coding variants), and this should be commented on. It is feasible that if methods to detect all the relevant genomic markers were improved, then the rate of tumours without driver mutations would be less and their prognostic utility would be more comprehensive.

      Thank you for your valuable suggestion. In the revision (lines 113–116), we now report the prevalence of driver mutations (germline or somatic) overall and by transcriptional subtype. We analyzed variant data across 84 PPGL-relevant genes from 179 tumors in the TCGA cohort and 30 tumors in Magnus’s cohort (Fig. 2A; Table S2). High-frequency genes were consistent with known biology—C1 enriched for [e.g., VHL/SDHB], C2 for [e.g., RET/HRAS], and C3 for [e.g., SDHA/SDHD]. We also note that a subset of tumors lacked an identifiable driver, which likely reflects current assay limitations (e.g., non-coding or structural variants, subclonality, and purity effects). Broader genomic profiling (deep WGS/long-read, RNA fusion, methylation) would be expected to reduce the “driver-negative” fraction and further enhance the prognostic utility of these classifiers.

      ANGPT2 provides a reasonable predictive capacity for the C1 subtype as defined by the ROC AUC. What was the performance of the PCSK1N and GPX3 as markers of the other subtypes?

      We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping, and we have supplemented the analysis with ROC curves and AUC values for the two additional markers (Author response image 1, see below). Furthermore, in our study, we have in fact taken this point into full consideration. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure. 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.

      Ultimately, we identified ANGPT2, PCSK1N, and GPX3—each significantly overexpressed in subtypes C1, C2, and C3, respectively, and exhibiting the most pronounced β values—as robust marker genes for these subtypes (Figure. 5A and Supplementary Figure. 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.

      Author response image 1.

      Extended Data Figure A-B. (A) The ROC curve illustrates the diagnostic ability to distinguish PCSK1N expression in PPGLs, specifically differentiating subtype C2 from non-C2 subtypes. The red dot indicates the point with the highest sensitivity (93.1%) and specificity (82.8%). AUC, the area under the curve. (B) The ROC curve illustrates the diagnostic ability to distinguish GPX3 expression in PPGLs, specifically differentiating subtype C3 from non-C3 subtypes. The red dot indicates the point with the highest sensitivity (83.0%) and specificity (58.8%). AUC, the area under the curve.
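
      For readers less familiar with ROC analysis, the sketch below computes an AUC and an "optimal" cutoff from marker scores and binary subtype labels. Choosing the cutoff by Youden's J (sensitivity + specificity − 1) is an assumption about how the highlighted point was selected; the legend does not name the criterion. The function names and inputs are hypothetical.

```python
import numpy as np

def roc_points(scores, labels):
    """Sensitivity and specificity at every observed score threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)[::-1]  # from strictest to most lenient
    sens = np.array([(scores[labels] >= t).mean() for t in thresholds])
    spec = np.array([(scores[~labels] < t).mean() for t in thresholds])
    return thresholds, sens, spec

def auc_and_best_cutoff(scores, labels):
    thr, sens, spec = roc_points(scores, labels)
    # Add the (0, 0) and (1, 1) end points, then apply the trapezoid rule.
    fpr = np.concatenate([[0.0], 1 - spec, [1.0]])
    tpr = np.concatenate([[0.0], sens, [1.0]])
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    # Assumed criterion: maximize Youden's J = sensitivity + specificity - 1.
    best = int(np.argmax(sens + spec - 1))
    return auc, thr[best], sens[best], spec[best]
```

      Here a sample is called positive when its score is at or above the threshold, which fits markers that are overexpressed in the target subtype.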

      In the discussion, I think it would be valuable to summarise existing clinical/molecular predictors in PPGL and, acknowledging that their performance may be limited, compare them to the potential of these novel classifiers.

      Thank you for your valuable suggestion. We have added a concise overview of established clinical and molecular predictors in PPGL and compared them with the potential of our transcriptional classifiers. The new paragraph (Discussion, lines 315–338) now reads:

      “Compared to existing clinical and molecular predictors, risk assessment in PPGL has long relied on the following indicators: clinicopathological features (e.g., tumor size, non-adrenal origin, specific secretory phenotype, Ki-67 index), histopathological scoring systems (such as PASS/GAPP), and certain genetic alterations (including high-risk markers like SDHB inactivation mutations, as well as susceptibility gene mutations in ATRX, TERT promoter, MAML3, VHL, NF1, among others). Although these metrics are highly actionable in clinical practice, they exhibit several limitations: first, current molecular markers only cover a subset of patients, and technical constraints hinder the detection of many potentially significant variants (e.g., non-coding mutations), thereby compromising the comprehensiveness of prognostic evaluation; second, histopathological scoring is susceptible to interobserver variability; furthermore, the lack of standardized detection and evaluation protocols across institutions limits the comparability and generalizability of results. Our transcriptomic classification system—comprising C1 (pseudohypoxic/angiogenic signature), C2 (kinase-signaling signature), and C3 (SDHx-related signature)—provides a complementary approach to PPGL risk assessment. These subtypes reflect distinct biological backgrounds tied to specific genetic alterations and can be approximated by measuring the expression of individual genes (e.g., ANGPT2, PCSK1N, or GPX3). This study demonstrates that the classifier offers three major advantages: first, it accurately distinguishes subtypes with coherent biological features; second, it retains significant predictive value even after adjusting for clinical covariates; third, it can be implemented using readily available assays such as immunohistochemistry. 
      These findings suggest that integrating transcriptomic subtyping with conventional clinical markers may offer a more comprehensive and generalizable risk stratification framework. However, this strategy would require validation through multi-center prospective studies and standardization of detection protocols.”

      A little more explanation of the principles behind WGCNA would be useful in the methods.

      We are grateful for your comments. We have expanded the Methods to briefly explain the principles of WGCNA (lines 426-454). In short, WGCNA constructs a weighted co-expression network from normalized gene expression, identifies modules of tightly co-expressed genes, summarizes each module by its eigengene (the first principal component), and then correlates module eigengenes with phenotypes (e.g., transcriptional subtypes) to highlight biologically meaningful gene sets and candidate hub genes. We now specify our preprocessing, choice of soft-thresholding power to approximate scale-free topology, module detection/merging criteria, and the statistics used for module–trait association and downstream gene-set scoring.
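
      Two of the WGCNA ingredients mentioned above, the soft-thresholded adjacency and the module eigengene, can be sketched in a few lines of NumPy. This is not the WGCNA R package and omits topological overlap, clustering, and module merging; the unsigned network type and `power=6` are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def soft_threshold_adjacency(expr, power=6):
    """expr: samples x genes. Unsigned adjacency a_ij = |cor(x_i, x_j)| ** power.

    `power` is an assumed value; in practice it is chosen so that the
    network approximates scale-free topology.
    """
    corr = np.corrcoef(expr, rowvar=False)
    adj = np.abs(corr) ** power
    np.fill_diagonal(adj, 0.0)
    return adj

def module_eigengene(expr, module_cols):
    """First principal component of the standardized module expression."""
    sub = np.asarray(expr, dtype=float)[:, module_cols]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(sub, full_matrices=False)
    eig = u[:, 0]
    # The sign of a principal component is arbitrary; align it with the
    # module's mean expression so that higher values mean higher expression.
    if np.corrcoef(eig, sub.mean(axis=1))[0, 1] < 0:
        eig = -eig
    return eig
```

      Correlating the returned eigengene with a trait vector (for example a subtype indicator) gives the module–trait association described in the response.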

      On line 234, I think the figure should be 5C?

      We greatly appreciate your comment and have corrected this to Figure 5C.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Weakness:

      I wonder how task difficulty and linguistic labels interact with the current findings. Based on the behavioral data, shapes with more geometric regularities are easier to detect when surrounded by other shapes. Do shape labels that are readily available (e.g., "square") help in making accurate and speedy decisions? Can the sensitivity to geometric regularity in intraparietal and inferior temporal regions be attributed to differences in task difficulty? Similarly, are the MEG oddball detection effects that are modulated by geometric regularity also affected by task difficulty?

      We see two aspects to the reviewer’s remarks.

      (1) Names for shapes.

      On the one hand, there is the question of whether it matters, in our task, that certain shapes have names and others do not. The work presented here is not designed to specifically test the effect of formal western education; however, in previous work (Sablé-Meyer et al., 2021), we noted that the geometric regularity effect remains present even for shapes that do not have specific names, and even in participants who do not have names for them. Indeed, we replicated our main effects with both preschoolers and adults that did not attend formal western education and found that our geometric feature model remained predictive of their behavior; we refer the reader to this previous paper for an extensive discussion of the possible role of linguistic labels, and the impact of the statistics of the environment on task performance.

      What is more, in our behavior experiments we can discard data from any shape that has a name in English and run our model comparison again. Doing so diminished the effect size of the geometric feature model, but it remained predictive of human behavior: indeed, if we removed all shapes but kite, rightKite, rustedHinge, hinge and random (i.e., more than half of our data, and shapes for which we came up with names but there are no established names), we nevertheless find that both models significantly correlate with human behavior; see the plot in Author response image 1, the equivalent of our Fig. 1E with the remaining shapes.

      Author response image 1.

      An identical analysis on the MEG leads to two noisy but significant clusters (CNN: 64.0ms to 172.0ms; then 192.0ms to 296.0ms; both p<.001; Geometric Features: 312.0ms to 364.0ms with p=.008). We have improved our manuscript thanks to the reviewer’s observation by adding a figure with the new behavior analysis to the supplementary figures and to the results section of the behavior task. We now refer to these analyses where appropriate:

      (intro) “The effect appeared as a human universal, present in preschoolers, first-graders, and adults without access to formal western math education (the Himba from Namibia), and thus seemingly independent of education and of the existence of linguistic labels for regular shapes.”

      (behavior results) “Finally, to separate the effect of name availability and geometric features on behavior, we replicated our analysis after removing the square, rectangle, trapezoids, rhombus and parallelogram from our data (Fig. S5D). This left us with five shapes, and an RDM with 10 entries. When regressing it in a GLM with our two models, we find that both models are still significant predictors (p<.001). The effect size of the geometric feature model is greatly reduced, yet remained significantly higher than that of the neural network model (p<.001).”

      (meg results) “This analysis yielded similar clusters when performed on a subset of shapes that do not have an obvious name in English, as was the case for the behavior analysis (CNN Encoding: 64.0ms to 172.0ms; then 192.0ms to 296.0ms; both p<.001; Geometric Features: 312.0ms to 364.0ms with p=.008).”

      (discussion, end of behavior section) “Previously, we only found such a significant mixture of predictors in uneducated humans (whether French preschoolers or adults from the Himba community, mitigating the possible impact of explicit western education, linguistic labels, and statistics of the environment on geometric shape representation) (Sablé-Meyer et al., 2021).”
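
      The reduced analysis quoted above (five shapes, hence 5 × 4 / 2 = 10 pairwise dissimilarities, regressed on two model RDMs) can be sketched as follows. The z-scoring of regressors, the function names, and the toy RDMs in the usage note are assumptions for illustration only; the significance tests reported in the quoted text are not reproduced here.

```python
import numpy as np

def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a square RDM."""
    rdm = np.asarray(rdm, dtype=float)
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

def rdm_glm(data_rdm, model_rdms):
    """Regress the data RDM on several model RDMs; returns one beta per model."""
    y = standardize(upper_triangle(data_rdm))
    cols = [standardize(upper_triangle(m)) for m in model_rdms]
    design = np.column_stack([np.ones_like(y)] + cols)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]  # drop the intercept
```

      With only 10 entries per RDM the betas are estimated from very few data points, which is consistent with the reduced effect size noted in the quoted text.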

      Perhaps the referee’s point can also be reversed: we provide a normative theory of geometric shape complexity which has the potential to explain why certain shapes have names: instead of seeing shape names as the cause of their simpler mental representation, we suggest that the converse could occur, i.e. the simpler shapes are the ones that are given names.

      (2) Task difficulty

      On the other hand is the question of whether our effect is driven by task difficulty. First, we would like to point out that this point could apply to the fMRI task, which asks for an explicit detection of deviants, but does not apply to the MEG experiment. In MEG, participants passively looked at sequences of shapes which, for a given block, comprised many instances of a fixed standard shape and rare deviants–even if they notice deviants, they have no task related to them. Yet two independent findings validated the geometric features model: there was a large effect of geometric regularity on the MEG response to deviants, and the MEG dissimilarity matrix between standard shapes correlated with a model based on geometric features, better than with a model based on CNNs. While the response to rare deviants might perhaps be attributed to “difficulty” (assuming that, in spite of the absence of an explicit task, participants try to spot the deviants and find this self-imposed task more difficult in runs with less regular shapes), it seems very hard to explain the representational similarity analysis (RSA) findings based on difficulty. Indeed, what motivated us to use RSA in both fMRI and MEG was to stop relying on the response to deviants, and use solely the data from standard or “reference” shapes, and model their neural response with theory-derived regressors.

      We have updated the manuscript in several places to make our view on these points clearer:

      (experiment 4) “This design allowed us to study the neural mechanisms of the geometric regularity effect without confounding effects of task, task difficulty, or eye movements.”

      (figure 4, legend) “(A) Task structure: participants passively watch a constant stream of geometric shapes, one per second (presentation time 800ms). The stimuli are presented in blocks of 30 identical shapes up to scaling and rotation, with 4 occasional deviant shapes. Participants do not have a task to perform besides fixating.”

      Reviewer #2 (Public review):

      Weakness:

      Given that the primary takeaway from this study is that geometric shape information is found in the dorsal stream, rather than the ventral stream, there is very little discussion of prior work in this area (for reviews, see Freud et al., 2016; Orban, 2011; Xu, 2018). Indeed, there is extensive evidence of shape processing in the dorsal pathway in human adults (Freud, Culham, et al., 2017; Konen & Kastner, 2008; Romei et al., 2011), children (Freud et al., 2019), patients (Freud, Ganel, et al., 2017), and monkeys (Janssen et al., 2008; Sereno & Maunsell, 1998; Van Dromme et al., 2016), as well as the similarity between models and dorsal shape representations (Ayzenberg & Behrmann, 2022; Han & Sereno, 2022).

      We thank the reviewer for this opportunity to clarify our writing. We want to use this opportunity to highlight that our primary finding is not about whether the shapes of objects or animals (in general) are processed in the ventral versus the dorsal pathway, but rather about the much more restricted domain of geometric shapes such as squares and triangles. We propose that simple geometric shapes afford additional levels of mental representation that rely on their geometric features – on top of the typical visual processing. To the best of our knowledge, this point has not been made in the above papers.

      Still, we agree that it is useful to better link our proposal to previous ones. We have updated the discussion section titled “Two Visual Pathways” to include more specific references to the literature that have reported visual object representations in the dorsal pathway. Following another reviewer’s observation, we have also updated our analysis to better demonstrate the overlap in activation evoked by math and by geometry in the IPS, as well as include a novel comparison with independently published results.

      Overall, to address this point, we (i) show the overlap between our “geometry” contrast (shape > word+tools+houses) and our “math” contrast (number > words); (ii) we display these ROIs side by side with ROIs found in previous work (Amalric and Dehaene, 2016), and (iii) in each math-related ROI reported in that article, we test our “geometry” (shape > word+tools+houses) contrast and find almost all of them to be significant in both populations; see Fig. S5.

      Finally, within the ROIs identified with our geometry localizer, we also performed similarity analyses: for each region we extracted the betas of every voxel for every visual category, and estimated the distance (cross-validated Mahalanobis) between different visual categories. In both ventral ROIs, in both populations, numbers were closer to shapes than to the other visual categories including text and Chinese characters (all p<.001). In adults, this result also holds for the right ITG (p=.021) and the left IPS (p=.014) but not the right IPS (p=.17). In children, this result did not hold in these areas.
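
      A cross-validated Mahalanobis ("crossnobis") distance between two categories can be sketched as below. The fold structure (one beta pattern per run), the use of a single supplied noise covariance, and the function name are assumptions; the response does not describe how the noise covariance was estimated or how folds were defined.

```python
import numpy as np

def crossnobis(a_folds, b_folds, noise_cov):
    """a_folds, b_folds: folds x voxels beta patterns for two categories.

    noise_cov: voxels x voxels noise covariance (assumed to be given).
    Hypothetical helper, not the authors' implementation.
    """
    a = np.asarray(a_folds, dtype=float)
    b = np.asarray(b_folds, dtype=float)
    precision = np.linalg.inv(noise_cov)
    diffs = a - b  # one pattern difference per fold
    n = diffs.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Multiply differences from independent folds only, so that
                # noise does not inflate the distance estimate.
                total += diffs[i] @ precision @ diffs[j]
    return total / (n * (n - 1))
```

      Because differences are multiplied across independent folds, the estimate is unbiased and can be negative when two categories evoke indistinguishable patterns.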

      Naturally, overlap in brain activation does not suffice to conclude that the same computational processes are involved. We have added an explicit caveat about this point. Indeed, throughout the article,  we have been careful to frame our results in a way that is appropriate given our evidence, e.g. saying “Those areas are similar to those active during number perception, arithmetic, geometric sequences, and the processing of high-level math concepts” and “The IPS areas activated by geometric shapes overlap with those active during the comprehension of elementary as well as advanced mathematical concepts”. We have rephrased the possibly ambiguous “geometric shapes activated math- and number-related areas, particular the right aIPS.” into “geometric shapes activated areas independently found to be activated by math- and number-related tasks, in particular the right aIPS”.

      Reviewer #3 (Public review):

      Weakness:

      Perhaps the manuscript could emphasize that the areas recruited by geometric figures but not objects are spatial, with reduced processing in visual areas. It also seems important to say that the images of real objects are interpreted as representations of 3D objects, as they activate the same visual areas as real objects. By contrast, the images of geometric forms are not interpreted as representations of real objects but rather perhaps as 2D abstractions.

      This is an interesting possibility. Geometric shapes are likely to draw attention to spatial dimensions (e.g. length) and to do so in a 2D spatial frame of reference rather than the 3D representations evoked by most other objects or images. However, this possibility would require further work to be thoroughly evaluated, for instance by comparing usual 3D objects with rare instances of 2D ones (e.g. a sheet of paper, a sticker etc). In the absence of such a test, we refrained from further speculation on this point.

      The authors use the term "symbolic." That use of that term could usefully be expanded here.  

      The reviewer is right in pointing out that “symbolic” should have been more clearly defined. We have now added in the introduction:

      (introduction) “[…] we sometimes refer to this model as “symbolic” because it relies on discrete, exact, rule-based features rather than continuous representations (Sablé-Meyer et al., 2022). In this representational format, geometric shapes are postulated to be represented by symbolic expressions in a “language-of-thought”, e.g. “a square is a four-sided figure with four equal sides and four right angles” or equivalently by a computer-like program for drawing them in a Logo-like language (Sablé-Meyer et al., 2022).”

      Here, however, the present experiments do not directly probe this format of a representation. We have therefore simplified our wording and removed many of our uses of the word “symbolic” in favor of the more specific “geometric features”.

      Pigeons have remarkable visual systems. According to my fallible memory, Herrnstein investigated visual categories in pigeons. They can recognize individual people from fragments of photos, among other feats. I believe pigeons failed at geometric figures and also at cartoon drawings of things they could recognize in photos. This suggests they did not interpret line drawings of objects as representations of objects.

      The comparison of geometric abilities across species is an interesting line of research. In the discussion, we briefly mention several lines of research that indicate that non-human primates do not perceive geometric shapes in the same way as we do – but for space reasons, we are reluctant to expand this section to a broader review of other more distant species. The referee is right that there is evidence of pigeons being able to perceive an invariant abstract 3D geometric shape in spite of much variation in viewpoint (Peissig et al., 2019) – but there does not seem to be evidence that they attend to geometric regularities specifically (e.g. squares versus non-squares). Also, the referee’s point bears on the somewhat different issue of whether humans and other animals may recognize the object depicted by a symbolic drawing (e.g. a sketch of a tree). Again, humans seem to be vastly superior in this domain, and research on this topic is currently ongoing in the lab. However, the point that we are making in the present work is specifically about the neural correlates of the representation of simple geometric shapes which by design were not intended to be interpretable as representations of objects.

      Categories are established in part by contrast categories; are quadrilaterals, triangles, and circles different categories?

      We are not sure how to interpret the referee’s question, since it bears on the definition of “category” (Spontaneous? After training? With what criterion?). While we are not aware of data that can unambiguously answer the reviewer’s question, categorical perception in geometric shapes can be inferred from early work investigating pop-out effects in visual search, e.g. (Treisman and Gormican, 1988): curvature appears to generate strong pop-out effects, and therefore we would expect e.g. circles to indeed be a different category than, say, triangles. Similarly, right angles, as well as parallel lines, have been found to be perceived categorically (Dillon et al., 2019).

      This suggests that indeed squares would be perceived as categorically different from triangles and circles. On the other hand, in our own previous work (Sablé-Meyer et al., 2021) we have found that the deviants that we generated from our quadrilaterals did not pop out from displays of reference quadrilaterals. Pop-out is probably not the proper criterion for defining what a “category” is, but this is the extent to which we can provide an answer to the reviewer’s question.

      It would be instructive to investigate stimuli that are on a continuum from representational to geometric, e.g., table tops or cartons under various projections, or balls or buildings that are rectangular or triangular. Building parts, inside and out, like corners. Objects differ from geometric forms in many ways: 3D rather than 2D, more complicated shapes, and internal texture. The geometric figures used are flat, 2-D, but much geometry is 3-D (e. g. cubes) with similar abstract features.

      We agree that there is a whole line of potential research here. We decided to start by focusing on the simplest set of geometric shapes that would give us enough variation in geometric regularity while being easy to match on other visual features. We agree with the reviewer that our results should hold not only for more complex 2-D shapes, but also for 3-D shapes. Indeed, generative theories of shapes in higher dimensions following similar principles as ours have been devised (I. Biederman, 1987; Leyton, 2003). We now mention this in the discussion:

      “Finally, this research should ultimately be extended to the representation of 3-dimensional geometric shapes, for which similar symbolic generative models have indeed been proposed (Irving Biederman, 1987; Leyton, 2003).”

      The feature space of geometry is more than parallelism and symmetry; angles are important, for example. Listing and testing features would be fascinating. Similarly, looking at younger or preferably non-Western children, as Western children are exposed to shapes in play at early ages.

      We agree with the reviewer on all points. While we do not list and test the different properties separately in this work, we would like to highlight that angles are part of our geometric feature model, which includes features of “right-angle” and “equal-angles” as suggested by the reviewer.

      We also agree about the importance of testing populations with limited exposure to formal training with geometric shapes. This was in fact a core aspect of a previous article of ours which tests both preschoolers, and adults with no access to formal western education – though no non-Western children (Sablé-Meyer et al., 2021). It remains a challenge to perform brain-imaging studies in non-Western populations (although see Dehaene et al., 2010; Pegado et al., 2014).

      What in human experience but not the experience of close primates would drive the abstraction of these geometric properties? It's easy to make a case for elaborate brain processes for recognizing and distinguishing things in the world, shared by many species, but the case for brain areas sensitive to processing geometric figures is harder. The fact that these areas are active in blind mathematicians and that they are parietal areas suggests that what is important is spatial far more than visual. Could these geometric figures and their abstract properties be connected in some way to behavior, perhaps with fabrication and construction as well as use? Or with other interactions with complex objects and environments where symmetry and parallelism (and angles and curvature--and weight and size) would be important? Manual dexterity and fabrication also distinguish humans from great apes (quantitatively, not qualitatively), and action drives both visual and spatial representations of objects and spaces in the brain. I certainly wouldn't expect the authors to add research to this already packed paper, but raising some of the conceptual issues would contribute to the significance of the paper.

      We refrained from speculating about this point in the previous version of the article, but share some of the reviewers’ intuitions about the underlying drive for geometric abstraction. As described in (Dehaene, 2026; Sablé-Meyer et al., 2022), our hypothesis, which isn’t tested in the present article, is that the emergence of a pervasive ability to represent aspects of the world as compact expressions in a mental “language-of-thought” is what underlies many domains of specific human competence, including some listed by the reviewer (tool construction, scene understanding) and our domain of study here, geometric shapes.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I enjoyed reading this paper. It is clearly written and nicely showcases the amount of work that has gone into conducting all these experiments and analyzing the data in sophisticated ways. I also thought the figures were great, and I liked the level of organization in the GitHub repository and am looking forward to seeing the shared data on OpenNeuro. I have some specific questions I hope the authors can address.

      (1) Behavior

      - Looking at Figure 1, it seemed like most shapes are clustering together, whereas square, rectangle, and maybe rhombus and parallelogram are slightly more unique. I was wondering whether the authors could comment on the potential influence of linguistic labels. Is it possible that it is easier to discard the intruder when the shapes are readily nameable versus not?

      This is an interesting observation, but the existence of names for shapes does not suffice to explain all of our findings; see our reply to the public comment.

      (2) fMRI

      - As mentioned in the public review, I was surprised that the authors went with an intruder task because I would imagine that performance depends on the specific combination of geometric shapes used within a trial. I assume it is much harder to find, for example, a "Right Hinge" embedded within "Hinge" stimuli than a "Right Hinge" amongst "Squares". In addition, the rotation and scaling of each individual item should affect regular shapes less than irregular shapes, creating visual dissimilarities that would presumably make the task harder. Can the authors comment on how we can be sure that the differences we pick up in the parietal areas are not related to task difficulty but are truly related to geometric shape regularities?

      Again, please see our public review response for a larger discussion of the impact of task difficulty. There are two aspects to answering this question.

      First, the task is not as the reviewer describes: the intruder task is to find a deviant shape within several slightly rotated and scaled versions of the regular shape it came from. During brain imaging, we did not ask participants to find an exemplar of one of our reference shapes amidst copies of another, but rather a deviant version of one shape against copies of its reference version. We only used this intruder task with all pairs of shapes to generate the behavioral RSA matrix.

      Second, we agree that some of the fMRI effect may stem from task difficulty, and this motivated our use of RSA analysis in fMRI, and a passive MEG task. RSA results cannot be explained by task difficulty.

      Overall, we have tried to make the limitations of the fMRI design, and the motivation for turning to passive presentation in MEG, clearer by stating the issues more clearly when we introduce experiment 4:

      “The temporal resolution of fMRI does not allow us to track the dynamics of mental representations over time. Furthermore, the previous fMRI experiment suffered from several limitations. First, we studied six quadrilaterals only, compared to 11 in our previous behavioral work. Second, we used an explicit intruder detection task, which implies that the geometric regularity effect was correlated with task difficulty, and we cannot exclude that this factor alone explains some of the activations in figure 3C (although it is much less clear how task difficulty alone would explain the RSA results in figure 3D). Third, the long display duration, which was necessary for good task performance especially in children, afforded the possibility of eye movements, which were not monitored inside the 3T scanner and again could have affected the activations in figure 3C.”

      - How far in the periphery were the stimuli presented? Was eye-tracking data collected for the intruder task? Similar to the point above, I would imagine that a harder trial would result in more eye movements to find the intruder, which could drive some of the differences observed here.

      A 1-degree bar was added to Figure 3A, which faithfully illustrates how the stimuli were presented in fMRI. Eye-tracking data were not collected during fMRI. Although the participants were explicitly instructed to fixate at the center of the screen and avoid eye movements, we fully agree with the referee that we cannot exclude that eye movements were present, perhaps more so for more difficult displays, and would therefore have contributed to the observed fMRI activations in experiment 3 (figure 3C). We now mention this limitation explicitly at the end of experiment 3. However, crucially, this potential problem cannot apply to the MEG data. During the MEG task, the stimuli were presented one by one at the center of the screen, without any explicit task, thus avoiding issues of eye movements. We therefore consider the MEG geometrical regularity effect, which comes at a relatively early latency (starting at ~160 ms) and even in a passive task, to provide the strongest evidence of geometric coding, unaffected by potential eye movement artefacts.

      - I was wondering whether the authors would consider showing some un-thresholded maps just to see how widespread the activation of the geometric shapes is across all of the cortex.

      We share the uncorrected threshold maps in Fig. S3 for both adults and children in the category localizer, copied here as well. For the geometry task, most of the clusters identified are fairly big and survive cluster-corrected permutations; the uncorrected statistical maps look almost identical to the one presented in Fig. 3 (p<.001 map).

      - I'm missing some discussion on the role of early visual areas that goes beyond the RSA-CNN comparison. I would imagine that early visual areas are not only engaged due to top-down feedback (line 258) but may actually also encode some of the geometric features, such as parallel lines and symmetry. Is it feasible to look at early visual areas and examine what the similarity structure between different shapes looks like?

      If early visual areas encoded the geometric features that we propose, then even early sensor-level RSA matrices should show a strong impact of geometric feature similarity, which is not what we find (figure 4D). We do, however, appreciate the referee’s request to examine more closely what this similarity structure looks like. We now provide a movie showing the significant correlation between neural activity and our two models (uncorrected participants); indeed, while the early occipital activity (around 110 ms) is dominated by a significant correlation with the CNN model, there are also scattered significant sources associated with the symbolic model around these timepoints already.

      To test this further, we used beamformers to reconstruct the source-localized activity in calcarine cortex and performed an RSA analysis across that ROI. We find that indeed the CNN model is strongly significant at t=110 ms (t=3.43, df=18, p=.003) while the geometric feature model is not (t=1.04, df=18, p=.31), and the CNN is significantly above the geometric feature model (t=4.25, df=18, p<.001). However, this result is not very stable across time, and there are significant temporal clusters around these timepoints associated with each model, with no significant cluster for the CNN > geometric contrast (CNN: significant cluster from 88 ms to 140 ms, p<.001, permutation-based test with 10000 permutations; geometric features: significant cluster from 80 ms to 104 ms, p=.0475; no significant cluster for the difference between the two).
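      To make the logic of this model comparison concrete, here is a minimal sketch on simulated data. It is not the pipeline used in the article: the dissimilarity vectors, the noise level, the number of shape pairs and the helper names (`standardized_betas`, `one_sample_t`) are all assumptions chosen for illustration. Each simulated participant's neural dissimilarities are regressed on two model dissimilarity vectors, and the resulting betas are tested against zero across participants, with 19 participants to match df=18.

```python
import math
import random
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardized_betas(neural, model_a, model_b):
    """Standardized regression weights of `neural` on two predictors.

    Closed form of a two-predictor multiple regression on z-scored data.
    """
    r_ya = pearson(neural, model_a)
    r_yb = pearson(neural, model_b)
    r_ab = pearson(model_a, model_b)
    denom = 1.0 - r_ab ** 2
    return (r_ya - r_yb * r_ab) / denom, (r_yb - r_ya * r_ab) / denom


def one_sample_t(values):
    """One-sample t statistic against zero, with its degrees of freedom."""
    n = len(values)
    return mean(values) / (stdev(values) / math.sqrt(n)), n - 1


# Simulated example (all values hypothetical).
rng = random.Random(0)
n_pairs = 55         # 11 shapes give 11 * 10 / 2 pairs
n_participants = 19  # chosen so that df = 18

cnn_rdm = [rng.random() for _ in range(n_pairs)]
geo_rdm = [rng.random() for _ in range(n_pairs)]

betas_cnn, betas_geo = [], []
for _ in range(n_participants):
    # Simulated early response: driven by the CNN model only, plus noise.
    neural = [c + rng.gauss(0.0, 0.5) for c in cnn_rdm]
    b_cnn, b_geo = standardized_betas(neural, cnn_rdm, geo_rdm)
    betas_cnn.append(b_cnn)
    betas_geo.append(b_geo)

t_cnn, df = one_sample_t(betas_cnn)
t_geo, _ = one_sample_t(betas_geo)
t_diff, _ = one_sample_t([c - g for c, g in zip(betas_cnn, betas_geo)])
print(f"CNN t({df})={t_cnn:.2f}, geometric t({df})={t_geo:.2f}, difference t({df})={t_diff:.2f}")
```

      The third test is a paired comparison, computed as a one-sample test on the per-participant difference between the two betas.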

      (3) MEG

      - Similar to the fMRI set, I am a little worried that task difficulty has an effect on the decoding results, as the oddball should pop out more in more geometric shapes, making it easier to detect and easier to decode. Can the authors comment on whether it would matter for the conclusions whether they are decoding varying task difficulty or differences in geometric regularity, or whether they think this can be considered similarly?

      See above for an extensive discussion of the task difficulty effect. We point out that there is no task in the MEG data collection part. We have clarified the task design by updating our Fig. 4. Additionally, the fact that oddballs are perceived more or less easily as a function of their geometric regularity is, in part, exactly the point that we are making – but, in MEG, even in the absence of a task of looking for them.

      - The authors discuss that the inflated baseline/onset decoding/regression estimates may occur because the shapes are being repeated within a mini-block, which I think is unlikely given the long ISIs and the fact that the geometric features model is not >0 at onset. I think their second possible explanation, that this may have to do with smoothing, is very possible. In the text, it said that for the non-smoothed result, the CNN encoding correlates with the data from 60ms, which makes a lot more sense. I would like to encourage the authors to provide readers with the unsmoothed beta values instead of the 100-ms smoothed version in the main plot to preserve the reason they chose to use MEG - for high temporal resolution!

      We fully agree with the reviewer and have accordingly updated the figures to show the unsmoothed data (see below). Indeed, there is now no significant CNN effect before ~60 ms (up to the accuracy of identifying onsets with our method).
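      The effect of smoothing on apparent onsets can be illustrated with a toy example. The sketch below assumes a centered moving average and a 4 ms sampling step, neither of which is specified here, and uses an idealized step response that starts at 60 ms. With a 100 ms window, the smoothed trace becomes non-zero almost 50 ms before the true onset.

```python
def moving_average(signal, window):
    """Centered moving average; near the edges, only available samples are used."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def first_nonzero_time(signal, times, eps=1e-12):
    """Time of the first sample whose value exceeds `eps`, or None."""
    for t, v in zip(times, signal):
        if v > eps:
            return t
    return None


step_ms = 4                              # hypothetical sampling step (250 Hz)
times = list(range(-100, 400, step_ms))  # ms relative to stimulus onset
true_onset_ms = 60                       # idealized onset of the effect

raw = [1.0 if t >= true_onset_ms else 0.0 for t in times]
smoothed = moving_average(raw, window=100 // step_ms)  # 25 samples = 100 ms

print(first_nonzero_time(raw, times))       # 60
print(first_nonzero_time(smoothed, times))  # 12
```

      A causal (backward-looking) window would not shift the onset earlier, but it would delay and blur the peak instead, so the choice of window alignment matters when reading latencies off smoothed curves.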

      - In Figure 4C, I think it would be useful to either provide error bars or show variability across participants by plotting each participant's beta values. I think it would also be nice to plot the dissimilarity matrices based on the MEG data at select timepoints, just to see what the similarity structure is like.

      Following the reviewer’s recommendation, we plot the timeseries with SEM as shaded area, and thicker lines for statistically significant clusters, and we provide the unsmoothed version in Fig. 4. As for the dissimilarity matrices at select timepoints, these have now been added to Fig. 4.
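      For readers unfamiliar with how temporal clusters are assessed, the following is a minimal sketch of a one-sided, sign-flip cluster permutation test on simulated time courses. It is an illustration under stated assumptions rather than the implementation used for the figures: the cluster-forming threshold (2.101, the two-sided .05 critical value for df=18), the number of permutations, the simulated effect window and all function names are hypothetical choices.

```python
import math
import random


def t_values(data):
    """One-sample t-value at each time point; data[subject][time]."""
    n = len(data)
    out = []
    for t in range(len(data[0])):
        col = [row[t] for row in data]
        m = sum(col) / n
        var = sum((v - m) ** 2 for v in col) / (n - 1)
        out.append(m / math.sqrt(var / n))
    return out


def clusters(tvals, threshold):
    """(start, end, mass) for each run of consecutive t-values above threshold."""
    found, start, mass = [], None, 0.0
    for i, t in enumerate(tvals):
        if t > threshold:
            if start is None:
                start, mass = i, 0.0
            mass += t
        elif start is not None:
            found.append((start, i - 1, mass))
            start = None
    if start is not None:
        found.append((start, len(tvals) - 1, mass))
    return found


def cluster_permutation_test(data, threshold, n_perm, rng):
    """Return (start, end, mass, p) for each observed cluster.

    The null distribution is the maximum cluster mass obtained after
    randomly flipping the sign of each subject's whole time course.
    """
    observed = clusters(t_values(data), threshold)
    null_max = []
    for _ in range(n_perm):
        flipped = []
        for row in data:
            sign = rng.choice((-1.0, 1.0))
            flipped.append([sign * v for v in row])
        masses = [m for _, _, m in clusters(t_values(flipped), threshold)]
        null_max.append(max(masses, default=0.0))
    results = []
    for start, end, mass in observed:
        exceed = sum(m >= mass for m in null_max)
        results.append((start, end, mass, (exceed + 1) / (n_perm + 1)))
    return results


# Simulated data: 19 subjects, 40 time points, true effect on samples 10 to 20.
rng = random.Random(1)
data = [
    [(1.0 if 10 <= t <= 20 else 0.0) + rng.gauss(0.0, 0.3) for t in range(40)]
    for _ in range(19)
]
results = cluster_permutation_test(data, threshold=2.101, n_perm=1000, rng=rng)
for start, end, mass, p in results:
    print(f"samples {start}-{end}: mass={mass:.1f}, p={p:.4f}")
```

      Because the statistic is the maximum cluster mass over the whole time course, the resulting p-values are corrected for multiple comparisons across time points.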

      - To evaluate the source model reconstruction, I think the reader would need a little more detail on how it was done in the main text. How were the lead fields calculated? Which data was used to estimate the sources? How are the models correlated with the source data?

      We have imported some of the details in the main text as follows (as well as expanding the methods section a little):

      “To understand which brain areas generated these distinct patterns of activations, and probe whether they fit with our previous fMRI results, we performed a source reconstruction of our data. We projected the sensor activity onto each participant's cortical surfaces estimated from T1-images. The projection was performed using eLORETA and empty-room recordings acquired on the same day to estimate noise covariance, with the default parameters of mne-bids-pipeline. Sources were spaced using a recursively subdivided octahedron (oct5). Group statistics were performed after alignment to fsaverage. We then replicated the RSA analysis […]”

      - In addition to fitting the CNN, which is used here to model differences in early visual cortex, have the authors considered looking at their fMRI results and localizing early visual regions, extracting a similarity matrix, and correlating that with the MEG and/or comparing it with the CNN model?

      We had ultimately decided against comparing the empirical similarity matrices from the MEG and fMRI experiments, first because the stimuli and tasks are different, and second because this would not be directly relevant to our goal, which is to evaluate whether a geometric-feature model accounts for the data. Thus, we systematically model empirical similarity matrices from fMRI and from MEG with our two models derived from different theories of shape perception in order to test predictions about their spatial and temporal dynamics. As for comparing the similarity matrix from early visual regions in fMRI with that predicted by the CNN model, this is effectively visible from our Fig. 3D where we perform searchlight RSA analysis and modeling with both the CNN and the geometric feature model; bilaterally, we find a correlation with the CNN model, although it sometimes overlaps with predictions from the geometric feature model as well. We now include a section explaining this reasoning in the appendix:

      “Representational similarity analysis also offers a way to directly compare similarity matrices measured in MEG and fMRI, thus allowing for fusion of those two modalities and tentatively assigning a “time stamp” to distinct MRI clusters. However, we did not attempt such an analysis here for several reasons. First, distinct tasks and block structures were used in MEG and fMRI. Second, a smaller list of shapes was used in fMRI, as imposed by the slower modality of acquisition. Third, our study was designed as an attempt to decide between two models of geometric shape recognition. We therefore focused all analyses on this goal, which could not have been achieved by direct MEG-fMRI fusion, but required correlation with independently obtained model predictions.”

      Minor comments

      - It's a little unclear from the abstract that there is children's data for fMRI only.

      We have reworded the abstract to make this unambiguous.

      - Figures 4a & b are missing y-labels.

      We can see how our labels could be confused with (sub-)plot titles and have moved them to make the interpretation clearer.

      - MEG: are the stimuli always shown in the same orientation and size?

      They are not: each shape has a random orientation and scaling. In addition to the task example at the top of Fig. 4, we have now included a clearer mention of this in the main text when we introduce the task:

      “shapes were presented serially, one at a time, with small random changes in rotation and scaling parameters, in miniblocks with a fixed quadrilateral shape and with rare intruders with the bottom right corner shifted by a fixed amount (Sablé-Meyer et al., 2021)”

      - To me, the discussion section felt a little lengthy, and I wonder whether it would benefit from being a little more streamlined, focused, and targeted. I found that the structure was a little difficult to follow as it went from describing the result by modality (behavior, fMRI, MEG) back to discussing mostly aspects of the fMRI findings.

      We have tried to re-organize and streamline the discussion following these comments.

      Then, later on, I found that especially the section on "neurophysiological implementation of geometry" went beyond the focus of the data presented in the paper and was comparatively long and speculative.

      We have reexamined the discussion, but the citation of papers emphasizing a representation of non-accidental geometric properties in non-human animals was requested by other commentators on our article; and indeed, we think that they are relevant in the context of our prior suggestion that the composition of geometric features might be a uniquely human feature – these papers suggest that individual features may not, and that it is therefore compositionality which might be special to the human brain. We have nevertheless shortened it.

      Furthermore, we think that this section is important because symbolic models are often criticized for lack of a plausible neurophysiological implementation. It is therefore important to discuss whether and how the postulated symbolic geometric code could be realized in neural circuits. We have added this justification to the introduction of this section.

      Reviewer #2 (Recommendations for the authors):

      (1) If the authors want to specifically claim that their findings align with mathematical reasoning, they could at least show the overlap between the activation maps of the current study and those from prior work.

      This was added to the fMRI results. See our answers to the public review.

      (2) I wonder if the reason the authors only found aIPS in their first analysis (Figure 2) is because they are contrasting geometric shapes with figures that also have geometric properties. In other words, faces, objects, and houses also contain geometric shape information, and so the authors may have essentially contrasted out other areas that are sensitive to these features. One indication that this may be the case is that the geometric regularity effect and searchlight RSA (Figure 3) contains both anterior and posterior IPS regions (but crucially, little ventral activity). It might be interesting to discuss the implications of these differences.

      Indeed, we cannot exclude that the few symmetries, perpendicularity and parallelism cues that can be presented in faces, objects or houses were processed as such, perhaps within the ventral pathway, and that these representations would have been subtracted out. We emphasize that our subtraction isolates the geometrical features that are present in simple regular geometric shapes, over and above those that might exist in other categories. We have added this point to the discussion:

      “[… ] For instance, faces possess a plane of quasi-symmetry, and so do many other man-made tools and houses. Thus, our subtraction isolated the geometrical features that are present in simple regular geometric shapes (e.g. parallels, right angles, equality of length) over and above those that might already exist, in a less pure form, in other categories.”

      (3) I had a few questions regarding the MEG results.

      a. I didn't quite understand the task. What is a regular or oddball shape in this context? It's not clear what is being decoded. Perhaps a small example of the MEG task in Figure 4 would help?

      We now include an additional sub-figure in Fig. 4 to explain the paradigm. In brief: there is no explicit task, participants are simply asked to fixate. The shapes come in miniblocks of 30 identical reference shapes (up to rotation and scaling), among which some occasional deviant shapes randomly appear (created by moving the corner of the reference shape by some amount).

      b. In Figure 4A/B they describe the correlation with a 'symbolic model'. Is this the same as the geometric model in 4C?

      It is. We have removed this ambiguity by calling it “geometric model” and setting its color to the one associated with this model throughout the article.

      c. The author's explanation for why geometric feature coding was slower than CNN encoding doesn't quite make sense to me. As an explanation, they suggest that previous studies computed "elementary features of location or motor affordance", whereas their study work examines "high-level mathematical information of an abstract nature." However, looking at the studies the authors cite in this section, it seems that these studies also examined the time course of shape processing in the dorsal pathway, not "elementary features of location or motor affordance." Second, it's not clear how the geometric feature model reflects high-level mathematical information (see point above about claiming this is related to math).

      We thank the referee for pointing out this inappropriate phrase, which we removed. We rephrased the rest of the paragraph to clarify our hypothesis in the following way:

      “However, in this work, we specifically probed the processing of geometric shapes that, if our hypothesis is correct, are represented as mental expressions that combine geometrical and arithmetic features of an abstract categorical nature, for instance representing “four equal sides” or “four right angles”. It seems logical that such expressions, combining number, angle and length information, take more time to be computed than the first wave of feedforward processing within the occipito-temporal visual pathway, and therefore only activate thereafter.”

      One explanation may be that the authors' geometric shapes require finer-grained discrimination than the object categories used in prior studies. i.e., the odd-ball task may be more of a fine-grained visual discrimination task. Indeed, it may not be a surprise that one can decode the difference between, say, a hammer and a butterfly faster than two kinds of quadrilaterals.

      We do not disagree with this intuition, although note that we do not have data on this point (we are reporting and modelling the MEG RSA matrix across geometric shapes only – in this part, no other shapes such as tools or faces are involved). Still, the difference between squares, rectangles, parallelograms and other geometric shapes in our stimuli is not so subtle. Furthermore, CNNs do make very fine-grained distinctions, for instance between many different breeds of dogs in the ImageNet corpus. Still, those sorts of distinctions capture the initial part of the MEG response, while the geometric model is needed only for the later part. Thus, we think that it is a genuine finding that geometric computations associated with the dorsal parietal pathway are slower than the image analysis performed by the ventral occipito-temporal pathway.

      d. CNN encoding at time 0 is a little weird, but the author's explanation, that this is explained by the fact that temporal smoothed using a 100 ms window makes sense. However, smoothing by 100 ms is quite a lot, and it doesn't seem accurate to present continuous time course data when the decoding or RSA result at each time point reflects a 100 ms bin. It may be more accurate to simply show unsmoothed data. I'm less convinced by the explanation about shape prediction.

      We agree. Following the reviewer’s advice, as well as the recommendation from reviewer 1, we now display unsmoothed plots, and the effects now exhibit a more reasonable timing (Figure 4D), with effects starting around ~60 ms for CNN encoding.

      (4) I appreciate the authors' use of multiple models and their explanation for why DINOv2 explains more variance than the geometric and CNN models (that it represents both types of features). A variance partitioning analysis may help strengthen this conclusion (Bonner & Epstein, 2018; Lescroart et al., 2015).

      However, one difference between DINOv2 and the CNN used here is that it is trained on a dataset of 142 million images vs. the 1.5 million images used in ImageNet. Thus, DINOv2 is more likely to have been exposed to simple geometric shapes during training, whereas standard ImageNet trained models are not. Indeed, prior work has shown that lesioning line drawing-like images from such datasets drastically impairs the performance of large models (Mayilvahanan et al., 2024). Thus, it is unlikely that the use of a transformer architecture explains the performance of DINOv2. The authors could include an ImageNet-trained transformer (e.g., ViT) and a CNN trained on large datasets (e.g., ResNet trained on the Open Clip dataset) to test these possibilities. However, I think it's also sufficient to discuss visual experience as a possible explanation for the CNN and DINOv2 results. Indeed, young children are exposed to geometric shapes, whereas ImageNet-trained CNNs are not.

      We agree with the reviewer’s observation. In fact, new and ongoing work from the lab is also exploring this; we have included in supplementary materials exactly what the reviewer is suggesting, namely the time course of the correlation with ViT and with ConvNeXT. In line with the reviewer’s prediction, these networks, trained on much larger datasets and with many more parameters, can also fit the human data as well as DINOv2. We ran additional analyses of the MEG data with ViT and ConvNeXT, which we now report in Fig. S6 as well as in an additional sentence in that section:

      “[…] similar results were obtained by performing the same analysis, not only with another vision transformer network, ViT, but crucially using a much larger convolutional neural network, ConvNeXT, which comprises ~800M parameters and has been trained on 2B images, likely including many geometric shapes and human drawings. For the sake of completeness, RSA analysis in sensor space of the MEG data with these two models is provided in Fig. S6.”

      We conclude that the size and nature of the training set could be as important as the architecture – but also note that humans do not rely on such a huge training set. We have updated the text, as well as Fig. S6, accordingly by updating the section now entitled “Vision Transformers and Larger Neural Networks”, and the discussion section on theoretical models.
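      As a purely illustrative aside, the variance partitioning suggested by the reviewer can be sketched for two predictors as follows. This is not an analysis reported in the article: the toy vectors and the function name `variance_partition` are assumptions. The full-model R² is obtained in closed form from the pairwise correlations, and the unique and shared contributions follow by subtraction.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_partition(y, a, b):
    """Split the variance of y explained by predictors a and b.

    Returns the R^2 of the joint model and the parts that are unique to
    each predictor or shared between them.
    """
    r_ya, r_yb, r_ab = pearson(y, a), pearson(y, b), pearson(a, b)
    r2_a, r2_b = r_ya ** 2, r_yb ** 2
    r2_full = (r2_a + r2_b - 2 * r_ya * r_yb * r_ab) / (1 - r_ab ** 2)
    return {
        "full": r2_full,
        "unique_a": r2_full - r2_b,
        "unique_b": r2_full - r2_a,
        "shared": r2_a + r2_b - r2_full,
    }


# Toy dissimilarity vectors (hypothetical): two uncorrelated models whose sum
# reproduces the data exactly, so there is no shared variance.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [1.0, -1.0, -1.0, 1.0]
data = [a + b for a, b in zip(model_a, model_b)]

parts = variance_partition(data, model_a, model_b)
print(parts)
```

      When the two models are correlated, as a CNN model and a geometric feature model typically would be, the shared term becomes non-zero and indicates how much of the fit cannot be attributed to either model alone.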

      (5) The authors may be interested in a recent paper from Arcaro and colleagues that showed that the parietal cortex is greatly expanded in humans (including infants) compared to non-human primates (Meyer et al., 2025), which may explain the stronger geometric reasoning abilities of humans.

      A very interesting article indeed! We have updated our article to incorporate this reference in the discussion, in the section on visual pathways, as follows:

      “Finally, recent work shows that within the visual cortex, the strongest relative difference in growth between human and non-human primates is localized in parietal areas (Meyer et al., 2025). If this expansion reflected the acquisition of new processing abilities in these regions, it might explain the observed differences in geometric abilities between human and non-human primates (Sablé-Meyer et al., 2021).”

      Also, the authors may want to include this paper, which uses a similar oddity task and compellingly shows that crows are sensitive to geometric regularity:

      Schmidbauer, P., Hahn, M., & Nieder, A. (2025). Crows recognize geometric regularity. Science Advances, 11(15), eadt3718. https://doi.org/10.1126/sciadv.adt3718

      We have ongoing discussions with the authors of this work and have prepared a response to their findings (Sablé-Meyer and Dehaene, 2025). Ultimately, we think that this discussion, which we agree is important, does not have its place in the present article. They used a reduced version of our design, with amplified differences in the intruders. While they did not test the fit of their model with CNN or geometric feature models, we did and found that a simple CNN suffices to account for crow behavior. Thus, we disagree that their conclusions follow from their results. But the present article does not seem to be the right platform to engage in this discussion.

      References

      Ayzenberg, V., & Behrmann, M. (2022). The Dorsal Visual Pathway Represents Object-Centered Spatial Relations for Object Recognition. The Journal of Neuroscience, 42(23), 4693-4710. https://doi.org/10.1523/jneurosci.2257-21.2022

      Bonner, M. F., & Epstein, R. A. (2018). Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology, 14(4), e1006111. https://doi.org/10.1371/journal.pcbi.1006111

      Bueti, D., & Walsh, V. (2009). The parietal cortex and the representation of time, space, number and other magnitudes. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1525), 1831-1840.

      Dehaene, S., & Brannon, E. (2011). Space, time and number in the brain: Searching for the foundations of mathematical thought. Academic Press.

      Freud, E., Culham, J. C., Plaut, D. C., & Behrmann, M. (2017). The large-scale organization of shape processing in the ventral and dorsal pathways. eLife, 6, e27576.

      Freud, E., Ganel, T., Shelef, I., Hammer, M. D., Avidan, G., & Behrmann, M. (2017). Three-dimensional representations of objects in dorsal cortex are dissociable from those in ventral cortex. Cerebral Cortex, 27(1), 422-434.

      Freud, E., Plaut, D. C., & Behrmann, M. (2016). 'What' is happening in the dorsal visual pathway. Trends in Cognitive Sciences, 20(10), 773-784.

      Freud, E., Plaut, D. C., & Behrmann, M. (2019). Protracted developmental trajectory of shape processing along the two visual pathways. Journal of Cognitive Neuroscience, 31(10), 1589-1597.

      Han, Z., & Sereno, A. (2022). Modeling the Ventral and Dorsal Cortical Visual Pathways Using Artificial Neural Networks. Neural Computation, 34(1), 138-171. https://doi.org/10.1162/neco_a_01456

      Janssen, P., Srivastava, S., Ombelet, S., & Orban, G. A. (2008). Coding of shape and position in macaque lateral intraparietal area. Journal of Neuroscience, 28(26), 6679-6690.

      Konen, C. S., & Kastner, S. (2008). Two hierarchically organized neural systems for object information in human visual cortex. Nature Neuroscience, 11(2), 224-231.

      Lescroart, M. D., Stansbury, D. E., & Gallant, J. L. (2015). Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Frontiers in Computational Neuroscience, 9(135), 1-20. https://doi.org/10.3389/fncom.2015.00135

      Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., & Brendel, W. (2024). In search of forgotten domain generalization. arXiv Preprint arXiv:2410.08258.

      Meyer, E. E., Martynek, M., Kastner, S., Livingstone, M. S., & Arcaro, M. J. (2025). Expansion of a conserved architecture drives the evolution of the primate visual cortex. Proceedings of the National Academy of Sciences, 122(3), e2421585122. https://doi.org/10.1073/pnas.2421585122

      Orban, G. A. (2011). The extraction of 3D shape in the visual system of human and nonhuman primates. Annual Review of Neuroscience, 34, 361-388.

      Romei, V., Driver, J., Schyns, P. G., & Thut, G. (2011). Rhythmic TMS over Parietal Cortex Links Distinct Brain Frequencies to Global versus Local Visual Processing. Current Biology, 21(4), 334-337. https://doi.org/10.1016/j.cub.2011.01.035

      Sereno, A. B., & Maunsell, J. H. R. (1998). Shape selectivity in primate lateral intraparietal cortex. Nature, 395(6701), 500-503. https://doi.org/10.1038/26752

      Summerfield, C., Luyckx, F., & Sheahan, H. (2020). Structure learning and the posterior parietal cortex. Progress in Neurobiology, 184, 101717. https://doi.org/10.1016/j.pneurobio.2019.101717

      Van Dromme, I. C., Premereur, E., Verhoef, B.-E., Vanduffel, W., & Janssen, P. (2016). Posterior Parietal Cortex Drives Inferotemporal Activations During Three-Dimensional Object Vision. PLoS Biology, 14(4), e1002445. https://doi.org/10.1371/journal.pbio.1002445

      Xu, Y. (2018). A tale of two visual systems: Invariant and adaptive visual information representations in the primate brain. Annu. Rev. Vis. Sci, 4, 311-336.

      Reviewer #3 (Recommendations for the authors):

      Bring into the discussion some of the issues outlined above, especially a) the spatial rather than visual nature of the geometric figures and b) the non-representational aspects of geometric form.

      We thank the reviewer for their recommendations – see our response to the public review for more details.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Reviewer #1

      Evidence, reproducibility and clarity

      This paper addresses a very interesting problem of non-centrosomal microtubule organization in developing Drosophila oocytes. Using genetics and imaging experiments, the authors reveal an interplay between the activity of kinesin-1, together with its essential cofactor Ensconsin, and microtubule organization at the cell cortex by the spectraplakin Shot, minus-end binding protein Patronin and Ninein, a protein implicated in microtubule minus end anchoring. The authors demonstrate that the loss of Ensconsin affects the cortical accumulation of non-centrosomal microtubule organizing center (ncMTOC) proteins, microtubule length and vesicle motility in the oocyte, and show that this phenotype can be rescued by a constitutively active kinesin-1 mutant, but not by Ensconsin mutants deficient in microtubule or kinesin binding. The functional connection between Ensconsin, kinesin-1 and ncMTOCs is further supported by a rescue experiment with Shot overexpression. Genetics and imaging experiments further implicate Ninein in the same pathway. These data are a clear strength of the paper; they represent a very interesting and useful addition to the field.

      The weaknesses of the study are two-fold. First, the paper seems to lack a clear molecular model, uniting the observed phenomenology with the molecular functions of the studied proteins. Most importantly, it is not clear how kinesin-based plus-end directed transport contributes to cortical localization of ncMTOCs and regulation of microtubule length.

      Second, not all conclusions and interpretations in the paper are supported by the presented data.

      We thank the reviewer for recognizing the impact of this work. In response to the insightful suggestions, we performed extensive new experiments that establish a well-supported cellular and molecular model (Figure 7). The discussion has been restructured to directly link each conclusion to its corresponding experimental evidence, significantly strengthening the manuscript.

      Below is a list of specific comments, outlining the concerns, in the order of appearance in the paper/figures.

      Figure 1. The statement: "Ens loading on MTs in NCs and their subsequent transport by Dynein toward ring canals promotes the spatial enrichment of the Khc activator Ens in the oocyte" is not supported by data. The authors do not demonstrate that Ens is actually transported from the nurse cells to the oocyte while being attached to microtubules. They do show that the intensity of Ensconsin correlates with the intensity of microtubules, that the distribution of Ensconsin depends on its affinity to microtubules and that an Ensconsin pool locally photoactivated in a nurse cell can redistribute to the oocyte (and throughout the nurse cell) by what seems to be diffusion. The provided images suggest that Ensconsin passively diffuses into the oocyte and accumulates there because of higher microtubule density, which depends on dynein. To prove that Ensconsin is indeed transported by dynein in the microtubule-bound form, one would need to measure the residence time of Ensconsin on microtubules and demonstrate that it is longer than the time needed to transport microtubules by dynein into the oocyte; ideally, one would like to see movement of individual microtubules labelled with photoconverted Ensconsin from a nurse cell into the oocyte. Since microtubules are not enriched in the oocyte of the dynein mutant, analysis of Ensconsin intensity in this mutant is not informative and does not reveal the mechanism of Ensconsin accumulation.

      As noted by Reviewer 3, the directional movement of microtubules traveling at ~140 nm/s from nurse cells toward the oocyte through ring canals was previously reported using a tagged Ens-MT binding domain reporter line by Lu et al. (2022). We have therefore added a citation of this crucial work in the revised version of the manuscript (lines 155-157) and removed the photo-conversion panel.

      Critically, however, our study provides mechanistic insight that was missing from this earlier work: this mechanism is also crucial to enrich MAPs in the oocyte. The fact that Dynein mutants fail to enrich Ensconsin is a crucial piece of evidence: it supports a model of Ensconsin-loaded MT transport (Figure 1D-1F).

      Figure 2. According to the abstract, this figure shows that Ensconsin is "maintained at the oocyte cortex by Ninein". However, the figure doesn't seem to prove it - it shows that oocyte enrichment of Ensconsin is partially dependent on Ninein, but this applies to the whole cell and not just to the cell cortex. Furthermore, it is not clear whether Ninein mutation affects microtubule density, which in turn would affect Ensconsin enrichment, and therefore, it is not clear whether the effect of Ninein loss on Ensconsin distribution is direct or indirect.

      Ninein plays a critical role in Ensconsin enrichment and microtubule organization in the oocyte (new Figure 2, Figure 3, Figure S3). Quantification of total Tubulin signal shows no difference between control and Nin mutant oocytes (new Figure S3 panels A, B). In Nin mutants, we found decreased Ens enrichment in the oocyte and reduced Ens localization on MTs and at the cell cortex (Figure 2E, 2F, and Figure S3C and S3D).

      New quantitative analyses of microtubule orientation at the anterior cortex, where MTs are normally preferentially oriented toward the posterior pole (Parton et al. 2011), demonstrate that Nin mutants exhibit randomized MT orientation compared to wild-type oocytes (new Figure 3C-3E). These findings establish that Ninein (although not essential) favors Ensconsin localization on MTs, Ens enrichment in the oocyte, ncMTOC cortical localization, and more robust MT orientation toward the posterior cortex. They also suggest that Ens levels in the oocyte act as a rheostat to control Khc activation.

      The observation that the aggregates formed by overexpressed Ninein accumulate other proteins, including Ensconsin, supports, though does not prove, their interaction. Furthermore, there is absolutely no proof that Ninein aggregates are "ncMTOCs". Unless the authors demonstrate that these aggregates nucleate or anchor microtubules (for example, by detailed imaging of microtubules and EB1 comets), the text and labels in the figure would need to be altered.

      We have modified the manuscript: we now refer to an accumulation of these components in large puncta, rather than aggregates, consistent with previous observations (Rosen et al., 2000). We acknowledge in the revised version that these puncta recruit Shot, Patronin and Ens, without claiming a direct interaction (line 218).

      Importantly, we conducted a more detailed characterization of these Ninein/Shot/Patronin/Ens-containing puncta in a new Figure S4. To rigorously assess their nucleation capacity, we analyzed Eb1-GFP-labeled MT comets, a robust readout of MT nucleation (Parton et al., 2011, Nashchekin et al., 2016). While a few Eb1-positive comets occasionally emanate from these structures, confirming their identity as putative ncMTOCs, these puncta function as surprisingly weak nucleation centers (new Figure S4 E, Video S1), and their presence does not alter overall MT architecture (new Figure S4 F). Moreover, these puncta disappear over time and are barely visible at stage 10B, and they do not impair oocyte development or fertility (Figure S4 G and Table 1).

      Minor comment: Note that a "ratio" (Figure 2C) is just a ratio, and should not be expressed in arbitrary units.

      We have amended this point in all the figures.

      Figure 3B: immunoprecipitation results cannot be interpreted because the immunoprecipitated proteins (GFP, Ens-GFP, Shot-YFP) are not shown. It is also not clear that this biochemical experiment is useful. If the authors would like to suggest that Ensconsin directly binds to Patronin, the interaction would need to be properly mapped at the protein domain level.

      This is a good point: the GFP and Ens-GFP immunoprecipitated proteins are now much more clearly identified on the blots and in the figure legend (new Figure 4G). The Shot-YFP IP was used as a positive control, but Shot-YFP is difficult to detect by Western blot on conventional acrylamide gels because of its large size (>10^6 Da) (Nashchekin et al., 2016).

      We now explicitly state that immunoprecipitations were performed at 4°C, where microtubules are fully depolymerized, thereby excluding indirect microtubule-mediated interactions. We agree with the reviewer: we cannot formally rule out interactions bridged by other protein components. This is stated in the revised manuscript (lines 238-239).

      One of the major phenotypes observed by the authors in Ens mutant is the loss of long microtubules. The authors make strong conclusions about the independence of this phenotype from the parameters of microtubule plus-end growth, but in fact, the quality of their data does not allow such a conclusion, because they only measured the number of EB1 comets and their growth rate but not the catastrophe, rescue or pausing frequency. Note that kinesin-1 has been implicated in promoting microtubule damage and rescue (doi: 10.1016/j.devcel.2021). In the absence of such measurements, one cannot conclude whether short microtubules arise through defects in the minus-end, plus-end or microtubule shaft regulation pathways.

      We thank the reviewer for raising this important point. Our data demonstrate that microtubule (MT) nucleation and polymerization rates remain unaffected under Khc RNAi and ens mutant conditions, indicating that MT dynamics alterations must arise through alternative mechanisms.

      As the reviewer suggested, recent studies on Kinesin activity and MT network regulation are indeed highly relevant. Two key studies from the Verhey and Aumeier laboratories examined Kinesin-1 gain-of-function conditions and revealed that constitutively active Kinesin-1 induces MT lattice damage (Budaitis et al., 2022). While damaged MTs can undergo self-repair, Aumeier and colleagues demonstrated that GTP-tubulin incorporation generates "rescue shafts" that promote MT rescue events (Andreu-Carbo et al., 2022). Extrapolating from these findings, loss of Kinesin-1 activity could plausibly reduce rescue shaft formation, thereby decreasing MT rescue frequency and stability. Although this hypothesis is challenging to test directly in our system, it provides a mechanistic framework for the observed reduction in MT number and stability.

      Additionally, the reviewer highlighted the role of Khc in transporting the dynactin complex, an anti-catastrophe factor, to MT plus ends (Nieuwburg et al., 2017), which could further contribute to MT stabilization. This crucial reference is now incorporated into the revised Discussion.

      Importantly, our work also demonstrates the contribution of Ens/Khc to ncMTOC targeting to the cell cortex. Our new quantitative analyses of MT organization (new Figure 5 B) reveal a defective anteroposterior orientation of cortical MTs in mutant conditions, pointing to a critical role for cortical ncMTOCs in organizing the MT network.

      Taken together, we propose that the observed MT reduction and disorganization result from multiple interconnected mechanisms: (1) reduced rescue shaft formation affecting MT stability; (2) impaired transport of anti-catastrophe factors to MT plus ends; and (3) loss of cortical ncMTOCs, which are essential for minus-end MT stabilization and network organization. The Discussion has been revised to reflect this integrated model in a dedicated paragraph (“A possible regulation of MT dynamics in the oocyte at both plus end minus MT ends by Ens and Khc”, lines 415-432).

      It is important to note that a spectraplakin, like Shot, can potentially affect different pathways, particularly when overexpressed.

      We agree that Shot harbors multiple functional domains and acts as a key organizer of both actin and microtubule cytoskeletons. Overexpression of such a cytoskeletal cross-linker could indeed perturb both networks, making interpretation of Ens phenotype rescue challenging due to potential indirect effects.

      To address this concern, we selected an appropriate Shot isoform for our rescue experiments that displayed similar localization to “endogenous” Shot-YFP (a genomic construct harboring shot regulatory sequences) and importantly that was not overexpressed.

      Elevated expression of the Shot.L(A) isoform (see Western Blot Figure S8 A), considered as the wild-type form with two CH1 and CH2 actin-binding motifs (Lee and Kolodziej, 2002), showed abnormal localization such as strong binding to the microtubules in nurse cells and oocyte confirming the risk of gain-of-function artifacts and inappropriate conclusions (Figure S8 B, arrows).

      By contrast, our rescue experiments using the Shot.L(C) isoform (that only harbors the CH2 motif) provide strong evidence against such artifacts for the following reasons. First, Shot-L(C) is expressed at slightly lower levels than a Shot-YFP genomic construct (not overexpressed), and at much lower levels than Shot-L(A), despite using the same driver (Figure S8 A). Second, Shot-L(C) localization in the oocyte is similar to that of endogenous Shot-YFP, concentrating at the cell cortex (Figure S8 B, compare lower and top panels). Taken together, these controls suggest that our rescue with Shot-L(C) is specific.

      Note that this Shot-L(C) isoform is sufficient to complement the absence of the shot gene in other cell contexts (Lee and Kolodziej, 2002).

      Unjustified conclusions should be removed: the authors do not provide sufficient data to conclude that "ens and Khc oocytes MT organizational defects are caused by decreased ncMTOC cortical anchoring", because the actual cortical microtubule anchoring was not measured.

      This is a valid point. We acknowledge that we did not directly measure microtubule anchoring in this study. In response, we have revised the discussion to more accurately reflect our observations. Throughout the manuscript, we now refer to "cortical microtubule organization" rather than "cortical microtubule anchoring," which better aligns with the data presented.

      Minor comment: Microtubule growth velocity must be expressed in units of length per time, to enable evaluating the quality of the data, and not as a normalized value.

      This is now amended in the revised version (modified Figure S7).

      A significant part of the Discussion is dedicated to the potential role of Ensconsin in cortical microtubule anchoring and potential transport of ncMTOCs by kinesin. It is obviously fine that the authors discuss different theories, but it would be very helpful if the authors would first state what has been directly measured and established by their data, and what are the putative, currently speculative explanations of these data.

      We have carefully considered the reviewer's constructive comments and are confident that this revised version fully addresses their concerns.

      First, we have substantially strengthened the connection between the Results and Discussion sections, ensuring that our interpretations are more directly anchored in the experimental data. This restructuring significantly improves the overall clarity and logical flow of the manuscript.

      Second, we have added a new comprehensive figure presenting a molecular-scale model of Kinesin-1 activation upon release of autoinhibition by Ensconsin (new Figure 7D). Critically, this figure also illustrates our proposed positive feedback loop mechanism: Khc-dependent cytoplasmic advection promotes cortical recruitment of additional ncMTOCs, which generates new cortical microtubules and further accelerates cytoplasmic transport (Figure 7 A-C). This self-amplifying cycle provides a mechanistic framework consistent with emerging evidence that cytoplasmic flows are essential for efficient intracellular transport in both insect and mammalian oocytes.

      Minor comment: The writing and particularly the grammar need to be significantly improved throughout, which should be very easy with current language tools. Examples: "ncMTOCs recruitment" should be "ncMTOC recruitment"; "Vesicles speed" should be "Vesicle speed", "Nin oocytes harbored a WT growth,"- unclear what this means, etc. Many paragraphs are very long and difficult to read. Making shorter paragraphs would make the authors' line of thought more accessible to the reader.

      We have amended and shortened the manuscript in line with the reviewer's feedback. In particular, we have written shorter, more focused paragraphs to make the text easier to read.

      Significance

      This paper represents a significant advance in understanding non-centrosomal microtubule organization in general and in developing Drosophila oocytes in particular by connecting the microtubule minus-end regulation pathway to the Kinesin-1 and Ensconsin/MAP7-dependent transport. The genetics and imaging data are of good quality and are appropriately presented and quantified. These are clear strengths of the study which will make it interesting to researchers studying the cytoskeleton, microtubule-associated proteins and motors, and fly development.

      The weaknesses of this study are due to the lack of clarity of the overall molecular model, which would limit the impact of the study on the field. Some interpretations are not sufficiently supported by data, but this can be solved by more precise and careful writing, without extensive additional experimentation.

      We thank the reviewer for raising these important concerns regarding clarity and data interpretation. We have thoroughly revised the manuscript to address these issues on multiple fronts. First, we have substantially rewritten key sections to ensure that our conclusions are clearly articulated and directly supported by the data. Second, we have performed several new experiments that now allow us to propose a robust mechanistic model, presented in new figures. These additions significantly strengthen the manuscript and directly address the reviewer's concerns.

      My expertise is cell biology and biochemistry of the microtubule cytoskeleton, including both microtubule-associated proteins and microtubule motors.

      Reviewer #2

      Evidence, reproducibility and clarity

      In this manuscript, Berisha et al. investigate how microtubule (MT) organization is spatially regulated during Drosophila oogenesis. The authors identify a mechanism in which the Kinesin-1 activator Ensconsin/MAP7 is transported by dynein and anchored at the oocyte cortex via Ninein, enabling localized activation of Kinesin-1. Disruption of this pathway impairs ncMTOC recruitment and MT anchoring at the cortex. The authors combine genetic manipulation with high-resolution microscopy and use three key readouts to assess MT organization during mid-to-late oogenesis: cortical MT formation, localization of posterior determinants, and ooplasmic streaming. Notably, Kinesin-1, in concert with its activator Ens/MAP7, contributes to organizing the microtubule network it travels along. Overall, the study presents interesting findings, though we have several concerns we would like the authors to address.

      Ensconsin enrichment in the oocyte

      1. Enrichment in the oocyte

      • Ensconsin is a MAP that binds MTs. Given that microtubule density in the oocyte significantly exceeds that in the nurse cells, its enrichment may passively reflect this difference. To assess whether the enrichment is specific, could the authors express a non-Drosophila MAP (e.g., mammalian MAP1B) to determine whether it also preferentially localizes to the oocyte?

      To address this point, we performed a new series of experiments analyzing the enrichment of other Drosophila and non-Drosophila MAPs, including Jupiter-GFP, Eb1-GFP, and bovine Tau-GFP, all widely used markers of the microtubule cytoskeleton in flies (see new Figure S2). Our results reveal that Jupiter-GFP, Eb1-GFP, and bovine Tau-GFP all exhibit significantly weaker enrichment in the oocyte compared to Ens-GFP. Khc-GFP also shows lower enrichment. These findings indicate that MAP enrichment in the oocyte is MAP-dependent, rather than solely reflecting microtubule density or organization. Of note, we cannot exclude that microtubule post-translational modifications contribute to differential MAP binding between nurse cells and the oocyte, but this remains a question for future investigation.

      The ability of ens-wt and ens-LowMT to induce tubulin polymerization according to the light scattering data (Fig. S1J) is minimal and does not reflect dramatic differences in localization. The authors should verify that, in all cases, the polymerization product in their in vitro assays is microtubules rather than other light-scattering aggregates. What is the control in these experiments? If it is just purified tubulin, it should not form polymers at physiological concentrations.

      The critical concentration Cr for microtubule self-assembly in classical BRB80 buffer, as found by us and others, is around 20 µM (see Fig. 2c in Weiss et al., 2010). Here, microtubules were assembled at a tubulin concentration of 40 µM, i.e., well above the Cr. As stated in the Materials and Methods section, we systematically cooled the samples to 4°C after assembly to assess the presence of aggregates, since aggregates do not fall apart upon cooling. The decrease in optical density (OD) upon cooling is a direct control showing that the initial increase in OD is due to the formation of microtubules. Finally, aggregation and polymerization curves differ markedly, the former displaying an exponential shape and the latter a sigmoidal assembly phase (see Fig. 3A and 3B in Weiss et al., 2010).
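      The reasoning above can be illustrated with a minimal sketch. It assumes the simple steady-state approximation in which polymer forms only above the critical concentration, so that polymerized tubulin is roughly the total concentration minus Cr. The function names, the optical density values, and the 80% reversibility threshold are hypothetical and chosen only for illustration; they are not taken from the manuscript.

```python
def polymerized_tubulin_uM(total_uM: float, critical_uM: float) -> float:
    """Approximate polymer mass at steady state (assumption: polymer = total - Cr)."""
    return max(0.0, total_uM - critical_uM)


def is_cold_reversible(od_plateau: float, od_after_cooling: float,
                       od_baseline: float, min_fraction_lost: float = 0.8) -> bool:
    """Return True if most of the assembly signal disappears upon cooling.

    Microtubules depolymerize in the cold, whereas aggregates persist, so a
    large drop in optical density after cooling argues for genuine polymer.
    The 0.8 threshold is an illustrative assumption, not a published cutoff.
    """
    signal = od_plateau - od_baseline
    if signal <= 0:
        return False  # no assembly signal to evaluate
    fraction_lost = (od_plateau - od_after_cooling) / signal
    return fraction_lost >= min_fraction_lost


# Concentrations quoted in the response: 40 µM tubulin, Cr of about 20 µM.
print(polymerized_tubulin_uM(40.0, 20.0))
# Hypothetical OD readings: plateau, after cooling to 4°C, and baseline.
print(is_cold_reversible(0.30, 0.05, 0.02))
```

      Under this approximation, assembly at 40 µM leaves about 20 µM of tubulin in polymer, while a sample below Cr would give none; the second function simply encodes the cooling control as a pass/fail check.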

      Photoconversion caveats

      MAPs are known to dynamically associate and dissociate from microtubules. Therefore, interpretation of the Ens photoconversion data should be made with caution. The expanding red signal from the nurse cells to the oocyte may reflect any combination of dynein-mediated MT transport and passive diffusion of unbound Ensconsin. Notably, photoconversion of a soluble protein in the nurse cells would also result in a gradual increase in red signal in the oocyte, independent of active transport. We encourage the authors to more thoroughly discuss these caveats. It may also help to present the green and red channels side by side rather than as merged images, to allow readers to better assess signal movement and spatial patterns.

      This is a valid point that mirrors the comment of Reviewers 1 and 3. The directional movement of microtubules traveling at ~140 nm/s from nurse cells toward the oocyte via the ring canals was previously reported by Lu et al. (2022) with excellent spatial resolution. Notably, this MT transport was measured using a fusion protein containing the Ens MT-binding domain. We now cite this relevant study in our revised manuscript and have removed this redundant panel in Figure 1.

      Reduction of Shot at the anterior cortex

      • Shot is known to bind strongly to F-actin, and in the Drosophila ovary, its localization typically correlates more closely with F-actin structures than with microtubules, despite being an MT-actin crosslinker. Therefore, the observed reduction of cortical Shot in ens, nin mutants, and Khc-RNAi oocytes is unexpected. It would be important to determine whether cortical F-actin is also disrupted in these conditions, which should be straightforward to assess via phalloidin staining.

      As requested by the reviewer, we performed actin staining experiments, which are now presented in a new Figure S5. These data demonstrate that the cortical actin network remains intact in all mutant backgrounds analyzed, ruling out any indirect effect of actin cytoskeleton disruption on the observed phenotypes.

      MTs are barely visible in Fig. 3A, which is meant to demonstrate Ens-GFP colocalization with tubulin. Higher-quality images are needed.

      The revised version now provides significantly improved images to show the different components examined. Our data show that Ens and Ninein localize at the cell cortex where they co-localize with Shot and Patronin (Figure 2 A-C). In addition, novel images show that Ens extends along microtubules (new Figure 4 A).

      MT gradient in stage 9 oocytes

      In ens-/-, nin-/-, and Khc-RNAi oocytes, is there any global defect in the stage 9 microtubule gradient? This information would help clarify the extent to which cortical localization defects reflect broader disruptions in microtubule polarity.

      We now provide quantitative analysis of microtubule (MT) array organization in novel figures (Figure 3D and Figure 5B). Our data reveal that both Khc RNAi and ens mutant oocytes exhibit severe disruption of MT orientation toward the posterior (new Figure 5B). Importantly, this defect is significantly less pronounced in Nin-/- oocytes, which retain residual ncMTOCs at the cortex (new Figure 3D). This differential phenotype supports our model that cortical ncMTOCs are critical for maintaining proper MT orientation toward the posterior side of the oocyte.

      Role of Ninein in cortical anchoring

      The requirement for Ninein in cortical anchorage is the least convincing aspect of the manuscript and somewhat disrupts the narrative flow. First, it is unclear whether Ninein exhibits the same oocyte-enriched localization pattern as Ensconsin. Is Ninein detectable in nurse cells? Second, the Ninein antibody signal appears concentrated in a small area of the anterior-lateral oocyte cortex (Fig. 2A), yet Ninein loss leads to reduced Shot signal along a much larger portion of the anterior cortex (Fig. 2F)-a spatial mismatch that weakens the proposed functional relationship. Third, Ninein overexpression results in cortical aggregates that co-localize with Shot, Patronin, and Ensconsin. Are these aggregates functional ncMTOCs? Do microtubules emanate from these foci?

      We now provide a more comprehensive analysis of Ninein localization. Similar to Ensconsin (Ens), endogenous Ninein is enriched in the oocyte during the early stages of oocyte development but is also detected in NCs (see modified Figure 2 A and Lasko et al., 2016). Improved imaging of Ninein further shows that the protein partially co-localizes with Ens, and ncMTOCs at the anterior cortex and with Ens-bound MTs (Figure 2B, 2C).

      Importantly, loss of Ninein (Nin) only partially reduces the enrichment of Ens in the oocyte (Figure 2E). Both Ens and Kinesin heavy chain (Khc) remain partially functional and continue to target non-centrosomal microtubule-organizing centers (ncMTOCs) to the cortex (Figure 3A). In Nin-/- mutants, a subset of long cortical microtubules (MTs) is present, thereby generating cytoplasmic streaming, although less efficiently than under wild-type (WT) conditions (Figure 3F and 3G). Because Nin is a non-essential gene, we envisage Ninein as a facilitator of MT organization during oocyte development.

      Finally, our new analyses demonstrate that the large puncta containing Ninein, Shot, Patronin, and Ens, despite their size, appear to be relatively weak nucleation centers (revised Figure S4 E and Video 1). In addition, their presence neither biases overall MT architecture (Figure S4 F) nor impairs oocyte development and fertility (Figure S4 G and Table 1).

      Inconsistency of Khc^MutEns rescue

      The Khc^MutEns variant partially rescues cortical MT formation and restores a slow but measurable cytoplasmic flow, yet it fails to rescue Staufen localization (Fig. 5). This raises questions about the consistency and completeness of the rescue. Could the authors clarify this discrepancy or propose a mechanistic rationale?

      This is a good point. The cytoplasmic flows (the consequence of cargo transport by Khc on MTs) generated by a constitutively active KhcMutEns in an ens mutant condition, are less efficient than those driven by Khc activated by Ens in a control condition (Figure 6C). The rescued flow is probably not efficient enough to completely rescue the Staufen localization at stage 10.

      Additionally, this KhcMutEns variant rescues the viability of embryos from Khc27 mutant germline clone oocytes but not from ens mutants (Table 1). One hypothesis is that Ens harbors additional functions beyond Khc activation.

      This incomplete rescue of Ens by an active Khc variant could also be the consequence of the “paradox of co-dependence”: Kinesin-1 also transports the antagonizing motor Dynein, which promotes cargo transport in the opposite direction (Hancock et al., 2016). The phenotype of a gain-of-function variant is therefore complex to interpret. Consistent with this, both KhcMutEns-GFP and KhcDhinge2, two active Khc variants, only partially rescue centrosome transport in ens mutant Neural Stem Cells (Figure S10).

      Minor points: 1. The pUbi-attB-Khc-GFP vector was used to generate the Khc^MutEns transgenic line, presumably under control of the ubiquitous ubi promoter. Could the authors specify which attP landing site was used? Additionally, are the transgenic flies viable and fertile, given that Kinesin-1 is hyperactive in this construct?

      All transgenic constructs were integrated at defined genomic landing sites to ensure controlled expression levels. Specifically, both GFP-tagged KhcWT and KhcMutEns were inserted at the VK05 (attP9A) site using PhiC31-mediated integration. Full details of the landing sites are provided in the Materials and Methods section. Both transgenic flies are homozygous lethal and the transgenes are maintained over TM6B balancers.

      On page 11 (Discussion, section titled "A dual Ensconsin oocyte enrichment mechanism achieves spatial relief of Khc inhibition"), the statement "many mutations in Kif5A are causal of human diseases" would benefit from a brief clarification. Since not all readers may be familiar with kinesin gene nomenclature, please indicate that KIF5A is one of the three human homologs of Kinesin heavy chain.

      We clarified this point in the revised version (lines 465-466).

      On page 16 (Materials and Methods, "Immunofluorescence in fly ovaries"), the sentence "Ovaries were mounted on a slide with ProlonGold medium with DAPI (Invitrogen)" should be corrected to "ProLong Gold."

      This is corrected.

      Significance

      This study shows that enrichment of MAP7/ensconsin in the oocyte is the mechanism of kinesin-1 activation there and is important for cytoplasmic streaming and localization of non-centrosomal microtubule-organizing centers to the oocyte cortex.

      We thank the reviewers for the accurate review of our manuscript and their positive feedback.

      Reviewer #3

      Evidence, reproducibility and clarity

      The manuscript of Berisha et al. investigates the role of Ensconsin (Ens), Kinesin-1 and Ninein in organisation of microtubules (MT) in Drosophila oocyte. At stage 9 oocytes Kinesin-1 transports oskar mRNA, a posterior determinant, along MT that are organised by ncMTOCs. At stage 10b, Kinesin-1 induces cytoplasmic advection to mix the contents of the oocyte. Ensconsin/Map7 is a MT associated protein (MAP) that uses its MT-binding domain (MBD) and kinesin binding domain (KBD) to recruit Kinesin-1 to the microtubules and to stimulate the motility of MT-bound Kinesin-1. Using various new Ens transgenes, the authors demonstrate the requirement of Ens MBD and Ninein in Ens localisation to the oocyte where Ens activates Kinesin-1 using its KBD. The authors also claim that Ens, Kinesin-1 and Ninein are required for the accumulation of ncMTOCs at the oocyte cortex and argue that the detachment of the ncMTOCs from the cortex accounts for the reduced localisation of oskar mRNA at stage 9 and the lack of cytoplasmic streaming at stage 10b. Although the manuscript contains several interesting observations, the authors' conclusions are not sufficiently supported by their data. The structure function analysis of Ensconsin (Ens) is potentially publishable, but the conclusions on ncMTOC anchoring and cytoplasmic streaming are not convincing.

      We are grateful that the regulation of Khc activity by MAP7 was well received by all reviewers. While our study focuses on Drosophila oogenesis, we believe this mechanism may have broader implications for understanding kinesin regulation across biological systems.

      For the novel function of the MAP7/Khc complex in organizing its own microtubule networks through ncMTOC recruitment, we have carefully considered the reviewers' constructive recommendations. We now provide additional experimental evidence supporting a model of flux self-amplification in which ncMTOC recruitment plays a key role. It is well established that cytoplasmic flows are essential for posterior localization of cell fate determinants at stage 10B. Slow flows have also been described at earlier oogenesis stages by the groups of Saxton and St Johnston. Building on these early publications and our new experiments, we propose that these flows are essential to promote a positive feedback loop that reinforces ncMTOC recruitment and MT organization (Figure 7).

      1) The main conclusion of the manuscript is that "MT advection failure in Khc and ens in late oogenesis stems from defective cortical ncMTOCs recruitment". This completely overlooks the abundant evidence that Kinesin-1 directly drives cytoplasmic streaming by transporting vesicles and microtubules along microtubules, which then move the cytoplasm by advection (Palacios et al., 2002; Serbus et al, 2005; Lu et al, 2016). Since Kinesin-1 generates the flows, one cannot conclude that the effect of khc and ens mutants on cortical ncMTOC positioning has any direct effect on these flows, which do not occur in these mutants.

      We regret the lack of clarity of the first version of the manuscript and some missing references. We propose a model in which the Kinesin-1-dependent slow flows (described by Serbus/Saxton and Palacios/St Johnston) play a central role in amplifying ncMTOC anchoring and cortical MT network formation (see model in the new Figure 7).

      2) The authors claim that streaming phenotypes of ens and khc mutants are due to a decrease in microtubule length caused by the defective localisation of ncMTOCs. In addition to the problem raised above, I am not convinced that they can make accurate measurements of microtubule length from confocal images like those shown in Figure 4. Firstly, they are measuring the length of bundles of microtubules and cannot resolve individual microtubules. This problem is compounded by the fact that the microtubules do not align into parallel bundles in the mutants. This will make the "microtubules" appear shorter in the mutants. In addition, the alignment of the microtubules in wild-type allows one to choose images in which the microtubules lie in the imaging plane, whereas the more disorganized arrangement of the microtubules in the mutants means that most microtubules will cross the imaging plane, which precludes accurate measurements of their length.

      As mentioned by Reviewer 4, we have been transparent about the methodology, and its limitations were fully described in the Materials and Methods section.

      Cortical microtubules in oocytes are highly dynamic and move rapidly, making it technically impossible to capture their entire length using standard Z-stack acquisitions. We therefore adopted a compromise approach: measuring microtubules within a single focal plane positioned just below the oocyte cortex. This strategy is consistent with established methods in the field, such as those used by Parton et al. (2011) to track microtubule plus-end directionality. To avoid overinterpretation, we explicitly refer to these measurements as "minimum detectable MT length," acknowledging that microtubules may extend beyond the focal plane, particularly at stage 10, where long, tortuous bundles frequently exit the plane of focus. These methodological considerations and potential biases are clearly described in the Materials and Methods section and the text now mentions the possible disorganization of the MT network in the mutant conditions (lines 272-273).

      In this revised version, we now provide complementary analyses of MT network organization. Beyond length measurements (and the mentioned limitations), we also quantified microtubule network orientation at stage 9, assessing whether cortical microtubules are preferentially oriented toward the posterior axis as observed in controls (revised Figure 3D and Figure 5B). While this analysis is subject to the same technical limitations, it reveals a clear biological difference: microtubules exhibit posterior-biased orientation in control oocytes, similar to a previous study (Parton et al., 2011), but adopt a randomized orientation in Nin-/-, ens, and Khc RNAi-depleted oocytes (revised Figure 3D and Figure 5B).

      Taken together, these complementary approaches, despite their technical constraints, provide convergent evidence for the role of the Khc/Ens complex in organizing cortical microtubule networks during oogenesis.
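
      As an illustration only, a posterior bias of the kind described above can be summarised with simple circular statistics. The sketch below is not the authors' analysis pipeline: it assumes each microtubule has already been reduced to a directed angle in degrees, with 0 pointing at the posterior pole, and the function name and example angles are hypothetical. Undirected bundles traced from still images would need a different treatment (for example, angle doubling).

```python
import math


def orientation_summary(angles_deg):
    """Summarise directed orientations (degrees, 0 = toward the posterior pole).

    Hypothetical helper for illustration; not taken from the manuscript.
    """
    n = len(angles_deg)
    if n == 0:
        raise ValueError("no angles supplied")
    # Mean cosine and sine components of the unit vectors.
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return {
        # Circular mean direction, in degrees.
        "mean_direction_deg": math.degrees(math.atan2(s, c)),
        # Mean resultant length: near 1 = aligned, near 0 = no common direction.
        "resultant_length": math.hypot(c, s),
        # Fraction of vectors with a positive posterior component.
        "posterior_fraction": sum(
            1 for a in angles_deg if math.cos(math.radians(a)) > 0
        ) / n,
    }


# Made-up angles: a posterior-biased set and an evenly spread set.
biased = orientation_summary([-20, -5, 0, 10, 25])
spread = orientation_summary([0, 120, 240])
```

      In this toy input, the biased set gives a resultant length close to 1, whereas the evenly spread set gives a value close to 0, which is the kind of contrast a control-versus-mutant comparison would look for.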

      3) "To investigate whether the presence of these short microtubules in ens and Khc RNAi oocytes is due to defects in microtubule anchoring or is also associated with a decrease in microtubule polymerization at their plus ends, we quantified the velocity and number of EB1comets, which label growing microtubule plus ends (Figure S3)." I do not understand how the anchoring or not of microtubule minus ends to the cortex determines how far their plus ends grow, and these measurements fall short of showing that plus end growth is unaffected. It has already been shown that the Kinesin-1-dependent transport of Dynactin to growing microtubule plus ends increases the length of microtubules in the oocyte because Dynactin acts as an anti-catastrophe factor at the plus ends. Thus, khc mutants should have shorter microtubules independently of any effects on ncMTOC anchoring. The measurements of EB1 comet speed and frequency in FigS2 will not detect this change and are not relevant for their claims about microtubule length. Furthermore, the authors measured EB1 comets at stage 9 (where they did not observe short MT) rather than at stage 10b. The authors' argument would be better supported if they performed the measurements at stage 10b.

      We thank the reviewer for raising this important point. The short microtubule (MT) length observed at stage 10B could indeed result from limited plus-end growth. Unfortunately, we were unable to test this hypothesis directly: strong endogenous yolk autofluorescence at this stage prevented reliable detection of Eb1-GFP comets, precluding velocity measurements.

      At least during stage 9, our data demonstrate that MT nucleation and polymerization rates are not reduced in either Khc RNAi or ens mutant conditions, indicating that the observed MT alterations must arise through alternative mechanisms.

      In the discussion, we propose the following interconnected explanations, supported by recent literature and the reviewers’ suggestions:

      1- Reduced MT rescue events. Two seminal studies from the Verhey and Aumeier laboratories have shown that constitutively active Kinesin-1 induces MT lattice damage (Budaitis et al., 2022), which can be repaired through GTP-tubulin incorporation into "rescue shafts" that promote MT rescue (Andreu-Carbo et al., 2022). Extrapolating from these findings, loss of Kinesin-1 activity could plausibly reduce rescue shaft formation, thereby decreasing MT stability. While challenging to test directly in our system, this mechanism provides a plausible framework for the observed phenotype.

      2- Impaired transport of stabilizing factors. As the reviewer astutely points out, Khc transports the dynactin complex, an anti-catastrophe factor, to MT plus ends (Nieuwburg et al., 2017). Loss of this transport could further compromise MT plus end stability. We now discuss this important mechanism in the revised manuscript.

      3- Loss of cortical ncMTOCs. Critically, our new quantitative analyses (revised Figure 3 and Figure 5) also reveal defective anteroposterior orientation of cortical MTs in mutant conditions. These experiments suggest that Ens/Khc-mediated localization of ncMTOCs to the cortex is essential for proper MT network organization, and possibly minus-end stabilization as suggested in several studies (Feng et al., 2019; Goodwin and Vale, 2011; Nashchekin et al., 2016).

      Altogether, we now propose an integrated model in which MT reduction and disorganization may result from multiple complementary mechanisms operating downstream of Kinesin-1/Ensconsin loss. While some aspects remain difficult to test directly in our in vivo system, the convergence of our data with recent mechanistic studies provides an interesting conceptual framework. The Discussion has been revised to reflect this comprehensive view in a dedicated paragraph (“A possible regulation of MT dynamics in the oocyte at both plus end minus MT ends by Ens and Khc”, lines 415-432).

      4) The Shot overexpression experiments presented in Fig.3 E-F, Fig.4D and TableS1 are very confusing. Originally, the authors used Shot-GFP overexpression at stage 9 to show that there is a decrease of ncMTOCs at the cortex in ens mutants (Fig.3 E-F) and speculated that this caused the defects in MT length and cytoplasmic advection at stage 10B. However the authors later state on page 8 that : "Shot overexpression (Shot OE) was sufficient to rescue the presence of long cortical MTs and ooplasmic advection in most ens oocytes (9/14), resembling the patterns observed in controls (Figures 4B right panel and 4D). Moreover, while ens females were fully sterile, overexpression of Shot was sufficient to restore that loss of fertility (Table S1)". Is this the same UAS Shot-GFP and VP16 Gal4 used in both experiments? If so, this contradiction puts the authors' conclusions in question.

      This is an important point that requires clarification regarding our experimental design.

      The Shot-YFP construct is a genomic insertion on chromosome 3. The ens mutation is also located on chromosome 3 and we were unable to recombine this transgene with the ens mutant for live quantification of cortical Shot. To circumvent this technical limitation, we used a UAS-Shot.L(C)-GFP transgenic construct driven by a maternal driver, expressed in both wild-type (control) and ens mutant oocytes. We validated that the expression level and subcellular localization of UAS-Shot.L(C)-GFP were comparable to those of the genomic Shot-YFP (new Figure S8 A and B).

      From these experiments, we drew two key conclusions. First, cortical Shot.L(C)-GFP is less abundant in ens mutant oocytes compared to wild-type (the quantification has been removed from this version). Second, despite this reduced cortical accumulation, Shot.L(C)-GFP expression partially rescues ooplasmic flows and microtubule streaming in stage 10B ens mutant oocytes, and restores fertility to ens mutant females.

      5) The authors based their conclusions about the involvement of Ens, Kinesin-1 and Ninein in ncMTOC anchoring on the decrease in cortical fluorescence intensity of Shot-YFP and Patronin-YFP in the corresponding mutant backgrounds. However, there is a large variation in average Shot-YFP intensity between control oocytes in different experiments. In Fig. 2F-G the average level of Shot-YFP in the control is 130 AU while in Fig.3 G-H it is only 55 AU. This makes me worry about the reliability of such measurements and the conclusions drawn from them.

      To clarify this point, we have harmonized the method used to quantify the Shot-YFP signals in Figure 4E with the methodology used in Figure 3B, based on the original images. The levels are not strictly identical (Control Figure 2 B: 132.7 ± 36.2 versus Control Figure 4 E: 164.0 ± 37.7). Such differences are common when experiments are performed several months apart and by different users.

      6) The decrease in the intensity of Shot-YFP and Patronin-YFP cortical fluorescence in ens mutant oocytes could be because of problems with ncMTOC anchoring or with ncMTOC formation. The authors should find a way to distinguish between these two possibilities. The authors could express Ens-Mut (described in Sung et al 2008), which localises at the oocyte posterior and test whether it recruits Shot/Patronin ncMTOCs to the posterior.

      We tried to obtain the fly stocks described in the 2008 paper by contacting former members of Pernille Rørth's laboratory. Unfortunately, we learned that the lab no longer exists and that all reagents, including the requested stocks, were either discarded or lost over time. To our knowledge, these materials are no longer available from any source. We regret that this limitation prevented us from performing the straightforward experiments suggested by the reviewer using these specific tools.

      7) According to the Materials and Methods, the Shot-GFP used in Fig.3 E-F and Fig.4 was the BDSC line 29042. This is Shot L(C), a full-length version of Shot missing the CH1 actin-binding domain that is crucial for Shot anchoring to the cortex. If the authors indeed used this version of Shot-GFP, the interpretation of the above experiments is very difficult.

      The Shot.L(C) isoform lacks the CH1 domain but retains the CH2 actin-binding motif. Truncated proteins containing this domain and fused to GST retain a weak ability to bind actin in vitro. Importantly, the function of this isoform is context-dependent: it cannot rescue shot loss-of-function in neuron morphogenesis but fully restores Shot-dependent tracheal cell remodeling (Lee and Kolodziej, 2002).

      In our experiments, when the Shot.L(C) isoform was expressed under the control of a maternal driver, its localization to the oocyte cortex was comparable to that of the genomic Shot-YFP construct (new Figure S8). This demonstrates unambiguously that the CH1 domain is dispensable for Shot cortical localization in oocytes, and that CH2-mediated actin binding is sufficient for this localization. Of note, a recent study showed that actin networks are not equivalent, highlighting the need for specific Shot isoforms harboring specialized actin-binding domains (Nashchekin et al., 2024).

      We note that the expression level of Shot.L(C)-GFP in the oocyte appeared slightly lower than that of Shot-YFP (expressed under endogenous Shot regulatory sequences), as assessed by Western blot (Figure S8 A).

      Critically, Shot.L(C)-GFP expression was substantially lower than that of Shot.L(A)-GFP (which harbors both the CH1 and CH2 domains). Shot.L(A)-GFP was overexpressed (Figure 8 A) and ectopically localized on MTs in both nurse cells and the ooplasm (Figure S8 B middle panel and arrow). These observations indicate that the Shot.L(C)-GFP rescue experiment was performed at near-physiological expression levels, strengthening the validity of our conclusions.

      8) Page 6 "converted in NCs, in a region adjacent to the ring canals, Dendra-Ens-labeled MTs were found in the oocyte compartment indicating they are able to travel from NC toward the oocyte through ring canals". I have difficulty seeing the translocation of MT through the ring canals. Perhaps it would be more obvious with a movie/picture showing only one channel. Considering that Dendra-Ens appears in the oocyte much faster than MT transport through ring canals (140nm/s, Lu et al 2022), the authors are most probably observing the translocation of free Ens rather than Ens bound to MT. The authors should also mention that Ens movement from the NC to the oocyte has been shown before with Ens MBD in Lu et al 2022 with better resolution.

      We fully agree with the caveat mentioned by this reviewer: we may be observing the translocation of free Dendra-Ensconsin. The movement of MTs that travel at ~140 nm/s from nurse cells toward the oocyte through the ring canals was reported before by Lu et al. (2022) with very good resolution. Notably, this directed movement of MTs was measured using a fusion protein encompassing the Ens MT-binding domain. We therefore decided to remove this inconclusive experiment and refer instead to this relevant study from the Gelfand lab.

      9) Page 6: The co-localization of Ninein with Ens and Shot at the oocyte cortex (Figure 2A). I have difficulty seeing this co-localisation. Perhaps it would be more obvious in merged images of only two channels and with higher resolution images

      10) "a pool of the Ens-GFP co-localized with Ch-Patronin at cortical ncMTOCs at the anterior cortex (Figure 3A)". I also have difficulty seeing this.

      We have performed new high-resolution acquisitions that provide clearer and more convincing evidence for the cortical distribution of these proteins (revised Figure 2A-2C and Figure 4A). These improved images demonstrate that Ens, Ninein, Shot, and Patronin partially colocalize at cortical ncMTOCs, as initially proposed. Importantly, the new data also reveal a spatial distinction: while Ens localizes along microtubules extending from these cortical sites, Ninein appears confined to small cytoplasmic puncta adjacent to, but also present on, cortical microtubules.

      11) "Ninein co-localizes with Ens at the oocyte cortex and partially along cortical microtubules, contributing to the maintenance of high Ens protein levels in the oocyte and its proper cortical targeting". I could not find any data showing the involvement of Ninein in the cortical targeting of Ens.

      We found decreased Ens localization to MTs and to the cell cortex region (new Figure S3 A-B).

      12) "our MT network analyses reveal the presence of numerous short MTs cytoplasmic clustered in an anterior pattern." "This low cortical recruitment of ncMTOCs is consistent with poor MT anchoring and their cytoplasmic accumulation." I could not find any data showing that short cortical MT observed at stage 10b in ens mutant and Khc RNAi were cytoplasmic and poorly anchored.

      The sentence was removed from the revised manuscript.

      13) "The egg chamber consists of interconnected cells where Dynein and Khc activities are spatially separated. Dynein facilitates transport from NCs to the oocyte, while Khc mediates both transport and advection within the oocyte." Dynein is involved in various activities in the oocyte. It anchors the oocyte nucleus and transports bcd and grk mRNA to mention a few.

      The text was amended to reflect Dynein involvement in transport activities in the oocyte, with the appropriate references (lines 105-107).

      14) The cartoons in Fig.2H and 3I exaggerate the effect of Ninein and Ens on cortical ncMTOCs. According to the corresponding graphs, there is a 20 and 50% decrease in each case.

      The new cartoons (now revised Figures 3E and 4F) have been amended to reflect the ncMTOC values as well as MT orientation (Figure 3E).

      Significance

      Given the important concerns raised, the significance of the findings is difficult to assess at this stage.

      We sincerely thank the reviewer for their thorough evaluation of our manuscript. We have carefully addressed their concerns through substantial new experiments and analyses. We hope that the revised manuscript, in its current form, now provides the clarifications and additional evidence requested, and that our responses demonstrate the significance of our findings.

      Reviewer #4 (Evidence, reproducibility and clarity (Required)):

      Summary: This manuscript presents an investigation into the molecular mechanisms governing spatial activation of Kinesin-1 motor protein during Drosophila oogenesis, revealing a regulatory network that controls microtubule organization and cytoplasmic transport. The authors demonstrate that Ensconsin, a MAP7 family protein and Kinesin-1 activator, is spatially enriched in the oocyte through a dual mechanism involving Dynein-mediated transport from nurse cells and cortical maintenance by Ninein. This spatial enrichment of Ens is crucial for locally relieving Kinesin-1 auto-inhibition. The Ens/Khc complex promotes cortical recruitment of non-centrosomal microtubule organizing centers (ncMTOCs), which are essential for anchoring microtubules at the cortex, enabling the formation of long, parallel microtubule streams or "twisters" that drive cytoplasmic advection during late oogenesis. This work establishes a paradigm where motor protein activation is spatially controlled through targeted localization of regulatory cofactors, with the activated motor then participating in building its own transport infrastructure through ncMTOC recruitment and microtubule network organization.

      There's a lot to like about this paper! The data are generally lovely and nicely presented. The authors also use a combination of experimental approaches, combining genetics, live and fixed imaging, and protein biochemistry.

      We thank the reviewer for this enthusiastic and supportive review, which helped us further strengthen the manuscript.

      Concerns: Page 6: "to assay if elevation of Ninein levels was able to mis-regulate Ens localization, we overexpressed a tagged Ninein-RFP protein in the oocyte. At stage 9 the overexpressed Ninein accumulated at the anterior cortex of the oocyte and also generated large cortical aggregates able to recruit high levels of Ens (Figures 2D and 2H)... The examination of Ninein/Ens cortical aggregates obtained after Ninein overexpression showed that these aggregates were also able to recruit high levels of Patronin and Shot (Figures 2E and 2H)." Firstly, I'm not crazy about the use of "overexpressed" here, since there isn't normally any Ninein-RFP in the oocyte. In these experiments it has been therefore expressed, not overexpressed. Secondly, I don't understand what the reader is supposed to make of these data. Expression of a protein carrying a large fluorescent tag leads to large aggregates (they don't look cortical to me) that include multiple proteins - in fact, all the proteins examined. I don't understand this to be evidence of anything in particular, except that Ninein-RFP causes the accumulation of big multi-protein aggregates. While I can understand what the authors were trying to do here, I think that these data are inconclusive and should be de-emphasized.

      We have revised the manuscript by replacing "overexpressed" with "expressed" (lines 211 and 212). In addition, we now provide new localization data in both cortical (new Figure S4 A, top) and medial focal planes (new Figure S4 A, bottom), demonstrating that Ninein puncta (the word used in Rosen et al, 2019), rather than aggregates, are located cortically. We also show that live IRP-labelled MTs do not colocalize with Ninein-RFP puncta. In light of the new experiments and the comments from the other reviewers, the corresponding text has been revised and de-emphasized accordingly.

      Page 7: "Co-immunoprecipitations experiments revealed that Patronin was associated with Shot-YFP, as shown previously (Nashchekin et al., 2016), but also with EnsWT-GFP, indicating that Ens, Shot and Patronin are present in the same complex (Figure 3B)." I do not agree that association between Ens-GFP and Patronin indicates that Ens is in the same complex as Shot and Patronin. It is also very possible that there are two (or more) distinct protein complexes. This conclusion could therefore be softened. Instead of "indicating" I suggest "suggesting the possibility."

      We have toned down this conclusion and now write “suggesting the possibility” (lines 238-239).

      Page 7: "During stage 9, the average subcortical MT length, taken at one focal plane in live oocytes (see methods)..." I appreciate that the authors have been careful to describe how they measured MT length, as this is a major point for interpretation. I think the reader would benefit from an explanation of why they decided to measure in only one focal plane and how that decision could impact the results.

      We appreciate this helpful suggestion. Cortical microtubules are indeed highly dynamic and extend in multiple directions, including along the Z-axis. Moreover, their diameter is extremely small (approximately 25 nm), making it technically challenging to accurately measure their full length (over several microns) at high resolution using our Zeiss Airyscan confocal microscope: the acquisition of Z-stacks is relatively slow and therefore not well suited to capturing the rapid dynamics of these microtubules. Consequently, our length measurements represent a compromise and most likely underestimate the actual lengths of microtubules growing outside the focal plane. We note that other groups have encountered similar technical limitations (Parton et al., 2011).

      Page 7: "... the MTs exhibited an orthogonal orientation relative to the anterior cortex (Figures 4A left panels, 4C and 4E)." This phenotype might not be obvious to readers. Can it be quantified?

      We have now analyzed the orientation of microtubules (MTs) along the dorso-ventral axis. Our analysis shows that ens, Khc RNAi oocytes (new Figure 5B), and, to a lesser extent, Nin mutant oocytes (new Figure 3D), display a more random MT orientation compared to wild-type (WT) oocytes. In WT oocytes, MTs are predominantly oriented toward the posterior pole, consistent with previous findings (Parton et al., 2011).

      Page 8: "Altogether, the analyses of Ens and Khc defective oocytes suggested that MT organization defects during late oogenesis (stage 10B) were caused by an initial failure of ncMTOCs to reach the cell cortex. Therefore, we hypothesized that overexpression of the ncMTOC component Shot could restore certain aspects of microtubule cortical organization in ens-deficient oocytes. Indeed, Shot overexpression (Shot OE) was sufficient to rescue the presence of long cortical MTs and ooplasmic advection in most ens oocytes (9/14)..." The data are clear, but the explanation is not. Can the authors please explain why adding in more of an ncMTOC component (Shot) rescues a defect of ncMTOC cortical localization?

      We propose that cytoplasmic ncMTOCs can bind the cell cortex via the Shot subunit, which is so far the only component known to harbor actin-binding motifs. We therefore propose that elevating cytoplasmic Shot levels increases the probability that Shot encounters the cortex by diffusion when flows are absent. This is now explained in lines 282-285.

      I'm grateful to the authors for their inclusion of helpful diagrams, as in Figures 1G and 2H. I think the manuscript might benefit from one more of these at the end, illustrating the ultimate model.

      We have carefully considered and followed the reviewer’s suggestions. In response, we have included a new figure illustrating our proposed model: the recruitment of ncMTOCs to the cell cortex through low Khc-mediated flows at stage 9 enhances cortical microtubule density, which in turn promotes self-amplifying flows (new Figure 7, panels A to C). Note that this Figure also depicts activation of Khc by loss of auto-inhibition (Figure 7, panel D).

      I'm sorry to say that the language could use quite a bit of polishing. There are missing and extraneous commas. There is also regular confusion between the use of plural and singular nouns. Some early instances include:

      1. Page 3: thought instead of "thoughted."
      2. Page 5: "A previous studies have revealed"
      3. Page 5: "A significantly loss"
      4. Page 6: "troughs ring canals" should be "through ring canals"
      5. Page 7: lives stage 9 oocytes
      6. Page 7: As ens and Khc RNAi oocytes exhibits
      7. Page 7: we examined in details
      8. Page 7: This average MT length was similar in Khc RNAi and ens mutant oocyte..

      We apologize for these errors and have made the appropriate corrections to the manuscript.

      Reviewer #4 (Significance (Required)):

      This work makes a nice conceptual advance by showing that motor activation controls its own transport infrastructure, a paradigm that could extend to other systems requiring spatially regulated transport.

      We thank the reviewers for their evaluation of the manuscript and helpful comments.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This paper presents two experiments, both of which use a target detection paradigm to investigate the speed of statistical learning. The first experiment is a replication of Batterink, 2017, in which participants are presented with streams of uniform-length, trisyllabic nonsense words and asked to detect a target syllable. The results replicate previous findings, showing that learning (in the form of response time facilitation to later-occurring syllables within a nonsense word) occurs after a single exposure to a word. In the second experiment, participants are presented with streams of variable-length nonsense words (two trisyllabic words and two disyllabic words) and perform the same task. A similar facilitation effect was observed as in Experiment 1. The authors interpret these findings as evidence that target detection requires mechanisms different from segmentation. They present results of a computational model to simulate results from the target detection task and find that an "anticipation mechanism" can produce facilitation effects, without performing segmentation. The authors conclude that the mechanisms involved in the target detection task are different from those involved in the word segmentation task.

      Strengths:

      The paper presents multiple experiments that provide internal replication of a key experimental finding, in which response times are facilitated after a single exposure to an embedded pseudoword. Both experimental data and results from a computational model are presented, providing converging approaches for understanding and interpreting the main results. The data are analyzed very thoroughly using mixed effects models with multiple explanatory factors.

      Weaknesses:

      In my view, the main weaknesses of this study relate to the theoretical interpretation of the results.

      (1) The key conclusion from these findings is that the facilitation effect observed in the target detection paradigm is driven by a different mechanism (or mechanisms) than those involved in word segmentation. The argument here I think is somewhat unclear and weak, for several reasons:

      First, there appears to be some blurring in what exactly is meant by the term "segmentation" with some confusion between segmentation as a concept and segmentation as a paradigm.

      Conceptually, segmentation refers to the segmenting of continuous speech into words. However, this conceptual understanding of segmentation (as a theoretical mechanism) is not necessarily what is directly measured by "traditional" studies of statistical learning, which typically (at least in adults) involve exposure to a continuous speech stream followed by a forced-choice recognition task of words versus recombined foil items (part-words or nonwords). To take the example provided by the authors, a participant presented with the sequence GHIABCDEFABCGHI may endorse ABC as being more familiar than BCG, because ABC is presented more frequently together and the learned association between A and B is stronger than between C and G. However, endorsement of ABC over BCG does not necessarily mean that the participant has "segmented" ABC from the speech stream, just as faster reaction times in responding to syllable C versus A do not necessarily indicate successful segmentation. As the authors argue on page 7, "an encounter to a sequence in which two elements co-occur (say, AB) would theoretically allow the learner to use the predictive relationship during a subsequent encounter (that A predicts B)." By the same logic, encoding the relationship between A and B could also allow for the above-chance endorsement of items that contain AB over items containing a weaker relationship.
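
      The association strengths in this example can be made concrete with forward transitional probabilities. The sketch below is only an illustration and is not drawn from the paper or its model: it treats each letter of the example stream as one syllable, and the function name is hypothetical. Note that with such a short stream some values (for example I to A) are inflated by the small number of observations.

```python
from collections import Counter


def transition_probabilities(sequence):
    """Forward transitional probability P(next | current) for adjacent items.

    Hypothetical helper for illustration; each character is one syllable.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))
    # Count each item only where it has a successor (the last item has none).
    first_counts = Counter(sequence[:-1])
    return {
        (a, b): count / first_counts[a]
        for (a, b), count in pair_counts.items()
    }


stream = "GHIABCDEFABCGHI"
tp = transition_probabilities(stream)
print(tp[("A", "B")], tp[("C", "G")])  # prints "1.0 0.5"
```

      Here A is always followed by B, while C is followed by G only half of the time, which is the asymmetry that could support both faster responses to B after A and the endorsement of ABC over BCG.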

      Both recognition performance and facilitation through target detection reflect different outcomes of statistical learning. While they may reflect different aspects of the learning process and/or dissociable forms of memory, they may best be viewed as measures of statistical learning, rather than mechanisms in and of themselves.

      Thanks for this nuanced discussion; this is an important point that R2 also raised. We agree that segmentation can refer both to an experimental paradigm and to a mechanism that accounts for learning in that paradigm. In the segmentation paradigm, participants are asked to identify which items they believe to be (whole) words from the continuous syllable stream. In the target-detection paradigm, participants are not asked to identify words from continuous streams; instead, they respond to occurrences of a particular syllable. It’s possible that learners employ one mechanism in these two tasks, or that they employ separate mechanisms. It’s also the case that, if all we had were positive evidence from both paradigms, i.e., learners succeeding in segmentation tasks as well as in target detection tasks with different types of sequences, we would have no way of distinguishing different mechanisms. As you correctly suggested, evidence for segmenting AB and for processing B faster following A is not evidence for different mechanisms.

      However, that is not the case. When the syllable sequences contain same-length subsequences (i.e., words), learning is indeed successful in both segmentation and target detection tasks. However, studies such as Hoch et al. (2013) suggest that words from mixed-length sequences are harder to segment than words from uniform-length sequences. This finding exists in adult work (e.g., Hoch et al. 2013) as well as infant work (Johnson & Tyler, 2010), and is replicated here in the newly included Experiment 3. It stands in contrast to the positive findings of the facilitation effect with mixed-length sequences in the target detection paradigm (one of our main findings in the paper). Thus, if the learning mechanisms were the same, it would be difficult to explain why humans succeed with mixed-length sequences in target detection (as shown in Experiment 2) but have difficulty with mixed-length sequences in segmentation (as shown in Hoch et al. and Experiment 3).

      In our paper, we have clarified these points and describe the separate mechanisms in more detail, in both the Introduction and General Discussion sections.

      (2) The key manipulation between experiments 1 and 2 is the length of the words in the syllable sequences, with words either constant in length (experiment 1) or mixed in length (experiment 2). The authors show that similar facilitation levels are observed across this manipulation in the current experiments. By contrast, they argue that previous findings have found that performance is impaired for mixed-length conditions compared to fixed-length conditions. Thus, a central aspect of the theoretical interpretation of the results rests on prior evidence suggesting that statistical learning is impaired in mixed-length conditions. However, it is not clear how strong this prior evidence is. There is only one published paper cited by the authors - the paper by Hoch and colleagues - that supports this conclusion in adults (other mentioned studies are all in infants, which use very different measures of learning). Other papers not cited by the authors do suggest that statistical learning can occur to stimuli of mixed lengths (Thiessen et al., 2005, using infant-directed speech; Frank et al., 2010 in adults). I think this theoretical argument would be much stronger if the dissociation between recognition and facilitation through RTs as a function of word length variability was demonstrated within the same experiment and ideally within the same group of participants.

      To summarize the evidence of learning uniform-length and mixed-length sequences (which we discussed in the Introduction section), “even though infants and adults alike have shown success segmenting syllable sequences consisting of words that were uniform in length (i.e., all words were either disyllabic; Graf Estes et al., 2007; or trisyllabic, Aslin et al., 1998), both infants and adults have shown difficulty with syllable sequences consisting of words of mixed length (Johnson & Tyler, 2010; Johnson & Jusczyk, 2003a; 2003b; Hoch et al., 2013).” The newly added Experiment 3 also provided evidence for the difference in uniform-length and mixed-length sequences. Notably, we do not agree with the idea that infant work should be disregarded as evidence just because infants were tested with habituation methods; not only were the original findings (Saffran et al. 1996) based on infant work, so were many other studies on statistical learning.

      There are other segmentation studies in the literature that have used mixed-length sequences, which are worth discussing. In short, these studies differ from the Saffran et al. (1996) studies in many important ways, and in our view, these differences explain why the learning was successful. Of interest, Thiessen et al. (2005) that you mentioned was based on infant work with infant methods, and demonstrated the very point we argued for: In their study, infants failed to learn when mixed-length sequences were pronounced as adult-directed speech, and succeeded in learning given infant-directed speech, which contained prosodic cues that were much more pronounced. The fact that infants failed to segment mixed-length sequences without certain prosodic cues is consistent with our claim that mixed-length sequences are difficult to segment in a segmentation paradigm. Another such study is Frank et al. (2010), where continuous sequences were presented in “sentences”. Different numbers of words were concatenated into sentences where a 500ms break was present between each sentence in the training sequence. One sentence contained only one word, or two words, and in the longest sentence, there were 24 words. The results showed that participants are sensitive to the effect of sentence boundaries, which coincide with word boundaries. In the extreme, the one-word-per-sentence condition simply presents learners with segmented word forms. In the 24-word-per-sentence condition, there are nevertheless sentence boundaries that are word boundaries, and knowing these word boundaries alone should allow learners to perform above chance in the test phase. Thus, in our view, this demonstrates that learners can use sentence boundaries to infer word boundaries, which is an interesting finding in its own right, but this does not show that a continuous syllable sequence with mixed word lengths is learnable without additional information. 
      In summary, to our knowledge, syllable sequences containing mixed word lengths are better learned when additional cues to word boundaries are present, and there is strong evidence that syllable sequences containing uniform-word lengths are learned better than mixed-length ones.

      Frank, M. C., Goldwater, S., Griffiths, T. L., & Tenenbaum, J. B. (2010). Modeling human performance in statistical word segmentation. Cognition, 117(2), 107-125.

      To address your proposal of running more experiments to provide stronger evidence for our theory, we had planned to run another study in which the same group of participants completed both the segmentation and target detection paradigms, as suggested, but we were unable to do so because we encountered difficulties recruiting English-speaking participants. Instead, we have included a previously unpublished experiment (now Experiment 3) showing the difference between the learning of uniform-length and mixed-length sequences in the segmentation paradigm. This experiment provides further evidence for adults’ difficulties in segmenting mixed-length sequences.

      (3) The authors argue for an "anticipation" mechanism in explaining the facilitation effect observed in the experiments. The term anticipation would generally be understood to imply some kind of active prediction process, related to generating the representation of an upcoming stimulus prior to its occurrence. However, the computational model proposed by the authors (page 24) does not encode anything related to anticipation per se. While it demonstrates facilitation based on prior occurrences of a stimulus, that facilitation does not necessarily depend on active anticipation of the stimulus. It is not clear that it is necessary to invoke the concept of anticipation to explain the results, or indeed that there is any evidence in the current study for anticipation, as opposed to just general facilitation due to associative learning.

      Thanks for raising this point. Indeed, the anticipation effect we described is indistinguishable from the facilitation effect observed in the reported experiments. We have dropped this framing.

      In addition, related to the model, given that only bigrams are stored in the model, could the authors clarify how the model is able to account for the additional facilitation at the 3rd position of a trigram compared to the 2nd position?

      Thanks for the question. We believe it is an empirical question whether there is an additional facilitation at the 3rd position of a trigram compared to the 2nd position. To investigate this issue, we conducted the following analysis with data from Experiment 1. First, we combined the data from two conditions (exact/conceptual) from Experiment 1 so as to have better statistical power. Next, we ran a mixed effect regression with data from syllable positions 2 and 3 only (i.e., data from syllable position 1 were not included). The fixed effect included the two-way interaction between syllable position and presentation, as well as stream position, and the random effect was a by-subject random intercept and stream position as the random slope. This interaction was significant (χ<sup>2</sup>(3) =11.73, p=0.008), suggesting that there is additional facilitation to the 3rd position compared to the 2nd position.

      For the model, here is an explanation of why the model assumes an additional facilitation to the 3rd position. In our model, we proposed a simple recursive relation between the RT of a syllable occurring for the nth time and the n+1<sup>th</sup> time, which is:

      and

      RT(1) = RT0 + stream_pos * stream_inc, where the n in RT(n) represents the RT for the n<sup>th</sup> presentation of the target syllable, stream_pos is the position (3-46) in the stream, and occurrence is the number of times the syllable has occurred so far in the stream.

      What this means is that the model provides an RT value for every syllable in the stream. Thus, for a target at syllable position 1, there is an RT value as an unpredictable target, and for targets at syllable position 2, there is a facilitation effect. A target at syllable position 3 is facilitated by the same amount. As such, there is an additional facilitation effect for syllable position 3 because the effects of prediction are recursive.

      (4) In the discussion of transitional probabilities (page 31), the authors suggest that "a single exposure does provide information about the transitions within the single exposure, and the probability of B given A can indeed be calculated from a single occurrence of AB." Although this may be technically true in that a calculation for a single exposure is possible from this formula, it is not consistent with the conceptual framework for calculating transitional probabilities, as first introduced by Saffran and colleagues. For example, Saffran et al. (1996, Science) describe that "over a corpus of speech there are measurable statistical regularities that distinguish recurring sound sequences that comprise words from the more accidental sound sequences that occur across word boundaries. Within a language, the transitional probability from one sound to the next will generally be highest when the two sounds follow one another within a word, whereas transitional probabilities spanning a word boundary will be relatively low." This makes it clear that the computation of transitional probabilities (i.e., Y | X) is conceptualized to reflect the frequency of XY / frequency of X, over a given language inventory, not just a single pair. Phrased another way, a single exposure to pair AB would not provide a reliable estimate of the raw frequencies with which A and AB occur across a given sample of language.

      Thanks for the discussion. We understand your argument, but we respectfully disagree that computing transitional probabilities must be conducted under a certain theoretical framework. In our humble opinion, computing transitional probabilities is a mathematical operation, and as such, it can be carried out with the smallest amount of data that permits the operation, which concretely is a single exposure during learning. While it is true that a single exposure may not provide a reliable estimate of frequencies or probabilities, it does provide information with which the learner can make decisions.

      This is particularly true for topics under discussion regarding the minimal amount of exposure that can enable learning. It is important to distinguish the following two questions: whether learners can learn from a short exposure period (from a single exposure, in fact), and how long an exposure period the learner requires to produce a reliable estimate of frequencies. Incidentally, given that learners can learn from a single exposure based on Batterink (2017) and the current study, it does not appear that learners require a long exposure period to learn about transitional probabilities.
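
      To make both sides of this exchange concrete, the following is a minimal sketch (not the analysis code from the paper) of the frequency-based definition quoted above, P(Y | X) = frequency of XY / frequency of X. The function name `transitional_probabilities` is hypothetical; the example stream is the one used in the review. The same operation is well defined on a single AB exposure and on a longer stream; what differs is how many observations stand behind each estimate.

```python
from collections import Counter


def transitional_probabilities(stream):
    """Return {(x, y): P(y | x)} estimated from adjacent pairs in `stream`.

    P(y | x) = count(x followed by y) / count(x followed by anything).
    """
    pairs = list(zip(stream, stream[1:]))
    pair_counts = Counter(pairs)
    # Count each element only in positions where a successor exists.
    first_counts = Counter(x for x, _ in pairs)
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}


# A single exposure to AB: the estimate exists, but rests on one observation.
single = transitional_probabilities("AB")
print(single[("A", "B")])

# The example stream from the review: within-word AB versus boundary-spanning CG.
stream = transitional_probabilities("GHIABCDEFABCGHI")
print(stream[("A", "B")], stream[("C", "G")])
```

      In this toy stream the within-word transition A→B is estimated higher than the boundary-spanning transition C→G, while the single-exposure estimate for A→B is computable but based on exactly one pair.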

      (5) In experiment 2, the authors argue that there is robust facilitation for trisyllabic and disyllabic words alike. I am not sure about the strength of the evidence for this claim, as it appears that there are some conflicting results relevant to this conclusion. Notably, in the regression model for disyllabic words, the omnibus interaction between word presentation and syllable position did not reach significance (p= 0.089). At face value, this result indicates that there was no significant facilitation for disyllabic words. The additional pairwise comparisons are thus not justified given the lack of omnibus interaction. The finding that there is no significant interaction between word presentation, word position, and word length is taken to support the idea that there is no difference between the two types of words, but could also be due to a lack of power, especially given the p-value (p = 0.010).

      Thanks for the comment. First, we believe there is a typo in your comment: in the last sentence, we believe you were referring to the p-value of 0.103 (source: “The interaction was not significant (χ<sup>2</sup>(3) = 6.19, p = 0.103)”). Yes, a null result with a frequentist approach cannot support a null claim, but Bayesian analyses could potentially provide evidence for the null.

      To this end, we conducted a Bayes factor analysis using the approach outlined in Harms and Lakens (2018), which generates a Bayes factor by computing a Bayesian information criterion for a null model and an alternative model. The alternative model contained a three-way interaction of word length, word presentation, and word position, whereas the null model contained a two-way interaction between word presentation and word position as well as a main effect of word length. Thus, the two models only differ in terms of whether there is a three-way interaction. The Bayes factor is then computed as exp[(BICalt − BICnull)/2]. This analysis showed that there is strong evidence for the null, where the Bayes Factor was found to be exp(25.65), which is more than 10<sup>11</sup>. Thus, there is no power issue here, and there is strong evidence for the null claim that word length did not interact with other factors in Experiment 2.
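
      As a quick check of the arithmetic, here is a minimal sketch of the BIC-based Bayes factor described above. The function name `bayes_factor_null` and the two BIC values are hypothetical and chosen only so that their difference yields the reported exponent of 25.65; the actual BIC values are not given in this response.

```python
import math


def bayes_factor_null(bic_alt, bic_null):
    """BIC approximation to the Bayes factor in favor of the null model.

    Values above 1 favor the null; larger values mean stronger evidence.
    """
    return math.exp((bic_alt - bic_null) / 2.0)


# Hypothetical BIC values: only their difference (51.30) matters here,
# which gives the exponent 25.65 reported in the response.
bf01 = bayes_factor_null(bic_alt=1051.30, bic_null=1000.0)
print(f"{bf01:.3e}")
```

      Under these assumed inputs the result is on the order of 10<sup>11</sup>, consistent with the magnitude stated above.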

      There is another issue that you mentioned: whether we should conduct pairwise comparisons if the omnibus interaction did not reach significance. This would be a concern under the original analysis plan, but we believe that a revised analysis plan makes more sense. In the revised analysis plan for Experiment 2, we start with the three-way interaction (as described in the previous paragraph). The three-way interaction was not significant, and after dropping the three-way interaction term, the two-way interaction and the main effect of word length are both significant, so we use this as the overall model. Testing the omnibus interaction between presentation and syllable position, we found that it was significant (χ<sup>2</sup>(3) =49.77, p<0.001). This shows, within a single model, that the interaction between presentation and syllable position is significant using data from both disyllabic and trisyllabic words. This was in addition to a significant fixed effect of word length (β=0.018, z=6.19, p<0.001). This should motivate the rest of the planned analysis, which concerns pairwise comparisons in the different word length conditions.

      (6) The results plotted in Figure 2 seem to suggest that RTs to the first syllable of a trisyllabic item slow down with additional word presentations, while RTs to the final position speed up. If anything, in this figure, the magnitude of the effect seems to be greater for 1st syllable positions (e.g., the RT difference between presentation 1 and 4 for syllable position 1 seems to be numerically larger than for syllable position 3, Figure 2D). Thus, it was quite surprising to see in the results (p. 16) that RTs for syllable position 1 were not significantly different for presentation 1 vs. the later presentations (but that they were significant for positions 2 and 3 given the same comparison). Is this possibly a power issue? Would there be a significant slowdown to 1st syllables if results from both the exact replication and conceptual replication conditions were combined in the same analysis?

      Thanks for the suggestion and your careful visual inspection of the data. After combining the data, the slowdown to 1st syllables is indeed significant. We have reported this in the results of Experiment 1 (with an acknowledgement to this review):

      Results showed that later presentations took significantly longer to respond to compared to the first presentation (χ<sup>2</sup>(3) = 10.70, p=0.014), where the effect grew larger with each presentation (second presentation: β=0.011, z=1.82, p=0.069; third presentation: β=0.019, z=2.40, p=0.016; fourth presentation: β=0.034, z=3.23, p=0.001).

      (7) It is difficult to evaluate the description of the PARSER simulation on page 36. Perhaps this simulation should be introduced earlier in the methods and results rather than in the discussion only.

      Thanks for the suggestions. We have added two separate simulations in the paper, which should describe the PARSER simulations sufficiently, as well as provide further information on the correspondence between the simulations and the experiments. Thanks again for the great review! We believe our paper has improved significantly as a result.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Domingo et al. present a novel perturbation-based approach to experimentally modulate the dosage of genes in cell lines. Their approach is capable of gradually increasing and decreasing gene expression. The authors then use their approach to perturb three key transcription factors and measure the downstream effects on gene expression. Their analysis of the dosage response curve of downstream genes reveals marked non-linearity.

      One of the strengths of this study is that many of the perturbations fall within the physiological range for each cis gene. This range is presumably between a single-copy state of heterozygous loss-of-function (log fold change of -1) and a three-copy state (log fold change of ~0.6). This is in contrast with CRISPRi or CRISPRa studies that attempt to maximize the effect of the perturbation, which may result in downstream effects that are not representative of physiological responses.
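
      The two reference points above can be reproduced with a short calculation. This is a minimal sketch that assumes expression scales linearly with copy number, a diploid baseline of two copies, and a base-2 logarithm; the function name `dosage_log2_fold_change` is hypothetical.

```python
import math


def dosage_log2_fold_change(copies, baseline_copies=2):
    """log2 fold change in dosage relative to a diploid baseline,
    assuming expression is proportional to copy number."""
    return math.log2(copies / baseline_copies)


print(dosage_log2_fold_change(1))            # single-copy state
print(round(dosage_log2_fold_change(3), 3))  # three-copy state
```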

      Another strength of the study is that various points along the dosage-response curve were assayed for each perturbed gene. This allowed the authors to effectively characterize the degree of linearity and monotonicity of each dosage-response relationship. Ultimately, the study revealed that many of these relationships are non-linear, and that the response to activation can be dramatically different than the response to inhibition.

      To test their ability to gradually modulate dosage, the authors chose to measure three transcription factors and around 80 known downstream targets. As the authors themselves point out in their discussion about MYB, this biased sample of genes makes it unclear how this approach would generalize genome-wide. In addition, the data generated from this small sample of genes may not represent genome-wide patterns of dosage response. Nevertheless, this unique data set and approach represents a first step in understanding dosage-response relationships between genes.

      Another point of general concern in such screens is the use of the immortalized K562 cell line. It is unclear how the biology of these cell lines translates to the in vivo biology of primary cells. However, the authors do follow up with cell-type-specific analyses (Figures 4B, 4C, and 5A) to draw a correspondence between their perturbation results and the relevant biology in primary cells and complex diseases.

      The conclusions of the study are generally well supported with statistical analysis throughout the manuscript. As an example, the authors utilize well-known model selection methods to identify when there was evidence for non-linear dosage response relationships.

      Gradual modulation of gene dosage is a useful approach to model physiological variation in dosage. Experimental perturbation screens that use CRISPR inhibition or activation often use guide RNAs targeting the transcription start site to maximize their effect on gene expression. Generating a physiological range of variation will allow others to better model physiological conditions.

      There is broad interest in the field to identify gene regulatory networks using experimental perturbation approaches. The data from this study provides a good resource for such analytical approaches, especially since both inhibition and activation were tested. In addition, these data provide a nuanced, continuous representation of the relationship between effectors and downstream targets, which may play a role in the development of more rigorous regulatory networks.

      Human geneticists often focus on loss-of-function variants, which represent natural knock-down experiments, to determine the role of a gene in the biology of a trait. This study demonstrates that dosage response relationships are often non-linear, meaning that the effect of a loss-of-function variant may not necessarily carry information about increases in gene dosage. For the field, this implies that others should continue to focus on both inhibition and activation to fully characterize the relationship between gene and trait.

      We thank the reviewer for their thoughtful and thorough evaluation of our study. We appreciate their recognition of the strengths of our approach, particularly the ability to modulate gene dosage within a physiological range and to capture non-linear dosage-response relationships. We also agree with the reviewer’s points regarding the limitations of gene selection and the use of K562 cells, and we are encouraged that the reviewer found our follow-up analyses and statistical framework to be well-supported. We believe this work provides a valuable foundation for future genome-wide applications and more physiologically relevant perturbation studies.

      Reviewer #2 (Public review):

      Summary:

      This work investigates transcriptional responses to varying levels of transcription factors (TFs). The authors aim for gradual up- and down-regulation of three transcription factors GFI1B, NFE2, and MYB in K562 cells, by using a CRISPRa- and a CRISPRi line, together with sgRNAs of varying potency. Targeted single-cell RNA sequencing is then used to measure gene expression of a set of 90 genes, which were previously shown to be downstream of GFI1B and NFE2 regulation. This is followed by an extensive computational analysis of the scRNA-seq dataset. By grouping cells with the same perturbations, the authors can obtain groups of cells with varying average TF expression levels. The achieved perturbations are generally subtle, not reaching half or double doses for most samples, and up-regulation is generally weak below 1.5-fold in most cases. Even in this small range, many target genes exhibit a non-linear response. Since this is rather unexpected, it is crucial to rule out technical reasons for these observations.

      We thank the reviewer for their detailed and thoughtful assessment of our work. We are encouraged by their recognition of the strengths of our study, including the value of quantitative CRISPR-based perturbation coupled with single-cell transcriptomics, and its potential to inform gene regulatory network inference. Below, we address each of the concerns raised:

      Strengths:

      The work showcases how a single dataset of CRISPRi/a perturbations with scRNA-seq readout and an extended computational analysis can be used to estimate transcriptome dose responses, a general approach that likely can be built upon in the future.

      Weaknesses:

      (1) The experiment was only performed in a single replicate. In the absence of an independent validation of the main findings, the robustness of the observations remains unclear.

      We acknowledge that our study was performed in a single pooled experiment. While additional replicates would certainly strengthen the findings, in high-throughput single-cell CRISPR screens, individual cells with the same perturbation serve as effective internal replicates. This is a common practice in the field. Nevertheless, we agree that biological replicates would help control for broader technical or environmental effects.

      (2) The analysis is based on the calculation of log-fold changes between groups of single cells with non-targeting controls and those carrying a guide RNA driving a specific knockdown. How the fold changes were calculated exactly remains unclear, since it is only stated that the FindMarkers function from the Seurat package was used, which is likely not optimal for quantitative estimates. Furthermore, differential gene expression analysis of scRNA-seq data can suffer from data distortion and mis-estimations (Heumos et al. 2023 (https://doi.org/10.1038/s41576-023-00586-w), Nguyen et al. 2023 (https://doi.org/10.1038/s41467-023-37126-3)). In general, the pseudo-bulk approach used is suitable, but the correct treatment of drop-outs in the scRNA-seq analysis is essential.

      We thank the reviewer for highlighting recent concerns in the field. A study benchmarking association testing methods for perturb-seq data found that among existing methods, Seurat’s FindMarkers function performed the best (T. Barry et al. 2024).

      In the revised Methods, we now specify the formula used to calculate fold change and clarify that the estimates are derived from the Wilcoxon test implemented in Seurat’s FindMarkers function. We also employed pseudo-bulk grouping to mitigate single-cell noise and dropout effects.

      (3) Two different cell lines are used to construct dose-response curves, where a CRISPRi line allows gene down-regulation and the CRISPRa line allows gene upregulation. Although both lines are derived from the same parental line (K562) the expression analysis of Tet2, which is absent in the CRISPRi line, but expressed in the CRISPRa line (Figure S3A) suggests substantial clonal differences between the two lines. Similarly, the PCA in S4A suggests strong batch effects between the two lines. These might confound this analysis.

      We agree that baseline differences between CRISPRi and CRISPRa lines could introduce confounding effects if not appropriately controlled for. We emphasize that all comparisons are made as fold changes relative to non-targeting control (NTC) cells within each line, thereby controlling for batch- and clone-specific baseline expression. See Figures S4A and S4B.
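
      The idea of normalizing within each line can be illustrated with a small sketch. This is not the formula from the revised Methods and not the computation performed by Seurat's FindMarkers; it is a generic pseudo-bulk log2 fold change with an assumed pseudocount of 1, and the function name `log2fc_within_line`, the guide labels, and the toy expression values are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def log2fc_within_line(cells, pseudocount=1.0):
    """Pseudo-bulk log2 fold change of each guide versus the NTC cells
    of the SAME cell line.

    cells: iterable of (line, guide, expression_value) tuples.
    Returns {(line, guide): log2 fold change}.
    """
    grouped = defaultdict(list)
    for line, guide, value in cells:
        grouped[(line, guide)].append(value)

    result = {}
    for (line, guide), values in grouped.items():
        if guide == "NTC":
            continue
        ntc_values = grouped.get((line, "NTC"))
        if not ntc_values:
            continue  # no matched control in this line; skip rather than guess
        ratio = (mean(values) + pseudocount) / (mean(ntc_values) + pseudocount)
        result[(line, guide)] = math.log2(ratio)
    return result


# Toy data: the two lines have different baselines (NTC mean 10 vs 20),
# but each guide is compared only with the controls from its own line.
toy_cells = [
    ("CRISPRi", "NTC", 9.0), ("CRISPRi", "NTC", 11.0),
    ("CRISPRi", "g1", 4.0), ("CRISPRi", "g1", 5.0),
    ("CRISPRa", "NTC", 19.0), ("CRISPRa", "NTC", 21.0),
    ("CRISPRa", "g2", 30.0), ("CRISPRa", "g2", 32.0),
]
fc = log2fc_within_line(toy_cells)
print(fc)
```

      Because each fold change is computed against the control cells of its own line, a constant baseline difference between the two lines does not enter the estimate; effects that interact with clone-specific biology would not be removed by this normalization.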

      (4) The study uses pseudo-bulk analysis to estimate the relationship between TF dose and target gene expression. This requires a system that allows quantitative changes in TF expression. The data provided does not convincingly show that this condition is met, which however is an essential prerequisite for the presented conclusions. Specifically, the data shown in Figure S3A shows that upon stronger knock-down, a subpopulation of cells appears, where the targeted TF is not detected anymore (drop-outs). Also Figure 3B (top) suggests that the knock-down is either subtle (similar to NTCs) or strong, but intermediate knock-down (log2-FC of 0.5-1) does not occur. Although the authors argue that this is a technical effect of the scRNA-seq protocol, it is also possible that this represents a binary behavior of the CRISPRi system. Previous work has shown that CRISPRi systems with the KRAB domain largely result in binary repression and not in gradual down-regulation as suggested in this study (Bintu et al. 2016 (https://doi.org/10.1126/science.aab2956), Noviello et al. 2023 (https://doi.org/10.1038/s41467-023-38909-4)).

      Figure S3A shows normalized expression values, not fold changes. A pseudobulk approach reduces single-cell noise and dropout effects. To test whether dropout events reflect true binary repression or technical effects, we compared trans-effects across cells with zero versus low-but-detectable target gene expression (Figure S3B). These effects were highly concordant, supporting the interpretation that dropout is largely technical in origin. We agree that KRAB-based repression can exhibit binary behavior in some contexts, but our data suggest that cells with intermediate repression exist and are biologically meaningful. In ongoing unpublished work, we are analyzing these data further at the single-cell level, and find that for nearly all guides the dosage effects are indeed gradual rather than driven by binary effects across cells.

      (5) One of the major conclusions of the study is that non-linear behavior is common. This is not surprising for gene up-regulation, since gene expression will reach a plateau at some point, but it is surprising to be observed for many genes upon TF down-regulation. Specifically, here the target gene responds to a small reduction of TF dose but shows the same response to a stronger knock-down. It would be essential to show that this observation does not arise from the technical concerns described in the previous point and it would require independent experimental validations.

      This phenomenon, in which relatively small changes in cis gene dosage produce trans effects that can exceed the magnitude of the cis gene perturbation, is not unique to our study. It also makes biological sense, since transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many other genes that are regulated by TFs. Empirically, these effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022), to name a few studies whose data our lab has examined directly.

      (6) One of the conclusions of the study is that guide tiling is superior to other methods such as sgRNA mismatches. However, the comparison is unfair, since different numbers of guides are used in the different approaches. Relatedly, the authors point out that tiling sometimes surpassed the effects of TSS-targeting sgRNAs, however, this was the least fair comparison (2 TSS vs 10 tiling guides) and additionally depends on the accurate annotation of TSS in the relevant cell line.

      We do not draw this conclusion simply from observing the range achieved but from a more holistic observation. We would like to clarify that the number of sgRNAs used in each approach is proportional to the number of base pairs that can be targeted in each region: while the TSS-targeting strategy is typically constrained to a small window of a few dozen base pairs, tiling covers multiple kilobases upstream and downstream, resulting in more guides by design rather than by experimental bias. In addition, the guides with mismatches did not perform well for gradual upregulation.

      We would also like to point out that the observation that the strongest effects can arise from regions outside the annotated TSS is not unique to our study and has been demonstrated in prior work (referenced in the text).

      To address this concern, we have revised the text to clarify that we do not consider guide tiling to be inherently superior to other approaches such as sgRNA mismatches. Rather, we now describe tiling as a practical and straightforward strategy to obtain a wide range of gene dosage effects without requiring prior knowledge beyond the approximate location of the TSS. We believe this rephrasing more accurately reflects the intent and scope of our comparison.

      (7) Did the authors achieve their aims? Do the results support the conclusions?: Some of the most important conclusions are not well supported because they rely on accurately determining the quantitative responses of trans genes, which suffers from the previously mentioned concerns.

      We appreciate the reviewer’s concern, but we would have wished for a more detailed characterization of which conclusions are not supported, given that we believe our approach actually accounts for the major concerns raised above. We believe that the observation of non-linear effects is a robust conclusion that is also consistent with known biology, with this paper introducing new ways to analyze this phenomenon.

      (8) Discussion of the likely impact of the work on the field, and the utility of the methods and data to the community:

      Together with other recent publications, this work emphasizes the need to study transcription factor function with quantitative perturbations. Missing documentation of the computational code repository reduces the utility of the methods and data significantly.

      Documentation is included as inline comments within the R code files to guide users through the analysis workflow.

      Reviewer #1 (Recommendations for the authors):

      In Figure 3C (and similar plots of dosage response curves throughout the manuscript), we initially misinterpreted the plots because we assumed that the zero log fold change on the horizontal axis was in the middle of the plot. This gives the incorrect interpretation that the trans genes are insensitive to loss of GFI1B in Figure 3C, for instance. We think it may be helpful to add a line to mark the zero log fold change point, as was done in Figure 3A.

      We thank the reviewer for this helpful suggestion. To improve clarity, we have added a vertical line marking the zero log fold change point in Figure 3C and all similar dosage-response plots. We agree this makes the plots easier to interpret at a glance.

      Similarly, for heatmaps in the style of Figure 3B, it may be nice to have a column for the non-targeting controls, which should be a white column between the perturbations that increase versus decrease GFI1B.

      We appreciate the suggestion. However, because all perturbation effects are computed relative to the non-targeting control (NTC) cells, explicitly including a separate column for NTC in the heatmap would add limited interpretive value and could unnecessarily clutter the figure. For clarity, we have emphasized in the figure legend that the fold changes are relative to the NTC baseline.

      We found it challenging to assess the degree of uncertainty in the estimation of log fold changes throughout the paper. For example, the authors state the following on line 190: "We observed substantial differences in the effects of the same guide on the CRISPRi and CRISPRa backgrounds, with no significant correlation between cis gene fold-changes." This claim was challenging to assess because there are no horizontal or vertical error bars on any of the points in Figure 2A. If the log fold change estimates are very noisy, the data could be consistent with noisy observations of a correlated underlying process. Similarly, to our understanding, the dosage response curves are fit assuming that the cis log fold changes are fixed. If there is excessive noise in the estimation of these log fold changes, it may bias the estimated curves. It may be helpful to give an idea of the amount of estimation error in the cis log fold changes.

      We agree that assessing the uncertainty in log fold change estimates is important for interpreting both the lack of correlation between CRISPRi and CRISPRa effects (Figure 2A) and the robustness of the dosage-response modeling.

      In response, we have now updated Figure 2A to include both vertical and horizontal error bars, representing the standard errors of the log2 fold-change estimates for each guide in the CRISPRi and CRISPRa conditions. These error estimates were computed based on the differential expression analysis performed using the FindMarkers function in Seurat, which models gene expression differences between perturbed and control cells. We also now clarify this in the figure legend and methods.

      The authors mention hierarchical clustering on line 313, which identified six clusters. Although a dendrogram is provided, these clusters are not displayed in Figure 4A. We recommend displaying these clusters alongside the dendrogram.

      We have added colored bars indicating the clusters to improve the clarity. Thank you for the suggestion.

      In Figures 4B and 4C, it was not immediately clear what some of the gene annotations meant. For example, neither the text nor the figure legend discusses what "WBCs", "Platelets", "RBCs", or "Reticulocytes" mean. It would be helpful to include this somewhere other than only the methods to make the figure more clear.

      To improve clarity, we have updated the figure legends for Figures 4B and 4C to explicitly define these abbreviations.

      We struggled to interpret Figure 4E. Although the authors focus on the association of MYB with pHaplo, we would have appreciated some general discussion about the pattern of associations seen in the figure and what the authors expected to observe.

      We have changed the paragraph to add more exposition and clarification:

      “The link between selective constraint and response properties is most apparent in the MYB trans network. Specifically, the probability of haploinsufficiency (pHaplo) shows a significant negative correlation with the dynamic range of transcriptional responses (Figure 4G): genes under stronger constraint (higher pHaplo) display smaller dynamic ranges, indicating that dosage-sensitive genes are more tightly buffered against changes in MYB levels. This pattern was not reproduced in the other trans networks (Figure 4E)”.

      Line 71: potentially incorrect use of "rending" and incorrect sentence grammar.

      Fixed

      Line 123: "co-expression correlation across co-expression clusters" - authors may not have intended to use "co-expression" twice.

      Original sentence was correct.

      Line 246: "correlations" is used twice in "correlations gene-specific correlations."

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      (1) To show that the approach indeed allows gradual down-regulation it would be important to quantify the knock-down strength with a single-cell readout for a subset of sgRNAs individually (e.g. flowFISH/protein staining flow cytometry).

      We agree that single-cell validation of knockdown strength using orthogonal approaches such as flowFISH or protein staining would provide additional support. However, such experiments fall outside the scope of the current study and are not feasible at this stage. We note that the observed transcriptomic changes and dosage responses across multiple perturbations are consistent with effective and graded modulation of gene expression.

      (2) Similarly, an independent validation of the observed dose-response relationships, e.g. with individual sgRNAs, can be helpful to support the conclusions about non-linear responses.

      Fig. S4C includes replication of trans-effects for a handful of guides used both in this study and in Morris et al. While further orthogonal validation of dose-response relationships would be valuable, such extensive additional work is not currently feasible within the scope of this study. Nonetheless, the high degree of replication in Fig. S4C, together with the consistency of patterns observed across multiple sgRNAs and target genes, provides strong support for the conclusions drawn from our high-throughput screen.

      (3) The calculation of the log2 fold changes should be documented more precisely. To perform a pseudo-bulk analysis, the raw UMI counts should be summed up in each group (NTC, individual targeting sgRNAs), including zero counts, then the data should be normalized and the fold change should be calculated. The DESeq package for example would be useful here.

      We have updated the methods in the manuscript to provide more exposition of how the logFC was calculated:

      “In our differential expression (DE) analysis, we used Seurat’s FindMarkers() function, which computes the log fold change as the difference between the average normalized gene expression in each group on the natural log scale:

      Logfc = log_e(mean(expression in group 1)) - log_e(mean(expression in group 2))

      This is calculated in pseudobulk, where cells with the same sgRNA are grouped together and the mean expression is compared to the mean expression of cells harbouring NTC guides. To calculate the per-gene differential expression p-value between the two cell groups (cells with sgRNA vs cells with NTC), a Wilcoxon rank-sum test was used”.
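      The log fold change formula quoted above can be sketched in a few lines of Python. This is a minimal illustration of the formula as written in the response, not Seurat's implementation; the function name, the optional pseudocount parameter, and the example values are assumptions introduced here, and the Wilcoxon rank-sum test is not covered.

```python
import math


def log_fold_change(perturbed, control, pseudocount=0.0):
    """Natural-log fold change: ln(mean(perturbed)) - ln(mean(control)).

    `pseudocount` is a hypothetical convenience for groups whose mean is zero;
    the formula quoted in the response has no such term, so it defaults to 0.
    """
    if not perturbed or not control:
        raise ValueError("both groups must contain at least one cell")
    mean_perturbed = sum(perturbed) / len(perturbed) + pseudocount
    mean_control = sum(control) / len(control) + pseudocount
    if mean_perturbed <= 0 or mean_control <= 0:
        raise ValueError("group means must be positive to take a logarithm")
    return math.log(mean_perturbed) - math.log(mean_control)


# Hypothetical normalized expression of one gene in cells carrying a given
# sgRNA and in cells carrying non-targeting control (NTC) guides.
sgrna_cells = [1.0, 2.0, 3.0]  # mean = 2.0
ntc_cells = [4.0, 4.0, 4.0]    # mean = 4.0
lfc = log_fold_change(sgrna_cells, ntc_cells)  # ln(2.0) - ln(4.0) = ln(0.5)
```

      Because the means are taken over all cells in each group before the logarithm, cells with zero counts contribute to the estimate instead of being dropped.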

      (4) A more careful characterization of the cell lines used would be helpful. First, it would be useful to include the quality controls performed when the clonal lines were selected, in the manuscript. Moreover, a transcriptome analysis in comparison to the parental cell line could be performed to show that the cell lines are comparable. In addition, it could be helpful to perform the analysis of the samples separately to see how many of the response behaviors would still be observed.

      Details of the quality control steps used during the selection of the CRISPRa clonal line are already included in the Methods section, and Fig. S4A shows the transcriptome comparison of CRISPRi and CRISPRa lines also for non-targeting guides. Regarding the transcriptomic comparison with the parental cell line, we agree that such an analysis would be informative; however, this would require additional experiments that are not feasible within the scope of the current study. Finally, while analyzing the samples separately could provide further insight into response heterogeneity, we focused on identifying robust patterns across perturbations that are reproducible in our pooled screening framework. We believe these aggregate analyses capture the major response behaviors and support the conclusions drawn.

      (5) In general we were surprised to see such strong responses in some of the trans genes, in some cases exceeding the fold changes of the cis gene perturbation more than 2x, even at the relatively modest cis gene perturbations (Figures S5-S8). How can this be explained?

      This phenomenon—where trans gene responses can exceed the magnitude of cis gene perturbations—is not unique to our study. Similar effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022).

      Several factors may contribute to this pattern. One possibility is that certain trans genes are highly sensitive to transcription factor dosage, and therefore exhibit amplified expression changes in response to relatively modest upstream perturbations. Transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many other genes (that are regulated by TFs). Mechanistically, this may involve non-linear signal propagation through regulatory networks, in which intermediate regulators or feedback loops amplify the downstream transcriptional response. While our dataset cannot fully disentangle these indirect effects, the consistency of this observation across multiple studies suggests it is a common feature of transcriptional regulation in K562 cells.

      (6) In the analysis shown in Figure S3B, the correlation between cells with zero count and >0 counts for the cis gene is calculated. For comparison, this analysis should also show the correlation between the cells with similar cis-gene expression and between truly different populations (e.g. NTC vs strong sgRNA).

      The intent of Figure S3B was not to compare biologically distinct populations or perform differential expression analyses—which we have already conducted and reported elsewhere in the manuscript—but rather to assess whether fold change estimates could be biased by differences in the baseline expression of the target gene across individual cells. Specifically, we sought to determine whether cells with zero versus non-zero expression (as can result from dropouts or binary on/off repression from the KRAB-based CRISPRi system) exhibit systematic differences that could distort fold change estimation. As such, the comparisons suggested by the reviewer do not directly relate to the goal of the analysis which Figure S3B was intended to show.

      (7) It is unclear why the correlation between different lanes is assessed as quality control metrics in Figure S1C. This does not substitute for replicates.

      The intent of Figure S1C was not to serve as a general quality control metric, but rather to illustrate that the targeted transcript capture approach yielded consistent and specific signal across lanes. We acknowledge that this may have been unclear and have revised the relevant sentence in the text to avoid misinterpretation.

      “We used the protein hashes and the dCas9 cDNA (indicating the presence or absence of the KRAB domain) to demultiplex and determine the cell line—CRISPRi or CRISPRa. Cells containing a single sgRNA were identified using a Gaussian mixture model (see Methods). Standard quality control procedures were applied to the scRNA-seq data (see Methods). To confirm that the targeted transcript capture approach worked as intended, we assessed concordance across capture lanes (Figure S1C)”.

      (8) Figures and legends often miss important information. Figure 3B and S5-S8: what do the transparent bars represent? Figure S1A: color bar label missing. Figure S4D: what are the lines?, Figure S9A: what is the red line? In Figure S8 some of the fitted curves do not overlap with the data points, e.g. PKM. Fig. 2C: why are there more than 96 guide RNAs (see y-axis)?

      We have addressed each point as follows:

      Figure 3B: The figure legend has been updated to clarify the meaning of the transparent bars.

      Figures S5–S8: There are no transparent bars in these figures; we confirmed this in the source plots.

      Figure S1A: The color bar label is already described in the figure legend, but we have reformulated the caption text to make this clearer.

      Figure S4D: The dashed line represents a linear regression between the x and y variables. The figure caption has been updated accordingly.

      Figure S9A: We clarified that the red line shows the median ∆AIC across all genes and conditions.

      Figure S8: We agree that some fitted curves (e.g., PKM) do not closely follow the data points. This reflects high noise in these specific measurements; as noted in the text, TET2 is not expected to exert strong trans effects in this context.

      Figure 2C: Thank you for catching this. The y-axis numbers were incorrect because the figure displays the proportion of guides (summing to 100%), not raw counts. We have corrected the y-axis label and updated the numbers in the figure to resolve this inconsistency.

      (9) The code is deposited on Github, but documentation is missing.

      Documentation is included as inline comments within the R code files to guide users through the analysis workflow.

      (10) The methods miss a list of sgRNA target sequences.

      We thank the reviewer for this observation. A complete table containing all processed data, including the sequences of the sgRNAs used in this study, is available at the following GEO link:

      https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE257547&format=file&file=GSE257547%5Fd2n%5Fprocessed%5Fdata%2Etxt%2Egz

      (11) In some parts, the language could be more specific and/or the readability improved, for example:

      Line 88: "quantitative landscape".

      Changed to “quantitative patterns”.

      Lines 88-91: long sentence hard to read.

      This complex sentence was broken up into two simpler ones:

      “We uncovered quantitative patterns of how gradual changes in transcription dosage lead to linear and non-linear responses in downstream genes. Many downstream genes are associated with rare and complex diseases, with potential effects on cellular phenotypes”.

      Line 110: "tiling sgRNAs +/- 1000 bp from the TSS", could maybe be specified by adding that the average distance was around 100 or 110 bps?

      Lines 244-246: hard to understand.

      We struggle to see the issue here and are not sure how it can be reworded.

      Lines 339-342: hard to understand.

      These sentences have been reworded to provide more clarity.

      (12) A number of typos, and errors are found in the manuscript:

      Line 71: "SOX2" -> "SOX9".

      FIXED

      Line 73: "rending" -> maybe "raising" or "posing"?

      FIXED

      Line 157: "biassed".

      FIXED

      Line 245: "exhibited correlations gene-specific correlations with".

      FIXED

      Multiple instances, e.g. 261: "transgene" -> "trans gene".

      FIXED

      Line 332: "not reproduced with among the other".

      FIXED

      Figure S11: betweenness.

      This is the correct spelling

      There are more typos that we didn't list here.

      We went through the manuscript and corrected all the spelling errors and typos.

    1. While there is no easy exit from the morass of racial politics in North America and the roles assigned to teachers of writing, reading, and speaking within that morass, there are alternatives to thoughtlessly going along. If there is insufficient work within the field of writing studies to teach us how to think more deeply and effectively about antiracist pedagogical practice in the writing centre, then perhaps we may find aid in published scholarship outside the field, as well as inspiration and a firmer footing for producing our own. In this regard, two recently published books stand out to me as offering both a richly developed theoretical framework and teaching advice that can easily be transferred from the classroom to the writing centre context: Other People’s English: Code-Meshing, Code-Switching, and African American Literacy, written by Vershawn Ashanti Young, Rusty Barrett, Y’Shanda Young-Rivera, & Kim Brian Lovejoy (2014) (published by Teachers College Press), and Survivance, Sovereignty, and Story: Teaching American Indian Rhetorics, edited by Lisa King, Rose Gubele, & Joyce Rain Anderson (2015b) (published by Utah State University Press).

      It's hard to escape racism in writing education. Teachers can use existing scholarship to learn how to teach fairly.

    1. Standard English today Although language changes all the time – think of new words like Internet, Web site, and so on – we still use Standard English as the formal form of our language. Standard English is the form that is taught in schools, following set rules of grammar and spelling. Newspapers are written in Standard English and it is used by newsreaders on national television, who need to be understood by people with different local dialects, all over the country. For some people, it is not difficult to use Standard English, because it happens to be their local dialect. But for others in different parts of the country, they may have to remind themselves to follow the rules, including the sentence order and grammar of Standard English, when they are speaking or writing in a formal context. However, Standard English can be spoken in any accent, and must not be confused with talking ‘posh’.

      Different ways to use formal and informal Standard English.

  5. social-media-ethics-automation.github.io
    1. Meg van Achterberg. Jimmy Kimmel’s Halloween prank can scar children. Why are we laughing? Washington Post, October 2017. URL: https://www.washingtonpost.com/outlook/jimmy-kimmel-wants-to-prank-kids-why-are-we-laughing/2017/10/20/9be17716-aed0-11e7-9e58-e6288544af98_story.html (visited on 2023-12-10).

      While reading this article, I thought of those moments when people said "just kidding", but for children, it was not just a joke. Adults might find pranks amusing, but the fear and humiliation children feel at that moment are one hundred percent real. Especially when they are recorded by cameras, posted online, and shown to strangers as a joke, that sense of powerlessness may linger in their hearts for a long time. Children cannot understand that "this is entertainment", they only think that if even their parents can laugh at them, then who else can they trust? This made me realize that laughter and hurt are sometimes separated by only a very thin line, and we often cross it when children are at their most vulnerable.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Revision Plan

      Manuscript number: RC-2025-03208

      Corresponding author(s): Jared Nordman

      [The "revision plan" should delineate the revisions that authors intend to carry out in response to the points raised by the referees. It also provides the authors with the opportunity to explain their view of the paper and of the referee reports.

      The document is important for the editors of affiliate journals when they make a first decision on the transferred manuscript. It will also be useful to readers of the preprint and help them to obtain a balanced view of the paper.

      If you wish to submit a full revision, please use our "Full Revision" template. It is important to use the appropriate template to clearly inform the editors of your intentions.]

      1. General Statements [optional]

      All three reviewers of our manuscript were very positive about our work. The reviewers noted that our work represents a necessary advance that is timely, addresses important issues in the chromatin field, and will be of broad interest to this community. Given the nature of our work and the positive reviews, we feel that this manuscript would best be suited for the Journal of Cell Biology.

      2. Description of the planned revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary:

      The authors investigate the function of the H3 chaperone NASP, which is known to bind directly to H3 and prevent degradation of soluble H3. What is unclear is where NASP functions in the cell (nucleus or cytoplasm), how NASP protects H3 from degradation (direct or indirect), and if NASP affects H3 dynamics (nuclear import or export). They use the powerful model system of Drosophila embryos because the soluble H3 pool is high due to maternal deposition and they make use of photoconvertible Dendra-tagged proteins, since these are maternally deposited and can be used to measure nuclear import/export rates.

      Using these systems and tools, they conclude that NASP affects nuclear import, but only indirectly, because embryos from NASP mutant mothers start out with 50% of the maternally deposited H3. Because of the depleted H3 and reduced import rates, NASP deficient embryos also have reduced nucleoplasmic and chromatin-associated H3. Using a new Dendra-tagged NASP allele, the authors show that NASP and H3 have different nuclear import rates, indicating that NASP is not a chaperone that shuttles H3 into the nucleus. They test H3 levels in embryos that have no nuclei and conclude that NASP functions in the cytoplasm, and through protein aggregation assays they conclude that NASP prevents H3 aggregation.

      Major comments:

      The text was easy to read and logical. The data are well presented, methods are complete, and statistics are robust. The conclusions are largely reasonable. However, I am having trouble connecting the conclusions in text to the data presented in Figure 4.

      First, I'm confused why the conclusion from Figure 4A is that NASP functions in the cytoplasm of the egg. Couldn't NASP be required in the ovary (in, say, nurse cell nuclei) to stimulate H3 expression and deposition into the egg? The results in 4A would look the same if the mothers deposit 50% of the normal H3 into the egg. Why is NASP functioning specifically in the cytoplasm when it is also so clearly imported into the nucleus? Maybe NASP functions wherever it is, and by preventing nuclear import, you force it to function in the cytoplasm. I do not have additional suggestions for experiments, but I think the authors need to be very clear about the different interpretations of these data and to discuss WHY they believe their conclusion is strongest.

      The concern raised by the reviewer regarding NASP function during oogenesis was addressed in previous work published by our lab. Unfortunately, we did not do a good job of conveying this work in the original version of this manuscript. We demonstrated that total H3 levels are unaffected when comparing WT and NASP mutant stage 14 egg chambers. This means that the amount of H3 deposited into the eggs does not change in the absence of NASP. To address the reviewer's comment, we will change the text to make the link to our previous work clear.

      Second, an alternate conclusion from Figure 4D/E is that mothers are depositing less H3 protein into the egg, but the same total amount is being aggregated. This amount of aggregated protein remains constant in activated eggs, but additional H3 translation leads to more total H3? The authors mention that additional translation can compensate for reduced histone pools (line 416).

      Similar to our response above, the total amount of H3 in wild type and NASP mutant stage 14 egg chambers is the same. Therefore, mothers are depositing equal amounts of H3 into the egg. We will make the necessary changes in the text to make this point clear.

      As the function of NASP in the cytoplasm (when it clearly imports into the nucleus) and role in H3 aggregation are major conclusions of the work, the authors need to present alternative conclusions in the text or complete additional experiments to support the claims. Again, I do not have additional suggestions for experiments, but I think the authors need to be very clear about the different interpretations of these data and to discuss WHY they believe their conclusion is strongest.

      A common issue raised by all three reviewers was the need to demonstrate more convincingly that the assay we used to isolate protein aggregates does, in fact, isolate protein aggregates. To verify this, we will perform the aggregate isolation assay using controls that are known to induce more protein aggregation. We will perform the aggregation assay with egg chambers or extracts that are exposed to heat shock or the aggregation-inducing chemicals Canavanine and Azetidine-2-carboxylic acid. The chemical treatment was a welcome suggestion from reviewer #3. These experiments will significantly strengthen any claims based on the outcome of the aggregation assay.

      We will also make changes to the text and include other interpretations of our work as the reviewer has suggested.

      Data presentation:

      Overall, I suggest moving some of the supplemental figures to the main text, adding representative movie stills to show where the quantitative data originated, and moving the H3.3 data to the supplement. Not because it's not interesting, but because H3.3 and H3.2 are behaving the same.

      Where possible, we will make changes to the figure display to improve the logic and flow of the manuscript.

      Fig 1:

      It would strengthen the figure to include representative still images that led to the quantitative data, mostly so readers understand how the data were collected.

      We will add representative stills to Figure 1 to help readers understand how the data were collected. We will also add a representative H3-Dendra movie similar to the NASP supplemental movie.

      The inclusion of a "simulated 50% H3" in panel C is confusing. Why?

      We used a 50% reduction in H3 levels because that is the reduction in H3 we measured in embryos laid by NASP-mutant mothers in our previous work. A reduction in H3 levels alone would be predicted to change the nuclear import rate of H3. Thus, having a quantitative model of H3 import kinetics was key to our understanding of NASP function in vivo. We will revise the text to make this clear.

      I would also consider normalizing the data between A and B (and C and D) by dividing NASP/WT. This could be included in the supplement (OPTIONAL)

      We can normalize the values and include the data in a supplemental figure.

      Fig S1:

      The data simulation S1G should be moved to the main text, since it is the primary reason the authors reject the hypothesis that NASP influences H3 import rates.

      This is a good point. We will move S1G into Figure 1.

      Fig 2:

      Once again, I think it would help to include a few representative images of the photoconverted Dendra2 in the main text.

      We will add representative images of the photoconversion in Figure 2.

      I struggled with A/B, I think due to not knowing how the data were normalized. When I realized that the WT and NASP data are not normalized to each other, but that the NASP values are likely starting less than the WT values, it made way more sense. I suggest switching the order of data presentation so that C-F are presented first to establish that there is less chromatin-bound H3 in the first place, and then present A/B to show no change in nuclear export of the H3 that is present, allowing the conclusion of both less soluble AND chromatin-bound H3.

      We chose this order of presentation in order to test whether NASP was acting as a nuclear receptor. Since Figure 1 compares nuclear import, we wanted to address nuclear export next and provide a comprehensive analysis of the role of NASP in H3 nuclear dynamics before moving on to other consequences of NASP depletion. We can add graphs with the un-normalized values to a supplemental figure to show the actual difference in total intensity values.

      Fig S2:

      If M1-M3 indicate males, why are the ovaries also derived from males? I think this is just confusing labeling.

      We will change the labelling.

      Supplemental Movie S1:

      Beautiful. Would help to add a time stamp (OPTIONAL).

      Thank you! We will add a time stamp to the movie.

      Fig 3:

      Panel C is the same as Fig S1A (not Fig 1A, as is said in the legend), though I appreciate the authors pointing it out in the legend. Also see line 276.

      We thank the reviewer for pointing this out. We will correct this in the text.

      Panel D is a little confusing, because presumably the "% decrease in import rate" cannot be positive (Y axis). This could be displayed as a scatter (not bar) as in Panels B/C (right) where the top of the Y axis is set to 0.

      We understand the reviewer's concern that the decrease value cannot be positive. We can adjust the y-axis so that it caps off at 0.

      Fig S3:

      A: What do the different panels represent? I originally thought developmental time, but now I think just different representative images? Are these age-matched from time at egg lay?

      The different panels show representative images. We can clarify that in the figure legend.

      C: What does "embryos" mean? Same question for Fig 4A.

      In this figure, "embryos" refers to the exact number of embryos used to prepare the lysate for the western blot. We will clarify this in the figure legend.

      Fig 4:

      A: What does "embryos" mean? Number of embryos? Age in hours?

      In this figure, "embryos" refers to the exact number of embryos used to prepare the lysate for the western blot. We will clarify this in the figure legend.

      C: Not sure the workflow figure panel is necessary, as I can't tell what each step does. This is better explained in methods. However I appreciated the short explanation in the text (lines 314-5).

      The workflow panel helps to identify the samples labelled as input and aggregate in the western blot analysis. Since the input in our western blots does not refer to the total protein lysate, we feel it is helpful to point out exactly which stage of the protocol each sample comes from.

      Minor comments:

      The authors should describe the nature of the NASP alleles in the main text and present evidence of robust NASP depletion, potentially both in ovaries and in embryos. The antibody works well for westerns (Fig S2B). This is sort of demonstrated later in Figure 4A, but only in NASP x twine activated eggs.

      We appreciate the reviewer's comments about the NASP mutant allele. In our previous publication, we characterized the NASP mutant fly line and its effect on both stage 14 egg chambers and the embryos. We will emphasize the reference to our previous work in the text.

      Lines 163, 251, 339: minor typos

      Line 184: It would help to clarify- I'm assuming cytoplasmic concentration (or overall) rather than nuclear concentration. If nuclear, I'd expect the opposite relationship. This occurs again when discussing NASP (line 267). I suspect it's also not absolute concentration, but relative concentration difference between cytoplasm and nucleus. It would help clarify if the authors were more precise.

      We appreciate the reviewer's point and will add the clarification in the text.

      Line 189: Given that the "established integrative model" helps to reject the hypothesis that NASP is involved in H3 import, I think it's important to describe the model a little more, even though it's previously published.

      We will add a few sentences to the text giving a brief description of the model.

      Line 203: "The measured rate of H3.2 export from the nucleus is negligible" clarify this is in WT situations and not a conclusion from this study.

      We will add the clarification of this statement in the text.

      Line 211: How can the authors be so sure that the decrease in WT is due to "the loss of non-chromatin bound nucleoplasmic H3.2-Dendra2?"

      From the live imaging experiments, the H3.2-Dendra2 intensity in the nucleus reduces dramatically upon nuclear envelope breakdown with the only H3.2-Dendra2 intensity remaining being the chromatin bound H3.2. Excess H3.2 is imported into the nucleus and not all of it is incorporated into the chromatin. This is a unique feature of the embryo system that has been observed previously. We mention that the intensity reduction is due to the loss of non-chromatin bound nucleoplasmic H3.2.

      Line 217: In the conclusion, the authors indicate that NASP indirectly affects soluble supply of H3 in the nucleoplasm. I do believe they've shown that the import rate effect is indirect, but I don't know why they conclude that the effect of NASP on the soluble nucleoplasmic H3 supply is indirect. Similarly, the conclusion is indirect on line 239. Yet, the authors have not shown it's not direct, just assumed since NASP results in 50% decrease to deposited maternal histones.

      We appreciate the reviewer's feedback on the conclusions of Figure 2. Our conclusions are primarily based on the effect on H3 levels of the absence of NASP in early embryos. To establish direct causal effects, it would be important to rescue the phenotypes through complementation experiments and to identify the molecular interactions responsible for the effects. In this study we have not established those specific details, so we cannot draw conclusions about direct effects. We will change the text to make this clearer.

      Line 292: What is the nature of the NASP "mutant?" Is it a null? Similarly, what kind of "mutant" is the twine allele? Line 295.

      We will include descriptions of the NASP and twine mutants in the text.

      Line 316: Why did the authors use stage 14 egg chambers here when they previously used embryos? This becomes more clear later shortly, when the authors examine activated eggs, but it's confusing in text.

      The reason to use stage 14 egg chambers was to establish NASP function during oogenesis. We will modify the text to emphasize the reason behind using stage 14 egg chambers.

      Lines 343-348: It's unclear if the authors are drawing extended conclusions here or if they are drawing from prior literature (if so, citations would be required). For example, why during oogenesis/embryogenesis are aggregation and degradation developmentally separated?

      This conclusion is based primarily on the findings from this study (Figure 4) and our previously published work. We will modify the text for more clarity.

      Lines 386-7: I do not understand why the authors conclude that H3 aggregation and degradation are "developmentally uncoupled" and why, in the absence of NASP, "H3 aggregation precedes degradation."

      This is based on the data in Figure 4 combined with our previous work showing that the total level of H3 is not changed in NASP-mutant stage 14 egg chambers. Aggregates seem to be more persistent in the stage 14 egg chambers (oogenesis), and they are cleared upon egg activation (entry into embryogenesis). This provides evidence that aggregation occurs prior to degradation and that these two events occur in different developmental stages. We will change the text to make this clearer.

      Line 395: Why suddenly propose that NASP also functions in the nucleus to prevent aggregation, when earlier the authors suggest it functions only in the cytoplasm?

      We will make the necessary edits to ensure that the results don't suggest a role of NASP exclusive to the cytoplasm. Our findings highlight a cytoplasmic function of NASP; however, we do not want to rule out the possibility that this same function also occurs in the nucleus.

      Lines 409-413: The authors claim that histone deficiency likely does not cause the embryonic arrest seen in embryos from NASP mutant mothers. This is because H3 is reduced by 50% yet some embryos arrest long before they've depleted this supply. However, the authors also showed that H3 import rates are affected in these embryos due to lower H3 concentration. Since the early embryo cycles are so rapid, reduced H3 import rates could lead to early arrest, even though available H3 remains in the cytoplasm.

      We thank the reviewer for their suggestion. This conclusion is based on the findings from the previous study from our lab which showed that the majority of the embryos laid by NASP mutant females get arrested in the very early nuclear cycles (

      Reviewer #1 (Significance (Required)):

      The significance of the work is conceptual, as NASP is known to function in H3 availability but the precise mechanism is elusive. This work represents a necessary advance, especially to show that NASP does not affect H3 import rates, nor does it chaperone H3 into the nucleus. However, the authors acknowledge that many questions remain. Foremost, why is NASP imported into the nucleus and what is its role there?

      I believe this work will be of interest to those who focus on early animal development, but NASP may also represent a tool, as the authors conclude in their discussion, to reduce histone levels during development and examine nucleosome positioning. This may be of interest to those who work on chromatin accessibility and zygotic genome activation.

      I am a genetics expert who works in Drosophila embryogenesis. I do not have the expertise to evaluate the aggregate methods presented in Figure 4.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      This manuscript focuses on the role of the histone chaperone NASP in Drosophila. NASP is a chaperone specific to histone H3 that is conserved in mammals. Many aspects of the molecular mechanisms by which NASP selectively binds histone H3 have been revealed through biochemical studies. However, key aspects of NASP's in vivo roles remain unclear, including where in the cell NASP functions, and how it prevents H3 degradation. Through live imaging in the early Drosophila embryo, which possesses large amounts of soluble H3 protein, Das et al determine that NASP does not control nuclear import or export of H3.2 or H3.3. Instead, they find through differential centrifugation analysis that NASP functions in the cytoplasm to prevent H3 aggregation and hence its subsequent degradation.

      Major Comments:

      The protein aggregation assays raise several questions. From a technical standpoint, it would be helpful to have a positive control to demonstrate that the assay is effective at detecting protein aggregates, i.e., a genotype that exhibits increased protein aggregation; this could be for a protein besides H3.

      A common issue raised by all three reviewers was the need to demonstrate more convincingly that the assay we used to isolate protein aggregates does, in fact, isolate protein aggregates. To verify this, we will perform the aggregate isolation assay using controls that are known to induce more protein aggregation. We will perform the aggregation assay with egg chambers or extracts that are exposed to heat shock or to the aggregation-inducing chemicals canavanine and azetidine-2-carboxylic acid. The chemical treatment was a welcome suggestion from reviewer #3. These experiments will significantly strengthen any claims based on the outcome of the aggregation assay.

      If NASP is not required to prevent H3 degradation in egg chambers, then why are H3 levels much lower in NASP input lanes relative to wild-type egg chambers in Fig 4D?

      We appreciate the reviewer's input regarding the reduced H3 levels in the NASP mutant egg chambers. We observe this reduction in H3 levels in the input because the altered solubility of H3 leads to the loss of H3 protein at different steps of the aggregate isolation assay. We will add a supplemental figure showing H3 levels at different steps of the aggregate isolation assay. We do want to stress, however, that the total levels of H3 in stage 14 egg chambers do not change between WT and the NASP mutant.

      A corollary to this is that the increased fraction of H3 in aggregates in NASP mutants seems to be entirely due to the reduction in total H3 levels rather than an increase in aggregated H3. If NASP's role is to prevent aggregation in the cytoplasm, and degradation has not yet begun in egg chambers, then why are aggregated H3 levels not increased in NASP mutants relative to wild-type egg chambers? If the same number of egg chambers were used, shouldn't the total amount of histone be the same in the absence of degradation?

      In previously published work, we demonstrated that total H3 levels are unaffected when comparing WT and NASP-mutant stage 14 egg chambers. This means that the amount of H3 deposited into the eggs does not change in the absence of NASP. To address the reviewer's comment, we will change the text to make the link to our previous work clear. As stated above, we will add a supplemental figure showing H3 levels at different steps of the aggregate isolation assay.

      The live imaging studies are well designed, executed, and quantified. They use an established genotype (H3.2-Dendra2) in wild-type and NASP maternal mutants to demonstrate that NASP is not directly involved in nuclear import of H3.2. Decreased import is likely due to reduced H3.2 levels in NASP mutants rather than reduced import rates per se. The same methodology was used to determine that loss of NASP did not affect H3.2 nuclear export. These findings eliminate H3.2 nuclear import/export regulation as possible roles for NASP, which had been previously proposed.

      Thank you.

      Live imaging also conclusively demonstrates that the levels of H3.2 in the nucleoplasm and in mitotic chromatin are significantly lower in NASP mutants than wild-type nuclei. Despite these lower histone levels, the nuclear cycle duration is only modestly lengthened. The live imaging of NASP-Dendra2 nuclear import conclusively demonstrates that NASP and H3.2 are unlikely to be imported into the nucleus as one complex.

      Thank you.

      Minor Comments:

      Additional details on how the NASP-Dendra2 CRISPR allele was generated should be provided. In addition, additional details on how it was determined that this allele is functional should be provided (e.g. quantitative assays for fertility/embryo viability of NASP-Dendra2 females).

      We will make these additions to the text.

      If statistical tests are used to determine significance, the type of test used should be reported in the figure legends throughout.

      We will make the addition of the statistical tests to the figure legends.

      The western blot shown in Figure 4A looks more like a 4-fold reduction in H3 levels in NASP mutants relative to wild-type embryos, rather than the quantified 2-fold reduction. Perhaps a more representative blot can be shown.

      We have additional blots in Supplemental Figure S3C. The quantification was performed after normalization to total protein levels, and we can highlight that in the figure legend.

      Reviewer #2 (Significance (Required)):

      As a fly chromatin biologist with colleagues that utilize mammalian experimental systems, I feel this manuscript will be of broad interest to the chromatin research community. Packaging of the genome into chromatin affects nearly every DNA-templated process, making the mechanisms by which histone proteins are expressed, chaperoned, and deposited into chromatin of high importance to the field. The study has multiple strengths, including high-quality quantitative imaging, use of a terrific experimental system (storage and deposition of soluble histones in early fly embryos). The study also answers outstanding questions in the field, specifically that NASP does not control nuclear import/export of histone H3. Instead, the authors propose that NASP functions to prevent protein aggregation. If this could be conclusively demonstrated, it would be valuable to the field. However, the protein aggregation studies need improvement. Technical demonstration that their differential centrifugation assay accurately detects aggregated proteins is needed. Further, NASP mutants do not exhibit increased H3 protein aggregation in the data presented. Instead, the increased fraction of aggregated H3 in NASP mutants seems to be due to a reduction in the overall levels of H3 protein, which is contrary to the model presented in this paper.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This manuscript by Das et al. entitled "NASP functions in the cytoplasm to prevent histone H3 aggregation during early embryogenesis", explores the role of the histone chaperone NASP in regulating histone H3 dynamics during early Drosophila embryogenesis. Using primarily live imaging approaches, the authors found that NASP is not directly involved in the import or export of H3. Moreover, the authors claimed that NASP prevents H3 aggregation rather than protects against degradation.

      Major Comments:

      Figure 1A-B: The plotted data appear to have substantial dispersion. Could the authors include individual data points or provide representative images to help the reader assess variability?

      We chose to show unnormalized data in Figure 1 so readers could better compare the actual import values of H3 in the presence and absence of NASP. We felt it was a better representation of the true biological difference, although the raw data are more dispersed. We did also include normalized data in the supplement. Regardless, we will add representative stills to Figure 1 and include an H3-Dendra2 movie in the supplement to show the representative data.

      Given that the authors conclude that the reduced nuclear import is due to lowered H3 levels in NASP-deficient embryos, would overexpression of H3 rescue this phenotype? This would directly test whether H3 levels, rather than import machinery per se, drive the effect.

      We thank the reviewer for their valuable suggestion. We and others have tried to overexpress histones in the Drosophila early embryo without success. There must be an undefined feedback mechanism preventing histone overexpression in the germline. In fact, a recent paper has been deposited on bioRxiv (https://doi.org/10.1101/2024.12.23.630206) that suggests H4 protein could provide a feedback mechanism to prevent histone overexpression. While we would love to do this experiment, it is not technically feasible at this time.

      Figure 2A-B: The authors present the Relative Intensity of H3-Dendra2, but this metric obscures absolute differences between Control and NASP knockout embryos. Please include Total Intensity plots to show the actual reduction in H3 levels.

      We will add the total H3-Dendra2 intensity plots to the supplemental figure for the export curves.

      Additionally, Western blot analysis of nucleoplasmic H3 from wild-type vs. NASP-deficient embryos would provide essential biochemical confirmation of H3 level reductions.

      We will measure nuclear H3 levels by western blot in 0-2 hr embryos laid by WT and NASP mutant flies.

      Figure 4: To support the conclusion that NASP prevents H3 aggregation, I recommend performing aggregation assays by adding compounds that induce unfolding (amino acid analogues that induce unfolding, like canavanine or Azetidine-2-carboxylic acid) or using aggregation-prone H3 mutants.

      This is a very helpful suggestion! It is difficult to get chemicals into Drosophila eggs, but we will treat extracts directly with these chemicals. Additionally, we will use heat shocked eggs and extracts as an additional control.

      Inclusion of CMA and proteasome inhibition experiments could also clarify whether degradation pathways are secondarily involved or compensatory in the absence of NASP.

      The degradation pathway for H3 in the absence of NASP is unknown and a major focus of our future work is to define this pathway. Drosophila does not have a CMA pathway and therefore, we don't know how H3 aggregates are being sensed.

      Minor Comments:

      (1) The Introduction would benefit from mentioning the two NASP isoforms that exist in mammals (sNASP and tNASP), as this evolutionary context may inform interpretation of the Drosophila results.

      We will edit the text to state that Drosophila NASP is the sole homolog of sNASP and that a tNASP ortholog is not found in Drosophila.

      (2) Could the authors comment on the status of histone H4 in their experimental system? Given the observed cytoplasmic pool of H3, is it likely to exist as a monomer? If this H3 pool is monomeric, does that suggest an early failure in H3-H4 dimerization, and could this contribute to its aggregation propensity?

      In our previous work we noted that NASP binds preferentially to H3 and that the levels of H3 were much more reduced upon NASP depletion than those of H4. We pointed out in that publication that our data were consistent with H3 stores being monomeric in the Drosophila embryo. We don't have an H4-Dendra2 line to test this. In the future, however, this is something we are very keen to look at.

      Reviewer #3 (Significance (Required)):

      This work addresses a timely and important question in the field of chromatin biology and developmental epigenetics. The focus on histone homeostasis during embryogenesis and the cytoplasmic role of NASP adds a novel perspective. The live imaging experiments are a clear strength, providing valuable spatiotemporal insights. However, I believe that the manuscript would benefit significantly from additional biochemical validation to support and clarify some of the mechanistic claims.

      3. Description of the revisions that have already been incorporated in the transferred manuscript

      4. Description of analyses that authors prefer not to carry out

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The methods section is overly brief. Even if techniques are cited, more experimental details should be included. For example, since the study focuses heavily on methodology, details such as the number of PCR cycles in RT-PCR or the rationale for choosing HA and PB2 as representative in vitro transcripts should be provided.

      We thank the reviewer for this important suggestion. We have now expanded the Methods section to include the number of PCR cycles used in RT-PCR (line 407) and have explained the rationale for choosing HA and PB2 as representative transcripts (line 388).

      (2) Information on library preparation and sequencing metrics should be included. For example, the total number of reads, any filtering steps, and quality score distributions/cutoff for the analyzed reads.

      We agree and have added detailed information on library preparation, filtering criteria, quality score thresholds, and sequencing statistics for each sample (line 422, Figure S2).

      (3) In the Results section (line 115, "Quantification of error rate caused by RT"), the mutation rate attributed to viral replication is calculated. However, in line 138, it is unclear whether the reported value reflects PB2, HA, or both, and whether the comparison is based on the error rate of the same viral RNA or the mean of multiple values (as shown in Figure 3A). Please clarify whether this number applies universally to all influenza RNAs or provide the observed range.

      We appreciate this point. We have clarified in the Results (line 140) that the reported value corresponds to PB2.

      (4) Since the T7 polymerase introduced errors are only applied to the in vitro transcription control, how were these accounted for when comparing mutation rates between transcribed RNA and cell-culture-derived virus?

      We agree that errors introduced by T7 RNA polymerase are present only in the in vitro–transcribed RNA control. However, even though the control carries these additional errors, the error rate detected in the in vitro transcripts remained substantially lower than that observed in the viral RNA extracted from replicated virus (line 140, Fig. 3a). Thus, the difference cannot be explained by T7-derived errors, and our conclusion regarding the elevated mutation rate in cell-culture–derived viral populations remains valid.

      (5) Figure 2 shows that a UMI group size of 4 has an error rate of zero, but this group size is not mentioned in the text. Please clarify.

      We have revised the Results (line 98) to describe the UMI group size of 4.

      Reviewer #2 (Public review):

      (1) The application of UMI-based error correction to viral population sequencing has been established in previous studies (e.g., HIV), and this manuscript does not introduce a substantial methodological or conceptual advance beyond its use in the context of influenza.

      We appreciate the reviewer’s comment and agree that UMI-based error correction has been applied previously to viral population sequencing, including HIV. However, to our knowledge, relatively few studies have quantitatively evaluated both the performance of this method and the resulting within-quasi-species mutation distributions in detail. In our manuscript, we not only validate the accuracy of UMI-based error correction in the context of influenza virus sequencing, but also quantitatively characterize the features of intra-quasi-species distributions, which provides new insights into the mutational landscape and evolutionary dynamics specific to influenza. We therefore believe that our work goes beyond a simple application of an established method.

      (2) The study lacks independent biological replicates or additional viral systems that would strengthen the generalizability of the conclusions.

      We agree with the reviewer that the lack of independent biological replicates and additional viral systems limits the generalizability of our findings. In this study, we intentionally focused on single-particle–derived populations of influenza virus to establish a proof-of-principle for our sequencing and analytical framework. While this design provided a clear demonstration of the method’s ability to capture mutation distributions at the single-particle level, we acknowledge that additional biological replicates and testing across diverse viral systems would be necessary to confirm the broader applicability of our observations. Importantly, even within this limited framework, our analysis enabled us to draw conclusions at the level of individual viral populations and to suggest the possibility of comparing their mutation distributions with known evolvability. This highlights the potential of our approach to bridge observations from single particles with broader patterns of viral evolution. In future work, we plan to expand the number of populations analyzed and include additional viral systems, which will allow us to more rigorously assess reproducibility and to establish systematic links between mutation accumulation at the single-particle level and evolutionary dynamics across viruses.

      (3) Potential sources of technical error are not explored or explicitly controlled. Key methodological details are missing, including the number of PCR cycles, the input number of molecules, and UMI family size distributions.

      We thank the reviewer for this important suggestion. We have now expanded the Methods section to include the number of PCR cycles used in RT-PCR (line 407). In addition, we have added information on the estimated number of input molecules. Regarding the UMI family size distributions, we have added the data as Figure S2 and referred to it in the revised manuscript.

      Finally, with respect to potential sources of technical error, we note that this point is already addressed in the manuscript by direct comparison with in vitro transcribed RNA controls, which encompass errors introduced throughout the entire experimental process. This comparison demonstrates that the error-correction strategy employed here effectively reduces the impact of PCR or sequencing artifacts.

      (4) The assertion that variants at ≥0.1% frequency can be reliably detected is based on total read count rather than the number of unique input molecules. Without information on UMI diversity and family sizes, the detection limit cannot be reliably assessed.

      We thank the reviewer for raising this important issue. We agree that our original description was misleading, as the reliable detection limit should not be defined solely by total read count. In the revised version, we have added information on UMI distribution and family sizes (Figure S2), and we now state the detection limit in terms of consensus reads. Specifically, we define that variants can be reliably detected when ≥10,000 consensus reads are obtained with a group size of ≥3 (line 173). 

      (5)  Although genetic variation is described, the functional relevance of observed mutations in HA and NA is not addressed or discussed.

      We appreciate the reviewer’s suggestion. In our study, we did not apply drug or immune selection pressure; therefore, we did not expect to detect mutations that are already known to cause major antigenic changes in HA or NA, and we think it is difficult to discuss such functional implications in this context. However, as noted in the Discussion, we did identify drug resistance–associated mutations. This observation suggests that the quasi-species pool may provide functional variation, including resistance, even in the absence of explicit selective pressure. We have clarified this point in the text to better address the reviewer’s concern (line 330).

      (6) The experimental scale is small, with only four viral populations derived from single particles analyzed. This limited sample size restricts the ability to draw broader conclusions.

      We thank the reviewer for pointing out the limitation of analyzing only four viral populations derived from single particles. We fully acknowledge that the small sample size restricts the generalizability of our conclusions. Nevertheless, we would like to emphasize that even within this limited dataset, our results consistently revealed a slight but reproducible deviation of the mutation distribution from the Poisson expectation, as well as a weak correlation with inter-strain conservation. These recurring patterns highlight the robustness of our observations despite the sample size.

      In future work, we plan to expand the number of viral populations analyzed and to monitor mutation distributions during serial passage under defined selective pressures. We believe that such expanded analyses will enable us to more reliably assess how mutations accumulate and to develop predictive frameworks for viral evolution.

      Reviewer #1 (Recommendations for the authors):

      (1)  Please mention Figure 1 and S2 in the text.

      Done. We now explicitly reference Figures 1 and S2 (renamed to S1 according to appearance order) in the appropriate sections (lines 74, 124).

      (2)  In Figure 4A, please specify which graph corresponds to PB2 and which to PB2-like sequences.

      Corrected. The Figure 4A legend now specifies PB2 vs. PB2-like sequences.

      (3)  Consider reducing redundancy in lines 74, 149, 170, 214, and 215.

      We thank the reviewer for this stylistic suggestion. We have revised the text to reduce redundancy in these lines.

      Reviewer #2 (Recommendations for the authors):

      (1)  The manuscript states that "with 10,000 sequencing reads per gene ...variants at ≥0.1% frequency can be reliably detected." However, this interpretation conflates raw read counts with independent input molecules.

      We have revised this statement throughout the text to clarify that sensitivity depends on the number of unique UMIs rather than raw read counts (line 173). To support this, we calculated the probability of detecting a true variant present at a frequency of 0.1% within a population. When sequencing ≥10,000 unique molecules, such a variant would be observed at least twice with a probability of approximately 99.95%. In contrast, the error rate of in vitro–transcribed RNA, reflecting errors introduced during the experimental process, was estimated to be on the order of 10⁻⁶ (line 140, Fig. 3a). Under this condition, the probability that the same artificial error would arise independently at the same position in two out of 10,000 molecules is <0.5%. Therefore, variants present at ≥0.1% can be reliably distinguished from technical artifacts and are confidently detected under our sequencing conditions.
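      The two probabilities quoted above can be checked with a short calculation. The sketch below assumes a simple binomial model in which each of the 10,000 unique molecules is sampled independently, and it treats the 10⁻⁶ error rate as the per-molecule probability of one specific artifact at a given position. These modelling choices and all names in the code are illustrative assumptions, not the authors' actual analysis code.

```python
from math import comb


def prob_at_least(n: int, p: float, k: int = 2) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    # Sum the probabilities of seeing fewer than k hits, then take the complement.
    below = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - below


# Assumed inputs, taken from the figures quoted in the response above.
UNIQUE_MOLECULES = 10_000  # unique UMIs sequenced per gene
VARIANT_FREQ = 1e-3        # true variant present at 0.1% frequency
ERROR_RATE = 1e-6          # error rate estimated from in vitro-transcribed RNA

p_detect = prob_at_least(UNIQUE_MOLECULES, VARIANT_FREQ)
p_artifact = prob_at_least(UNIQUE_MOLECULES, ERROR_RATE)

print(f"P(true 0.1% variant seen at least twice): {p_detect:.4%}")
print(f"P(same artifact seen at least twice):     {p_artifact:.4%}")
```

      Under these assumptions the first value agrees with the approximately 99.95% quoted above, and the second falls well below the stated 0.5% bound.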

      (2) To support the claimed sensitivity, please provide for each gene and population: (a) UMI family size distributions, (b) number of PCR cycles and input molecule counts, and (c) recalculation of the detection limit based on unique molecules.

      If possible, I encourage experimental validation of sensitivity claims, such as spike-in controls at known variant frequencies, dilution series, or technical replicates to demonstrate reproducibility at the 0.1% detection level.

      We have added (a) histograms of UMI family size distributions for each gene and population (Figure S2), (b) a detailed RT-PCR protocol and estimated input molecule counts (line 407), and (c) recalculated detection limits (line 173).

      We appreciate the reviewer’s suggestion and fully recognize the value of spike-in experiments. However, given the observed mutation rate of T7-derived RNA and the sufficient sequencing depth in our dataset, it is evident that variants above the 0.1% threshold can be robustly detected without additional spike-in controls.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Review report for 'Sterols regulate ciliary membrane dynamics and hedgehog signaling in health and disease', Lamazière et al.

      Reviewer #1

      In this manuscript, Lamazière et al. address an important understudied aspect of primary cilium biology, namely the sterol composition in the ciliary membrane. It is known that sterols especially play an important role in signal transduction between PTCH1 and SMO, two upstream components of the Hedgehog pathway, at the primary cilium. Moreover, several syndromes linked to cholesterol biosynthesis defects present clinical phenotypes indicative of altered Hh signal transduction. To understand the link between ciliary membrane sterol composition and Hh signal transduction in health and disease, the authors developed a method to isolate primary cilia from MDCK cells and coupled this to quantitative metabolomics. The results were validated using biophysical methods and cellular Hh signaling assays. While this is an interesting study, it is not clear from the presented data how general the findings are: can cilia be isolated from different mammalian cell types using this protocol? Is the sterol composition of MDCK cells expected to be the same in fibroblasts or other cell types? Without this information, it is difficult to judge whether the conclusions reached in fibroblasts are indeed directly related to the sterol composition detected in MDCK cells. Below is a detailed breakdown of suggested textual changes and experimental validations to strengthen the conclusions of the manuscript.

      We would like to thank the reviewer for their helpful comments.

      Major comments:

      • It appears that the comparison has been made between ciliary membranes and the rest of the cell's membranes, which includes many other membranes besides the plasma membrane. This significantly weakens the conclusions on the sterol content specific to the cilium, as it may in fact be highly similar to the rest of the plasma membrane. It is for example known that lathosterol is biosynthesized in the ER, and therefore the non-presence in the cilium may reflect a high abundance in the ER but not necessarily in the plasma membrane.

      The reviewer is correct that we compared the sterol composition of the primary ciliary membrane to the average of the remaining cellular membranes. We agree that this broader reference fraction contains multiple intracellular membranes, including ER- and Golgi-derived compartments, and therefore does not isolate the plasma membrane specifically. We would like to emphasize that our study did not aim to compare the cilium directly to the plasma membrane, nor did we claim that the comparison was in any way related to the plasma membrane. It is also worth noting that previous studies in other ciliated organisms have reported a higher cholesterol content in cilia compared to the plasma membrane, suggesting that the two membranes may not be compositionally identical despite their continuity. However, we concur that determining the sterol composition of the MDCK plasma membrane would provide valuable context and enable a comparison with the membrane continuous with the ciliary membrane. Hence, we are willing to try isolating plasma membrane in the same cellular contexts.

      • While the protocol to isolate primary cilium from MDCK cells is a valuable addition to the methods available, it would be good to at least include a discussion on its general applicability. Have the authors tried to use this protocol on fibroblasts for example?

      We thank the reviewer for the positive comment on the value of the ciliary isolation protocol. Indeed, we have attempted to apply the same approach to other ciliated cell types, namely IMCD3 and MEF cells. In the case of IMCD3 cells, we were able to isolate primary cilia using the same general strategy; however, we are still refining the preparation, as the overall yield is lower than in MDCK cells and the amount of material obtained is currently insufficient for comprehensive biochemical analyses. With MEF (fibroblast) cells, the procedure proved even more challenging, as the yield of isolated cilia was extremely low. This difficulty is likely due to the shorter length of fibroblast cilia and to their positioning beneath the cell body, which probably makes them more resistant to detachment. Overall, these observations suggest that while the protocol can be adapted to other cell types, its efficiency depends on cellular architecture. We have added a discussion of these aspects in the revised manuscript to clarify the method's current scope and limitations (lines 492-502).

      • Some of the conclusions in the introduction (lines 75-80) seem to be incorrectly phrased based on the data: in basal conditions, ciliary membranes are already enriched in cholesterol and desmosterol, and the treatment lowers this in all membranes.

      We agree; this has been modified in the revised manuscript (lines 75-80).

      • There seems to be little effect of simvastatin on overall cholesterol levels. Can the authors comment on this result? How would the membrane fluidity be altered when mimicking simvastatin-induced composition? Since the effect on Hh signaling appears to be the biggest (Figure 5B) under simvastatin treatment, it would be interesting to compare this against that found for AY9944 treatment. Also, the authors conclude that the effects of simvastatin treatment on ciliary membrane sterol composition are the mildest, however, one could argue that they are the strongest as there is a complete lack of desmosterol.

      We thank the reviewer for these insightful comments. Regarding the modest overall effect of simvastatin on cholesterol levels, we would like to note that MDCK cells are an immortalized epithelial cell line with high metabolic plasticity. Such cancer-like cell types are known to exhibit enhanced de novo lipogenesis, particularly under culture conditions with ample glucose availability. This compensatory lipid biosynthesis can partially counterbalance pharmacological inhibition of the cholesterol biosynthetic pathway. Because simvastatin acts upstream in the pathway (at HMG-CoA reductase), its inhibition primarily reduces early intermediates rather than fully depleting end-product cholesterol, explaining the relatively mild changes observed in total cholesterol content.

      Concerning desmosterol, we agree with the reviewer that its complete loss under simvastatin treatment is a striking finding that deserves further discussion. Interestingly, our data show that simvastatin treatment produces the strongest inhibition of pathway activation (as measured by SMO activation), but the weakest effect on signal transduction downstream of constitutively active SMOM2. This dichotomy suggests that the absence of desmosterol may preferentially affect the activation step of Hedgehog signaling at the ciliary membrane, without equally impacting downstream propagation. We have expanded the Results section to highlight this potential role of desmosterol in the activation phase of Hedgehog signaling and to contrast it with the effects observed under AY9944 treatment (lines 463-469).

      It is not clear to me why the authors have chosen to use SAG to activate the Hh pathway, as this is a downstream mode of activation and bypasses PTCH1 (and therefore a potentially sterol-mediated interaction between the two proteins). It would be very informative to compare the effect of sterol modulation on the ability of ShhN vs SAG to activate the pathway.

      Our study aims to demonstrate that the sterol composition of the ciliary membrane plays an essential role in the proper functioning of the Hedgehog (Hh) signaling pathway, comparable in importance to that of oxysterols and free cholesterol. Because ShhN itself is covalently modified by cholesterol, and Smoothened (SMO) can be directly activated by both oxysterols and cholesterol, we reasoned that using a non-native SMO agonist such as SAG would allow us to specifically assess defects arising from alterations in membrane-bound sterols. In this way, pathway activation by SAG provides a more direct readout of the functional contribution of ciliary membrane sterols to SMO activity, independent of potential confounding effects related to ShhN processing, secretion, or PTCH1-mediated regulation.

      • The conclusions about the effect of tamoxifen on SMO trafficking in MEFs should be validated in human patient cells before being able to conclude that there is a potential off-target effect (line 438). Also, if that is the case, the experiment of tamoxifen treatment of EBP KO cells should give an additional effect on SMO trafficking. Also, could the CDPX2 phenotypes in patients be the result of different cell types being affected than the fibroblast used in this study?

      We agree that carrying out the proposed experiment would be a good way to assess a potential off-target effect. However, such validation is beyond the scope of the present study, as our comment on a potential off-target effect was intended primarily to propose a mechanistic hypothesis to explain the differences observed in Hedgehog pathway activation between patient-derived fibroblasts and tamoxifen-treated MEFs. We leaned towards this hypothesis because drug treatments are known for their variable specificity, but we agree that other hypotheses are possible, among them the difference in cell type, as both are fibroblasts but of different origins. We rephrased this passage in the revised manuscript (lines 447-448).

      Regarding the reviewer's third point, we fully agree that the CDPX2 phenotype in patients is unlikely to arise solely from fibroblast dysfunction. Nevertheless, fibroblasts are the only patient-derived cells currently available to us, and they provide a useful model for assessing ciliary signaling. It is reasonable to expect that similar defects could occur in other, more physiologically relevant cell types.

      • For the experiments with the SMO-M2 mutant, it would be useful to show the extent of pathway activation by the mutant compared to SAG or ShhN treatment of non-transfected cells. Moreover, it will be necessary to exclude any direct effects of the compound treatment on the ability of this mutant to traffic to the primary cilium, which can easily be done using fluorescence microscopy as the mutant is tagged with mCherry.

      The SmoM2 mutant is indeed a well-characterized constitutively active form of Smoothened that has been extensively studied by us and others. It is well established that this mutant correctly localizes to the primary cilium and robustly activates the Hedgehog pathway in MEFs (see Eguether et al., Dev. Cell, 2014, or Eguether et al., Mol. Biol. Cell, 2018). In our study, we have already included supporting evidence for pathway activation in Supplementary Figure S1b, showing Gli1 expression levels in untreated MEFs transfected with SmoM2, which illustrates the extent of its activation compared to ligand-induced conditions.

      In line with the reviewer's recommendation, we will additionally include microscopy data showing SmoM2 localization in MEFs treated with the different sterol modulators. These data should confirm that the observed effects are not due to altered ciliary trafficking of the mutant protein but instead reflect changes in downstream signaling or membrane composition.

      Minor comments:

      Line 74: 'in patients', should be rephrased to 'patient-derived cells'

      This was modified in the revised manuscript.

      Figure 2A: What do the '+/-' indicate? They seem to be erroneously placed.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Figure 2B: no label present for which bar represents cilia/other membranes

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Figure 2C: this representation is slightly deceptive, since the difference between cells and cilia for lanosterol is not significantly different as shown in figure 2A.

      This representation has been removed in the revised figures.

      Figure 3A: it would be useful to also show where 8-DHC is in the biosynthetic pathway.

      This has been modified in the revised figures.

      Line 373: the title should be rephrased as it infers that DHCR7 was blocked in model membranes, which is not the case.

      This has been modified in the revised manuscript.

      Lines 377-384: this paragraph seems to be a mix of methods and some explanation, but should be rephrased for clarity.

      We believe the technical information within this paragraph is useful for the reader's understanding. We would rather leave it as is unless recommended otherwise by other reviewers or editorial staff.

      Line 403: 'which could explain the resulting defects in Hedgehog signaling': how and what defects? At this point in the study no defects in Hh signaling have been shown.

      This has been modified in the revised manuscript.

      Figure 4D: 'd' is missing

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Line 408: SAG treatment resulted in slightly shorter cilia: this is not the case for just SAG treated cilia, but only for the combination of SAG + AY9944. However, in that condition there appears to be a subpopulation of very short cilia, are those real?

      This is correct: this is not the case for untreated cilia. The short population is real, however, and is present not only with AY9944 but also with tamoxifen and simvastatin. Again, the relevance and significance of minor changes in cilia length are unclear, and we do not draw any conclusion from them other than that the ciliary compartment is modified.

      Figure 5b: it would be good to add that all conditions contained SAG.

      This has been modified in the revised figures.

      Figure 5D: Since it is shown in Fig 5C that there are no positive cilia -SAG, there is no point to have empty graphs in Fig 5D on the left side, nor can any statistics be done. Similarly for 5K.

      We think this is still worth having in the figure. As the reviewer noted in one of their subsequent comments, there are cases where Smoothened or Patched can be abnormally distributed (see also Eguether et al., Mol. Biol. Cell, 2018). This shows that we checked all conditions for the presence or absence of Smo and that there is no signal to be found. We would rather leave it as is unless asked otherwise by editorial staff.

      Figure 5E: it is not clearly indicated what is visualized in the inserts, sometimes it's a box, sometimes a line and they seem randomly integrated into the images.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Figure 5H: is this the intensity in just SMO positive cilia? If yes, this should be indicated, and the line at '0' for WT-SAG should be removed. I am also surprised there is then ns found for WT vs SLO, since in WT there are no positive cilia, but in SLO there are a few, so it appears to be more of a black-white situation. Perhaps it would be useful to split the data from different experiments to see if it consistently the case that there is a low percentage of SMO positive cilia in SLO cells.

      Yes, as in the rest of Figure 5, the fluorescence intensity of Smo is only taken into account in SMO-positive cells. This is now indicated in the figure legend (lines 890, 898, 903). As for splitting the data by experiment, this is a good suggestion. We checked: for cilia in non-activated SLO patients, there are 8 positive cilia out of a total of 240 counted cilia, mainly from one of the experiments. We could remove the data or leave it as is, given that the result is not significant.

      Fig S1: panels are inverted compared to mentioning in the text.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Methods-pharmacological treatments: there appear to be large differences in concentrations chosen to treat MDCK versus MEF cells - can the authors comment on these choices and show that the enzymes are indeed inhibited at the indicated concentrations?

      We thank the reviewer for this important comment. The concentrations of the pharmacological treatments were optimized separately for MDCK and MEF cells based on cell-type-specific tolerance. For each compound, we used the highest concentration that produced no detectable cytotoxicity or morphological changes. These conditions ensured that the treatments were effective (as seen by changes in sterol composition in MDCK cilia and Hh pathway phenotypes in treated MEFs) and compatible with cell viability and ciliation. Although we did not directly assay enzymatic inhibition in each case, the selected concentrations are consistent with those previously reported to inhibit the targeted enzymes in similar cellular contexts.

      | Compound | Typical Concentration Range in Mammalian Cell Culture | Typical Exposure Duration | Example Cell Types | Representative Peer-Reviewed References |
      | --- | --- | --- | --- | --- |
      | AY9944 (DHCR7 inhibitor) | 1-10 µM widely used; 1 µM for minimal on-target effects; 2.5-10 µM for robust sterol shifts | 24-72 h; some sterol studies up to several days | HEK293, fibroblasts, neuronal cells, macrophages | Kim et al., J Biol Chem, 2001 - used 1 µM in dose-response experiments; Haas et al., Hum Mol Genet, 2007 - 1 µM in cell-based assays; Recent macrophage sterol study - 2.5-10 µM to induce 7-DHC accumulation |
      | Simvastatin (HMG-CoA reductase inhibitor) | 0.1-10 µM common; 1-10 µM most widely used for robust pathway inhibition | 24-72 h | Diverse mammalian lines, including liver, fibroblasts, epithelial cells | Bytautaite et al., Cells (2020) - discusses common in-vitro ranges (1-10 µM); Mullen et al., 2011 - used 10 µM simvastatin, noting it is a standard in-vitro concentration |
      | Tamoxifen (modulator of sterol metabolism) | 1-20 µM; 1-5 µM for mild/longer treatments; 10-20 µM in cancer/cilia signaling studies | 24-72 h (longer treatments often at 1-5 µM) | MDCK, MEFs, MCF-7, diverse epithelial lines | Schlottmann et al., Cells (2022) - used 5-25 µM in sterol-related cell studies; MCF-7 literature - 0.1-1 µM for estrogenic signaling, higher (5-10 µM) for metabolic/sterol pathway effects; Additional cancer cell work indicating similar ranges |

      This information has been clarified in the revised Methods section (lines 222-224).

      (optional): it would be interesting to include a gamma-tubulin staining on the cilium prep to see if there is indeed a presence of the basal body as suggested by the proteomics data.

      Thank you, we will try this.

      There are many spelling mistakes and inconsistencies throughout the manuscript and its figures (mix of French and English for example) so careful proofreading would be warranted. Moreover, there are many mentions of 'Hedgehog defects' or 'Hedgehog-linked', where in fact it is a defect in or link to the Hedgehog pathway, not the protein itself. This should be corrected.

      We thank the reviewer for noting these issues. We apologize for the inconsistencies observed in the initial submission. As mentioned previously, some of the figures inadvertently included earlier versions, which may have contributed to the errors identified. All figures have now been carefully revised and updated in the resubmitted manuscript.

      Regarding the text, we are surprised to hear about the spelling inconsistencies, as the manuscript was professionally proofread prior to submission (documentation can be provided upon request). Nevertheless, we have conducted an additional round of thorough proofreading to ensure consistency throughout the text and figures.

      Finally, we have corrected all instances of "Hedgehog defects" or "Hedgehog-linked" to the more accurate phrasing "Hedgehog pathway defect" or "Hedgehog pathway-linked," as suggested by the reviewer throughout the manuscript.

      Reviewer #1 (Significance (Required)):

      The study of ciliary membrane composition is highly relevant to understand signal transduction in health and disease. As such, the topic of this manuscript is significant and timely. However, as indicated above, there are limitations to this study, most notably the comparison of ciliary membrane versus all cellular membranes (rather than the plasma membrane), which weakens the conclusions that can be drawn. Moreover, cell-type dependency should be more thoroughly addressed. There certainly is a methodological advance in the form of cilia isolation from MDCK cells, however, it is unclear how broadly applicable this is to other mammalian cell types.

      We would like to thank the reviewer for their helpful comments, and we appreciate the reviewer's recognition of the relevance and timeliness of studying ciliary membrane composition in the context of signaling regulation. We fully acknowledge that our comparison was made between the primary ciliary membrane and the total cellular membrane fraction, which encompasses multiple intracellular membranes. Our intent, however, was to obtain a global overview of how the ciliary membrane differs from the average membrane environment within the cell, thereby highlighting features that are unique to the cilium as a signaling organelle. This approach provides valuable baseline information that complements, rather than replaces, future targeted comparisons with the plasma membrane. As mentioned in this reply, we aim to carry out these experiments before publication. Regarding cell-type dependency, we concur that ciliary lipid composition may vary between cell types, reflecting differences in their functional specialization. Our method was intentionally established in MDCK cells, which are epithelial and highly ciliated, to ensure sufficient yield and reproducibility. We have initiated trials with other mammalian cell types, including IMCD3 and MEF cells, and while yields remain limited, preliminary results indicate that the approach is adaptable with further optimization. Thus, our current work establishes a robust and reproducible proof of concept in a mammalian model, providing the first detailed sterol fingerprint of a mammalian primary cilium.

      We believe this constitutes a significant methodological and conceptual advance, as it opens the way for systematic exploration of ciliary lipid composition across diverse mammalian systems and pathological contexts.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Overview

      Accumulating evidence suggests that sterols play critical roles in signal transduction within the primary cilium, perhaps most notably in the Hedgehog cascade. However, the precise sterol composition of the primary cilium, and how it may change under distinct biological conditions, remains unknown, in part because of the lack of reproducible, widely accepted procedures to purify primary cilia from mammalian cultured cells. In the present study, the authors have designed a method to isolate the cilium from MDCK cells efficiently and then utilized this procedure in conjunction with mass spectrometry to systematically analyze the sterol composition of the ciliary membrane, which they then compare to the sterol composition of the cell body. By analyzing this sterol profile, the authors claim that the cilium has a distinct sterol composition from the cell body, including higher levels of cholesterol and desmosterol but lower levels of 8-DHC and lathosterol. This manuscript further demonstrates that alteration of sterol composition within cilia modulates Hedgehog signaling. These results strengthen the link between dysregulated Hedgehog signaling and defects in cholesterol biosynthesis pathways, as observed in SLOS and CDPX2.

      While the ability to isolate primary cilia from cultured MDCK cells represents an important technical achievement, the central claim of the manuscript - that cilia have a different sterol composition from the cell body - is not adequately supported by the data, and more rigorous comparisons between the ciliary membrane and key organellar membranes (such as plasma membrane) are required to make this claim. Moreover, although the authors have repeatedly mentioned that the ciliary sterol composition is "tightly regulated", there is no evidence provided to support such a claim. At best, the data suggest that the cilium and cell body may differ in sterol composition (though even that remains uncertain), but no underlying regulatory mechanisms are demonstrated. In addition, much of the 2nd half of the paper represents a rehash of experiments with sterol biosynthesis inhibitors that have already been published in the literature, making the conceptual advance modest at best. Lastly, the link between CDPX2 and defective Hedgehog signaling is tenuous.

      We would like to thank the reviewer for their helpful comments.

      Major comments

      Figure 1. C) Although the isolation of cilium from the MDCK cells using dibucaine treatment seems to be very efficient, the quality control of their fractionation procedure to monitor the isolation is limited to a single western blot of the purified cilia vs. cell body samples, with no representative data shown from the sucrose gradient fractionation steps. Given that prior studies (including those from the Marshall lab cited in this manuscript) found that 1) sucrose gradient fractionation was essential to obtain relatively pure ciliary fractions, and 2) the ciliary fractions appear to spread over many sucrose concentrations in those prior studies, the authors should have included the comparison of the fractionation profile from the sucrose gradient while isolating the primary cilium. This additional information would have further clarified and supported the efficiency of their proposed method.

      We thank the reviewer for their insightful comments regarding the quality control of our ciliary fractionation. We would like to clarify several important methodological aspects that distinguish our approach from those used in the studies cited (including those from the Marshall lab). In the cited work, the authors used a continuous sucrose gradient ranging from 30% to 45%, which allowed visualization of the distribution of ciliary proteins across the gradient. In contrast, we employed a discontinuous sucrose gradient (25%/50%) optimized for higher recovery and reproducibility in our hands. In our preparation, the primary cilia consistently localize at the interface between the 25% and 50% layers. We systematically collect five 1 mL fractions from this interface and use fractions 1-3 for downstream analyses, as fractions 4-5 are typically already depleted of ciliary material. This targeted collection ensures good enrichment and low contamination, while avoiding unnecessary dilution of the limited ciliary sample.

      We also note that the prior studies the reviewer refers to were optimized for proteomic analyses, and therefore used actin as a marker of contamination from the cell body. In our case, the downstream application is lipidomic profiling, for which such protein-based contamination markers are not directly informative, since no reliable lipid marker exists to differentiate between organelle membranes. For this reason, we limited the protein-level validation to a semi-quantitative assessment of ciliary enrichment using ARL13B Western blotting, which robustly reports the presence and enrichment of ciliary membranes.

      Finally, to complement this targeted validation, we performed proteomic analysis followed by Gene Ontology (GO) Enrichment Analysis using the PANTHER database. This analysis evaluates the overrepresentation of proteins associated with ciliary structures and functions relative to the background frequency in the Canis lupus familiaris proteome. The resulting enrichment profile confirms that the isolated material is highly enriched in ciliary components and somewhat depleted of non-ciliary contaminants, thereby serving as an unbiased and global assessment of sample specificity and purity. We believe that, together, these methodological choices provide a rigorous and quantitative validation of our fractionation efficiency and support the robustness of the cilia isolation protocol used in this study.
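      As a rough illustration of what an over-representation test of this kind computes, the sketch below evaluates a one-sided hypergeometric tail probability and a fold enrichment for a single GO term. It is not a reproduction of the PANTHER analysis, whose exact statistical test and multiple-testing correction are not specified in this reply. The function names and every count in the example are hypothetical and do not come from the manuscript.

```python
from math import comb


def enrichment_p_value(background: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided P(X >= hits) for X ~ Hypergeometric(background, annotated, sample)."""
    upper = min(annotated, sample)
    # Count the ways to draw at least `hits` annotated proteins in the sample.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(background, sample)


def fold_enrichment(background: int, annotated: int, sample: int, hits: int) -> float:
    """Observed fraction of annotated proteins divided by the background fraction."""
    return (hits / sample) / (annotated / background)


# Hypothetical counts for illustration only; none are taken from the manuscript.
BACKGROUND = 20_000  # proteins in the reference proteome
ANNOTATED = 400      # reference proteins carrying the GO term of interest
SAMPLE = 500         # proteins identified in the isolated fraction
HITS = 60            # identified proteins carrying the GO term

p_value = enrichment_p_value(BACKGROUND, ANNOTATED, SAMPLE, HITS)
fold = fold_enrichment(BACKGROUND, ANNOTATED, SAMPLE, HITS)
print(f"fold enrichment = {fold:.1f}, one-sided p = {p_value:.3g}")
```

      Exact integer arithmetic is used for the binomial coefficients so that very small tail probabilities are not lost to rounding before the final division.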

      1. D) The authors presented proteomic data for the peptides analyzed from the isolated cilia in the form of GO term analysis; however, they did not provide examples of different proteins enriched within their fractionation procedure, aside from Arl13b shown in the blot. Including a summary table with representative proteins identified in the isolated ciliary fraction, along with the relative abundance or percentage distribution of these proteins, would make the data more informative.

      We thank the reviewer for this valuable suggestion. As mentioned in the manuscript, our proteomic dataset includes numerous hallmark components of the cilium, such as 18 IFT proteins, 4 BBS proteins, and several Hedgehog pathway components (including SuFu and Arl13b), as well as axonemal (Tubulin, Kinesin, Dynein) and centrosomal proteins (Centrin, CEPs, γ-Tubulin, and associated factors). This composition demonstrates that the isolated fraction is highly enriched in bona fide ciliary components while retaining a small proportion of basal body proteins, which is expected given their physical continuity. Importantly, our dataset shows a 70% overlap with the ciliary proteome published by Ishikawa et al. and a 41% overlap with the CysCilia consortium's list of potential ciliary proteins, which supports both the specificity and reliability of our isolation procedure. Regarding the suggestion to present relative protein abundances, we would like to clarify that defining "relative to what" is challenging in this context. The stoichiometry of ciliary proteins is largely unknown, and relative abundance normalized to total protein content can be misleading, as ciliary structural and signaling components differ greatly in copy number and membrane association. For this reason, we chose to highlight in the text proteins such as BBS and IFTs, which are known to be of low abundance within the cilium; their detection supports the depth and specificity of our proteomic coverage. In addition, we performed an unbiased Gene Ontology (GO) Enrichment Analysis using the PANTHER database, which provides a systematic and quantitative overview of the biological processes and cellular components overrepresented in our dataset relative to the canine proteome. With regard to purity, this analysis was already discussed in the Discussion of the submitted manuscript. 
To further address the reviewer's comment, we will include in the revised manuscript a supplemental table listing representative ciliary proteins identified in our fraction, including those overlapping with the CysCilia (Gold and potential lists), CiliaCarta, and Ishikawa/Marshall proteomes. This addition should make the dataset more transparent and informative while preserving scientific rigor.
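
      An overlap percentage of the kind quoted above depends on which list is used as the denominator. The sketch below makes that explicit; the protein identifiers are toy placeholders, and the function name is an illustrative assumption rather than part of any published pipeline.

```python
def overlap_percentages(dataset, reference):
    """Return the shared count and the overlap expressed against each list."""
    dataset, reference = set(dataset), set(reference)
    if not dataset or not reference:
        raise ValueError("both protein lists must be non-empty")
    shared = dataset & reference
    return {
        "shared": len(shared),
        "pct_of_dataset": 100 * len(shared) / len(dataset),
        "pct_of_reference": 100 * len(shared) / len(reference),
    }


# Toy identifiers, for illustration only (not proteins from the study).
our_fraction = {"A", "B", "C", "D", "E"}
reference_proteome = {"C", "D", "E", "F", "G", "H", "I", "J", "K", "L"}

print(overlap_percentages(our_fraction, reference_proteome))
```

      Reporting both percentages alongside the raw shared count avoids ambiguity when the two lists differ greatly in size.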

      Figure 2.

      The authors represented the comparison of sterol content within the cilia versus whole cell (as cell membranes). Different organelles have very diverse cholesterol contents: for instance, the plasma membrane itself is around 50 mol% cholesterol, while organelles like the ER have barely any cholesterol. Thus, comparing these two samples and claiming a 2.5-fold increase in cholesterol levels is misleading. A more appropriate comparison would be between isolated primary cilia and isolated plasma membranes (procedures to isolate plasma membranes have been described previously, e.g., Naito et al., eLife 2019; Das et al., PNAS 2013). The absence of such controls makes it difficult to fully validate the reported magnitude of sterol enrichment in cilia relative to the cell surface.

      As already discussed above for reviewer 1, we would like to emphasize that our study did not aim to compare the cilium directly to the plasma membrane, nor did we claim that the comparison was in any way related to the plasma membrane. Our intent was to obtain a global overview of how the ciliary membrane differs from the average membrane environment within the cell, thereby highlighting features that are unique to the cilium as a signaling organelle. This approach provides valuable baseline information that complements, rather than replaces, future targeted comparisons with the plasma membrane. However, we concur that determining the sterol composition of the MDCK plasma membrane would provide valuable context and enable a comparison with the membrane continuous with the ciliary membrane. Hence, we are willing to try isolating plasma membrane in the same cellular contexts, and we thank the reviewer for the proposed literature.

      Also, because dibucaine was used here to isolate MDCK cilia, a control experiment to exclude possible effects of the dibucaine treatment on sterol biosynthesis would be helpful.

      Thank you for this comment. We will verify this point by using GC-MS to quantify the sterol content of whole MDCK cells with and without a 15-minute dibucaine treatment.

      Figure 3.

      Tamoxifen is a potent drug for nuclear hormone receptor activity and thus can independently influence various cellular processes. As several experiments in the later sections of the manuscript rely on tamoxifen treatment of cells, it is important that the authors include appropriate controls for tamoxifen treatment, to confirm that the observed effects do not stem from effects on nuclear hormone receptor activity. This would ensure that the observed effects can be confidently attributed to the experimental manipulation rather than to the intrinsic effects of tamoxifen.

      The reviewer is right: tamoxifen, like many drugs, has pleiotropic effects on different cellular processes. Aware of this possible issue, we turned to a genetic model, creating a CRISPR-Cas9-mediated knockout of EBP, the enzyme targeted by tamoxifen. We showed in Figure 5 that the results from tamoxifen-treated cells and CRISPR EBP cells were in accordance with one another, showing that, for Hedgehog signaling, the effect of tamoxifen recapitulates the effect of the enzyme KO.

      Figure 4. The authors present the results of spectroscopy studies to analyze generalized polarization (GP) of liposomes in vitro, but only processed data are shown, and the raw spectra are not provided. The authors need to present representative spectra to enable the readers to interpret the raw data from the experiments.

      This has been added to the new supplemental figure 1 and the corresponding figure legend (lines 898-904).

      Figure 5. B) The experiment shows Gli1 mRNA levels following treatment with inhibitors of cholesterol biosynthesis, but similar findings have already been reported previously (e.g., Cooper et al., Nature Genetics 2003; Blassberg et al., Hum Mol Genet 2016), and the present results do not provide a significant conceptual advance over those earlier studies.

      We thank the reviewer for this comment and for highlighting the importance of earlier studies on Hedgehog (Hh) signaling and cholesterol metabolism. While we fully agree that confirming and extending established findings has intrinsic scientific value, we respectfully disagree with the assertion that our work does not provide conceptual novelty.

      The seminal work by Cooper et al. (Nature Genetics, 2003) indeed laid the foundation for linking sterol metabolism to Hedgehog signaling, and we cite it as such. However, that study was conducted in chick embryos, a model that is relatively distant from mammalian systems and human pathophysiology. Moreover, their approach relied heavily on cyclodextrin-mediated cholesterol depletion, which is non-specific and extracts multiple sterols from membranes (discussed in this article lines 512-516). In contrast, our study employs pharmacological inhibitors targeting specific enzymes in the sterol biosynthetic pathway, thereby allowing us to modulate distinct steps and intermediates in a controlled and mechanistically informative manner. We also extend these analyses to patient-derived fibroblasts and CRISPR-engineered cells, providing direct human and genetic validation of the observed effects. Importantly, we complement these cellular studies with biochemical characterization of isolated ciliary membranes from MDCK cells, enabling a direct assessment of how specific sterol alterations affect ciliary composition and Hh pathway function - an angle not addressed in prior work.

      Regarding Blassberg et al. (Hum. Mol. Genet., 2016), we agree that part of our findings recapitulates their observations on SMO-related signaling defects, which we view as an important confirmation of reproducibility. However, their study primarily sought to distinguish whether Hh pathway impairment in SLOS results from 7-DHC accumulation or cholesterol depletion, concluding that cholesterol deficiency was the main cause. Our results expand on this by demonstrating that perturbations extend beyond these two sterols, and that additional intermediates in the biosynthetic pathway also impact ciliary membrane composition and signaling competence. Furthermore, our experiments using the constitutively active SmoM2 mutant show that Hh signaling defects are not restricted to SMO activation per se, revealing a broader disruption of the signaling machinery within the cilium.

      Finally, neither of the above studies examined CDPX2 patient-derived cells or the consequences of EBP enzyme deficiency on Hh signaling. Our finding that this pathway is altered in this genetic context represents, to our knowledge, a novel link between CDPX2 and Hedgehog pathway dysfunction.

      Taken together, our work builds upon and extends previous findings by integrating cell-type-specific, biochemical, and patient-based analyses to provide a more comprehensive and mechanistically detailed view of how sterol composition of the ciliary membrane regulates Hedgehog signaling.

      In addition, the authors analyze the effect of these inhibitors on SAG stimulation, but the experiment lacks the control for Gli mRNA levels in the absence of SAG treatment. Without this control, it is impossible to know where the baseline in the experiment is and how large the effects in question really are.

      Below, we provide the data expressed using the ΔΔCt method (NT + SAG normalized to NT - SAG), which more clearly illustrates the magnitude of the effect in question. As similar qPCR-based Hedgehog pathway activation assays in MEFs have been published previously (see Eguether et al., Dev. Cell 2014; Eguether et al., Mol. Biol. Cell 2018), our goal here was not to re-establish the assay itself but to highlight the comparative effects across experimental conditions. In addition, one of the datasets was obtained using a new batch of SAG, which exhibited stronger pathway activation across all conditions (visible as higher overall expression levels). To ensure valid statistical comparisons across experiments and to focus on relative rather than absolute activation, we therefore chose to present the data as fold change values, which provides a more robust and statistically consistent measure for cross-condition analysis.
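
      For reference, the ΔΔCt normalization mentioned above can be sketched as follows. This is a minimal illustration assuming the standard 2^(-ΔΔCt) form with roughly 100% amplification efficiency; the reference (housekeeping) gene and all Ct values are hypothetical and are not measurements from the study.

```python
def delta_delta_ct_fold_change(ct_target_treated, ct_ref_treated,
                               ct_target_control, ct_ref_control):
    """Fold change of a target gene (treated vs. control) by the ddCt method.

    Each Ct is first normalized to a reference gene within its own sample,
    then the treated sample is compared with the control sample.
    """
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_treated - delta_control
    return 2 ** -delta_delta


# Hypothetical Ct values: target gene with SAG (treated) vs. without SAG (control).
fold = delta_delta_ct_fold_change(
    ct_target_treated=24.0, ct_ref_treated=18.0,
    ct_target_control=28.0, ct_ref_control=18.0,
)
print(fold)  # 16.0
```

      Because every condition is expressed relative to its own unstimulated control, batch-to-batch differences in absolute activation (for example, a more potent batch of agonist) scale the numerator and denominator together only if the control is run within the same batch.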

      J-K) The data represented in these panels for SAG treatment, as the fraction of Smo-positive cilia and the Smo fluorescence intensity for the same sample, appear to be inconsistent between the two graphs. Under SAG treatment, EBP mutants show higher Smo fluorescence intensity, while Smo-positive cilia seem to be fewer than in the wild-type control cells. If the number of Smo+ cilia (quantified by eye) differs between conditions, shouldn't the quantification of Smo intensity within cilia show a similar difference?

      We thank the reviewer for this careful observation. The apparent discrepancy arises because the two panels quantify different parameters. In panel (j), we counted the percentage of cilia positive for SMO (i.e., cilia in which SMO was detected above background). In contrast, panel (k) reports the fluorescence intensity of SMO, but this measurement was performed only within the SMO-positive cilia identified in panel (j). This distinction has now been explicitly clarified in the figure legend, as also suggested by Reviewer 1.

      Taken together, these two analyses indicate that although fewer cilia display detectable SMO accumulation in the EBP mutant cells, the amount of SMO present within those cilia that do recruit it is comparable to wild-type levels (as reflected by the non-significant difference in fluorescence intensity). This interpretation helps explain the partial functional preservation of Hedgehog signaling in this condition and contrasts with cases such as AY9944 treatment, where both the number of SMO-positive cilia and the SMO intensity are reduced.
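
      The distinction between the two readouts can be illustrated with a short sketch. It assumes, hypothetically, that positivity is decided by a fixed background threshold (the original scoring may have been done differently), and the intensity values are toy numbers chosen only to show that the two metrics can diverge.

```python
from statistics import mean


def smo_cilia_summary(intensities, background_threshold):
    """Return (fraction of positive cilia, mean intensity within positive cilia)."""
    positive = [value for value in intensities if value > background_threshold]
    fraction_positive = len(positive) / len(intensities)
    mean_positive_intensity = mean(positive) if positive else None
    return fraction_positive, mean_positive_intensity


# Toy per-cilium intensities, for illustration only (arbitrary units).
wild_type = [100, 120, 140, 120, 10, 20, 110, 130]
mutant = [120, 10, 15, 120, 20, 5, 110, 130]
THRESHOLD = 50  # hypothetical background cutoff

print(smo_cilia_summary(wild_type, THRESHOLD))
print(smo_cilia_summary(mutant, THRESHOLD))
```

      In this toy case the mutant has fewer positive cilia (0.5 versus 0.75), yet the mean intensity within positive cilia is identical, which is the pattern described for panels (j) and (k).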

      1. I) The rationale for using SmoM2 in the analysis of cholesterol metabolism-related diseases such as SLOS and CDPX2 is unclear. The SmoM2 variant is primarily associated with cancer rather than cholesterol biosynthesis defects, and its relevance to either of these disorders is not immediately apparent.

      We thank the reviewer for this pertinent observation. We fully agree that SmoM2 was originally identified as an oncogenic mutation and is not directly associated with cholesterol biosynthesis disorders. However, our rationale for using this mutant was mechanistic rather than pathological. SmoM2 is a constitutively active form of SMO that triggers pathway activation independently of upstream components such as PTCH1 or ligand-mediated regulation.

      By using SmoM2, we aimed to determine whether the signaling defects observed under conditions that alter sterol metabolism (e.g., treatment with AY9944 or tamoxifen) occur upstream or downstream of SMO activation. The results demonstrate that, even when SMO is constitutively active, the Hedgehog pathway remains impaired under AY9944 treatment (and, to a lesser extent, with tamoxifen), indicating that these sterol perturbations disrupt the pathway beyond the level of SMO activation itself. In contrast, cells treated with simvastatin maintain normal pathway responsiveness, reinforcing the specificity of this effect.

      This experiment is therefore central to our study, as it reveals that sterol imbalance can hinder Hedgehog signaling even in the presence of an active SMO, providing new insight into how membrane composition influences downstream signaling competence.

      Minor corrections

      1. Line 385 is a bit confusing: it mentions that cilia were treated with AY9944. Do the authors mean that cells were treated with the drugs before isolation of cilia, or were the purified cilia actually treated with the drugs?

      Thank you, this has been modified in the revised manuscript.

      The authors should add proper label in Figure 2 panel b for the bars representing the cilia and cell membranes.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Panels in Figure S1 should be re-arranged according to the figure legend and figure reference in line 450.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Legend for the Figure S1b should be corrected as data sets in graph represents 7 points while technical replicates in legend shows 6 experimental values.

      Thank you, this has been modified in the revised manuscript.

      The labels for drug in Figure 3 and 5 should be corrected from tamoxifene to tamoxifen and simvastatine to simvastatin.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      Reviewer #2 (Significance (Required)):

      In the present study, the authors have designed a method to isolate the cilium from MDCK cells efficiently and then utilized this procedure in conjunction with mass spectrometry to systematically analyze the sterol composition of the ciliary membrane, which they then compare to the sterol composition of the cell body. By analyzing this sterol profiling, the authors claim that the cilium has a distinct sterol composition from the cell body, including higher levels of cholesterol and desmosterol but lower levels of 8-DHC and lathosterol. This manuscript further demonstrates that alteration of sterol composition within cilia modulates Hedgehog signaling. These results strengthen the link between dysregulated Hedgehog signaling and defects in cholesterol biosynthesis pathways, as observed in SLOS and CDPX2.

      While the ability to isolate primary cilia from cultured MDCK cells represents an important technical achievement, the central claim of the manuscript - that cilia have a different sterol composition from the cell body - is not adequately supported by the data, and more rigorous comparisons between the ciliary membrane and key organellar membranes (such as plasma membrane) are required to make this claim. Moreover, although the authors have repeatedly mentioned that the ciliary sterol composition is "tightly regulated", there is no evidence provided to support such a claim. At best, the data suggest that the cilium and cell body may differ in sterol composition (though even that remains uncertain), but no underlying regulatory mechanisms are demonstrated. In addition, much of the 2nd half of the paper represents a rehash of experiments with sterol biosynthesis inhibitors that have already been published in the literature, making the conceptual advance modest at best. Lastly, the link between CDPX2 and defective Hedgehog signaling is tenuous.

      We thank the reviewer for this detailed summary and for acknowledging the technical advance represented by our method for isolating primary cilia from MDCK cells. However, we respectfully disagree with several aspects of the reviewer's assessment of our work.

      As we elaborated in our responses to earlier comments, particularly regarding Figure 5, we disagree with the characterization of part of our study as a "rehash", a somewhat derogatory word, of previously published experiments. Our approach differs from earlier studies by relying on specific pharmacological modulation of defined enzymes in the sterol biosynthesis pathway, rather than using non-specific agents such as cyclodextrins, and by linking these manipulations to direct biochemical measurements of ciliary sterol composition. This strategy allows, for the first time, a targeted and physiologically relevant examination of how specific sterol perturbations affect Hedgehog signaling.

      Regarding our statement that ciliary sterol composition is "tightly regulated," we acknowledge that we have not yet explored the underlying molecular mechanisms of this regulation. Nevertheless, the experimental evidence supporting this statement lies in the variation of ciliary sterol composition across multiple treatments that strongly perturb cellular sterols. Despite broad cellular changes, the ciliary sterol profile remains very resilient for some parameters, an observation that, in our view, strongly supports the idea of a selective or regulated process maintaining ciliary sterol identity. This conclusion does not depend on comparison with other membrane compartments.

      We also respectfully disagree that the observed differences between cilia and the cell body (which is not equivalent to the plasma membrane) are "uncertain." The consistent enrichment in cholesterol and desmosterol, combined with the relative depletion in 8-DHC and lathosterol, was detected across independent replicates using robust lipidomic profiling and is statistically supported. These findings are, to our knowledge, the first quantitative demonstration of a sterol fingerprint specific to a mammalian cilium.

      Finally, while we agree that the mechanistic link between CDPX2 and defective Hedgehog signaling warrants further exploration, the data we present, combining pharmacological inhibition (tamoxifen), CRISPR-mediated EBP knockout, and SmoM2 activation assays, all consistently indicate a functional impairment of the Hedgehog pathway under EBP deficiency. This is further reinforced by clinical reports describing Hedgehog-related phenotypes in CDPX2 patients. We therefore believe that our work provides a solid experimental and conceptual basis for connecting EBP dysfunction to Hedgehog signaling defects.

      In summary, our study introduces a validated and reproducible method for mammalian cilia isolation, provides the first detailed sterol composition profile of primary cilia, and establishes a functional link between ciliary sterol imbalance and Hedgehog pathway modulation. We believe these findings represent a meaningful conceptual advance and a valuable resource for the field.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Lamaziere et al. describe an improved protocol for isolating primary cilia from MDCK cells for downstream lipidomics analysis. Using this protocol, they characterize sterol profile of MDCK cilia membrane under standard growth conditions and following pharmacological perturbations that are meant to mimic SLOS and CDPX2 disorders in humans. The authors then assess the impact of the same pharmacological manipulations on Shh pathway activity and validate their findings from these experiments using orthogonal genetic approaches. Major and minor concerns that require attention prior to publication are outlined below.

      We would like to thank the reviewer for their comments.

      Major 1. Since the extent of contamination of the cilia preps with non-cilia membranes is unclear, and variability between replicates is not reported, changes in cilia membrane sterol composition in response to pharmacological manipulations are somewhat difficult to interpret. Discussing reproducibility of cilia sterol composition between replicates (and including corresponding data) could alleviate these concerns to some extent.

      We thank the reviewer for this comment. We would like to clarify that variability between replicates is indeed reported throughout the manuscript. In Figures 2 and 3, all data are presented as mean ± SEM, as indicated in the figure legends. Specifically, the data in Figure 2 are derived from six independent experiments, reflecting the central dataset used for comparative analyses, while the data in Figure 3 are based on three independent experiments.

      We also note that the overall variability between replicates is low, further supporting the reproducibility of our ciliary sterol composition measurements. This consistency across independent biological replicates provides confidence that the differences observed between cilia and the cell body are robust and not due to stochastic contamination or technical variation.
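
      For readers less familiar with the summary statistic used in the figures, mean ± SEM can be computed as sketched below. The six input values are hypothetical stand-ins for six independent experiments and are not data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, standard error of the mean) for independent replicates."""
    if len(values) < 2:
        raise ValueError("at least two replicates are needed to estimate the SEM")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical measurements from six independent experiments (arbitrary units).
replicates = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0]
average, sem = mean_sem(replicates)
print(f"{average:.2f} ± {sem:.2f}")
```

      Because the SEM shrinks with the square root of the number of replicates, values based on six experiments and on three experiments are not directly comparable as measures of spread; reporting n alongside each value keeps this visible.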

      2. An abundant non-ciliary membrane protein (rather than GAPDH) may be a more appropriate loading control in Fig. 1C.

      This is a valuable comment and we will find a non-ciliary membrane protein to complement this experiment.

      3. Fig. 2b - which bar corresponds to cells and which one to cilia? What do numbers inside bars represent? Please label accordingly.

      We apologize for the oversight. The figures initially submitted with the manuscript inadvertently included some earlier versions, which explains several of the discrepancies noted by the reviewers. This issue has been corrected in the revised submission, and all figures have now been updated to reflect the finalized data.

      4. Fig. 3b-d, right panels - please define what numbers inside bars represent.

      Thank you, this was done in the revised manuscript. The numbers report absolute quantification.

      5. The font in Figs 2, 3, and 4 is very small and difficult to read. Please make the font and/or panels bigger to improve readability.

      We did our best to enlarge font despite space limitations, but we are willing to work with editorial staff to improve readability as suggested.

      6. It would help to have a diagram of the key steps in the cholesterol synthesis pathway for reference early in the paper rather than in figure 3.

      We thank the reviewer for this comment, but we do not see why this would be helpful, as sterol modulators targeting the pathway's enzymes are only used in Figure 3. We are open to discussing with the editorial staff whether to move the diagram up to Figure 2, if they feel this is needed.

      7. The authors need to discuss why/how global inhibition of enzymes (e.g. via AY9944 treatment) in a cell could cause reduction in cholesterol levels only in the cilium and not in other cell membranes (see also point 1). Yet, tamoxifen treatment lowers cholesterol across the board.

      We thank the reviewer for these insightful comments. Regarding the modest overall effect of simvastatin on cholesterol levels, we would like to note that MDCK cells are an immortalized epithelial cell line with high metabolic plasticity. Such cancer-like cell types are known to exhibit enhanced de novo lipogenesis, particularly under culture conditions with ample glucose availability. This compensatory lipid biosynthesis can partially counterbalance pharmacological inhibition of the cholesterol biosynthetic pathway. Because simvastatin acts upstream in the pathway (at HMG-CoA reductase), its inhibition primarily reduces early intermediates rather than fully depleting end-product cholesterol, explaining the relatively mild changes observed in total cholesterol content. This has been added in a new paragraph in the revised manuscript (lines 371-378).

      8. Fig. 5c, g, and j - statistical analyses are missing and need to be added in support of conclusions drawn in the text of the manuscript.

      Thank you, this has been done in the revised manuscript.

      9. The decrease in the fraction of Smo+ cilia observed in EBP KO cells is mild (panel j, no statistics), and there is possibly a clone-specific effect here as well (statistical analysis is needed to determine if EBP139 is indeed different from WT and whether EBP139 and 141 are different from each other). Similarly, Smo fluorescence intensity after SAG treatment (panel k) is the same in WT and EBP KO cells, while there is a marked difference in intraciliary Smo intensity after tamoxifen treatment. The authors' conclusion "...we were able to show that results with human cells aligned with our tamoxifen experiments" (line 436) should be modified to more accurately reflect the presented data. Ditto conclusions on lines 440-442, 530-531. In fact, it is the lack of Hh phenotypes in CDPX2 patients that is consistent with the EBP KO data presented in the paper.

      We thank the reviewer for this detailed comment. We have now performed the requested statistical analyses and incorporated them into the revised manuscript.

      The new analyses confirm that both EBP139 and EBP141 CRISPR KO clones show a statistically significant reduction in the fraction of Smo⁺ cilia compared to WT cells. They also reveal that the two clones differ significantly from each other, consistent with the expected clonal variability inherent to independently derived CRISPR lines.

      Despite this variability, several lines of evidence support our conclusion that the EBP KO phenotypes align with the effects observed after tamoxifen treatment:

      1- Directionally consistent reduction in Smo⁺ cilia:

      Although the magnitude of the decrease differs between clones, both clones display a significant reduction compared to WT, paralleling the reduction observed in tamoxifen-treated cells. This directional consistency is the key point for comparing pharmacological and genetic perturbations.

      2- Converging evidence from SmoM2 experiments:

      Tamoxifen treatment also reduces pathway output in the context of SmoM2 overexpression. This supports the interpretation that both EBP inhibition (tamoxifen) and EBP loss (CRISPR KO) impair Hedgehog signaling at the level of ciliary function, albeit more mildly than AY9944/SLOS-like perturbations.

      3- Interpretation of Smo intensity (panel k):

      As clarified in the revised text, the fluorescence intensities in panel k correspond only to cilia that are Smo-positive. The absence of a difference in intensity therefore does not contradict the observed reduction in the number of Smo⁺ cilia. Rather, it explains why the phenotype is milder than that observed for SLOS/AY9944: when Smo is able to enter the cilium, its enrichment level is comparable to WT.

      4- Clinical relevance for CDPX2:

      While Hedgehog-related phenotypes in CDPX2 patients may be milder or under-reported, several documented features, such as polydactyly (10% of cases), as well as syndactyly and clubfoot, are classically associated with ciliary/Hedgehog signaling defects. This clinical pattern is consistent with the milder yet detectable defects we observe in EBP KO cells.

      Minor • Line 310: 'intraflagellar' rather than 'intraciliary' transport particle B is a more conventional term

      We agree that intraflagellar is more conventional than intraciliary, but in this case, this is how the GO term is labeled in the database. In our opinion, it should stay as is.

      • Fig. 2c - typos in the color key, is grey meant to be "cells" and blue "cilia"? Individual panels are not referenced in the text

      This panel has been removed following comments from Reviewers 1 and 3, who found it misleading.

      • Lines 357-358: "Notably, AY9944 treatment led to a greater reduction in cholesterol content as well as a greater increase in 7-DHC and 8-DHC in cilia than in the other cell membranes" - the authors need to support this statement with appropriate statistical analysis

      We respectfully believe there may be a misunderstanding in the reviewer's concern. In all cases, our comparisons are made between treated vs. untreated conditions within each compartment (cell bulk vs. ciliary membrane), and the statistical significance of these differences is already reported as determined by a Mann-Whitney test. In every case, the changes observed are greater in cilia than in the cell body. The statement in the manuscript simply summarizes this quantitative observation. However, if the reviewer feels that an additional statistical test directly comparing the magnitude of the two compartment-specific changes would strengthen the claim, we are willing to include this analysis. Alternatively, if preferred, we can remove the sentence entirely, as the comparison is already clearly visible in Figure 3b.
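
      As an illustration of the kind of additional test offered above, a two-sided exact Mann-Whitney comparison between two small groups can be sketched as follows. The choice of inputs (for example, per-replicate changes in cilia versus in the cell body) is an assumption made for illustration, the values are hypothetical, and the sketch handles only small samples without tied values.

```python
from itertools import combinations


def mann_whitney_exact(x, y):
    """Return (U statistic for x, exact two-sided p-value); no ties allowed."""
    combined = sorted(list(x) + list(y))
    if len(set(combined)) != len(combined):
        raise ValueError("this exact enumeration assumes no tied values")
    n1, n2 = len(x), len(y)
    rank = {value: index + 1 for index, value in enumerate(combined)}

    def u_from_ranks(ranks):
        return sum(ranks) - n1 * (n1 + 1) / 2

    u_observed = u_from_ranks([rank[value] for value in x])
    center = n1 * n2 / 2  # expected U under the null hypothesis
    # Enumerate every way of assigning n1 of the ranks to the first group.
    all_u = [u_from_ranks(c) for c in combinations(range(1, n1 + n2 + 1), n1)]
    extreme = sum(1 for u in all_u if abs(u - center) >= abs(u_observed - center))
    return u_observed, extreme / len(all_u)


# Hypothetical per-replicate changes (arbitrary units), for illustration only.
change_in_cilia = [1.0, 2.0, 3.0, 4.0]
change_in_cell_body = [5.0, 6.0, 7.0, 8.0]
print(mann_whitney_exact(change_in_cilia, change_in_cell_body))
```

      With complete separation of two groups of four, only 2 of the 70 possible rank assignments are as extreme as the observed one, which sets the smallest p-value attainable at that sample size.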

      • Line 473 - unclear what is meant by "olfactory cilia are mainly sensory and not primary". Primary cilia are sensory.

      We agree that primary cilia are sensory, but they still differ from cilia belonging to sensory epithelia, such as those of retinal photoreceptors or olfactory cilia. Nevertheless, this statement was modified in the revised manuscript.

      • Line 551: 'data not shown'. Please include the data that you would like to discuss or remove discussion of these data from the manuscript.

      The data are not shown because there is nothing to show: as discussed in that sentence, use of the cholesterol probe resulted in the disappearance of primary cilia altogether. We are willing to work with editorial staff to find a better way of expressing this idea.

      Reviewer #3 (Significance (Required)):

      Overall, the manuscript expands our knowledge of cilia membrane composition and reports an interesting link between SLOS and Shh signaling defects, which could at least in part explain SLOS patients' symptoms. The findings reported in the manuscript could be of interest to a broad audience of cell biologists and geneticists.

      We would like to thank the reviewer for their recognition of the importance of this work.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, participants completed two different tasks: a perceptual choice task, in which they compared the sizes of pairs of items, and a value-difference task, in which they identified the higher-value option among pairs of items, with the two tasks involving the same stimuli. Based on previous fMRI research, the authors sought to determine whether the superior frontal sulcus (SFS) is involved in both perceptual and value-based decisions or just one or the other. Initial fMRI analyses were devised to isolate brain regions that were activated for both types of choices and also regions that were unique to each. Transcranial magnetic stimulation was applied to the SFS in between fMRI sessions and it was found to lead to a significant decrease in accuracy and RT on the perceptual choice task but only a decrease in RT on the value-difference task. Hierarchical drift-diffusion modelling of the data indicated that the TMS had led to a lowering of decision boundaries in the perceptual task and a lowering of non-decision times on the value-based task. Additional analyses show that SFS covaries with model-derived estimates of cumulative evidence and that this relationship is weakened by TMS.

      Strengths:

      The paper has many strengths including the rigorous multi-pronged approach of causal manipulation, fMRI and computational modelling which offers a fresh perspective on the neural drivers of decision making. Some additional strengths include the careful paradigm design which ensured that the two types of tasks were matched for their perceptual content while orthogonalizing trial-to-trial variations in choice difficulty. The paper also lays out a number of specific hypotheses at the outset regarding the behavioural outcomes that are tied to decision model parameters and are well justified.

      Weaknesses:

      (1.1) Unless I have missed it, the SFS does not actually appear in the list of brain areas significantly activated by the perceptual and value tasks in Supplementary Tables 1 and 2. Its presence or absence from the list of significant activations is not mentioned by the authors when outlining these results in the main text. What are we to make of the fact that it is not showing significant activation in these initial analyses?

      You are right that the left SFS does not appear in our initial task-level contrasts. Those first analyses were deliberately agnostic to evidence accumulation (i.e., average BOLD by task, irrespective of trial-by-trial evidence). Consistent with prior work, SFS emerges only when we model the parametric variation in accumulated perceptual evidence.

      Accordingly, we ran a second-level GLM that included trial-wise accumulated evidence (aE) as a parametric modulator. In that analysis, the left SFS shows significant aE-related activity specifically during perceptual decisions, but not during value-based decisions (SVC in a 10-mm sphere around x = −24, y = 24, z = 36).

      To avoid confusion, we now:

      (i) explicitly separate and label the two analysis levels in the Results; (ii) state up front that SFS is not expected to appear in the task-average contrast; and (iii) add a short pointer that SFS appears once aE is included as a parametric modulator. We also edited Methods to spell out precisely how aE is constructed and entered into GLM2. This should make the logic of the two-stage analysis clearer and aligns the manuscript with the literature where SFS typically emerges only in parametric evidence models.

      (1.2) The value difference task also requires identification of the stimuli, and therefore perceptual decision-making. In light of this, the initial fMRI analyses do not seem terribly informative for the present purposes as areas that are activated for both types of tasks could conceivably be specifically supporting perceptual decision-making only. I would have thought brain areas that are playing a particular role in evidence accumulation would be best identified based on whether their BOLD response scaled with evidence strength in each condition which would make it more likely that areas particular to each type of choice can be identified. The rationale for the authors' approach could be better justified.

      We agree that both tasks require early sensory identification of the items, but the decision-relevant evidence differs by design (size difference vs. value difference), and our modelling is targeted at the evidence integration stage rather than initial identification.

      To address your concern empirically, we: (i) added session-wise plots of mean RTs showing a general speed-up across the experiment (now in the Supplement); (ii) fit a hierarchical DDM to jointly explain accuracy and RT. The DDM dissociates decision time (evidence integration) from non-decision time (encoding/response execution).

      After cTBS, perceptual decisions show a selective reduction of the decision boundary (lower accuracy, faster RTs; no drift-rate change), whereas value-based decisions show no change to boundary/drift but a decrease in non-decision time, consistent with faster sensorimotor processing or task familiarity. Thus, the TMS effect in SFS is specific to the criterion for perceptual evidence accumulation, while the RT speed-up in the value task reflects decision-irrelevant processes. We now state this explicitly in the Results and add the RT-by-run figure for transparency.

      (1.2.1) The value difference task also requires identification of the stimuli, and therefore perceptual decision-making. In light of this, the initial fMRI analyses do not seem terribly informative for the present purposes as areas that are activated for both types of tasks could conceivably be specifically supporting perceptual decision-making only.

      Thank you for prompting this clarification.

      The key point is what changes with cTBS. If SFS supported generic identification, we would expect parallel cTBS effects on drift rate (or boundary) in both tasks. Instead, we find: (a) boundary decreases selectively in perceptual decisions (consistent with SFS setting the amount of perceptual evidence required), and (b) non-decision time decreases selectively in the value task (consistent with speed-ups in encoding/response stages). Moreover, trial-by-trial SFS BOLD predicts perceptual accuracy (controlling for evidence), and neural-DDM model comparison shows SFS activity modulates boundary, not drift, during perceptual choices.

      Together, these converging behavioral, computational, and neural results argue that SFS specifically supports the criterion for perceptual evidence accumulation rather than generic visual identification.

      (1.2.2) I would have thought brain areas that are playing a particular role in evidence accumulation would be best identified based on whether their BOLD response scaled with evidence strength in each condition which would make it more likely that areas particular to each type of choice can be identified. The rationale for the authors' approach could be better justified.

      We now more explicitly justify the two-level fMRI approach. The task-average contrast addresses which networks are generally more engaged by each domain (e.g., posterior parietal for PDM; vmPFC/PCC for VDM), given identical stimuli and motor outputs. This complements, but does not substitute for, the parametric evidence analysis, which is where one expects accumulation-related regions such as SFS to emerge. We added text clarifying that the first analysis establishes domain-specific recruitment at the task level, whereas the second isolates evidence-dependent signals (aE) and reveals that left SFS tracks accumulated evidence only for perceptual choices. We also added explicit references to the literature using similar two-step logic and noted that SFS typically appears only in parametric evidence models.

      (1.3) TMS led to reductions in RT in the value-difference as well as the perceptual choice task. DDM modelling indicated that in the case of the value task, the effect was attributable to reduced non-decision time which the authors attribute to task learning. The reasoning here is a little unclear.

      (1.3.1) Comment: If task learning is the cause, then why are similar non-decision time effects not observed in the perceptual choice task?

      Great point. The DDM addresses exactly this: RT comprises decision time (DT) plus non-decision time (nDT). With cTBS, PDM shows reduced DT (via a lower boundary) but stable nDT; VDM shows reduced nDT with no change to boundary/drift. Hence, the superficially similar RT speed-ups in both tasks are explained by different latent processes: decision-relevant in PDM (lower criterion → faster decisions, lower accuracy) and decision-irrelevant in VDM (faster encoding/response). We added explicit language and a supplemental figure showing RT across runs, and we clarified in the text that only the PDM speed-up reflects a change to evidence integration.
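      The decomposition described above (RT = decision time + non-decision time) can be illustrated with a small simulation. This is only a sketch and not the hierarchical model fitted in the study: the function name `simulate_ddm` and all parameter values (drift, boundary, non-decision time, noise, step size) are hypothetical choices made for illustration. It shows that lowering the boundary shortens RTs and lowers accuracy, whereas lowering non-decision time shortens RTs and leaves accuracy untouched.

```python
import numpy as np

def simulate_ddm(drift, boundary, ndt, n_trials=2000, dt=0.001, noise=1.0, seed=0):
    """Simulate a symmetric two-boundary drift-diffusion process.

    Evidence starts at 0 and diffuses until it reaches +boundary/2 (correct)
    or -boundary/2 (error). RT = decision time + non-decision time (ndt).
    Returns (accuracy, mean RT in seconds). All values are illustrative.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)                   # accumulated evidence per trial
    t = np.zeros(n_trials)                   # decision time per trial
    active = np.ones(n_trials, dtype=bool)   # trials that have not yet hit a bound
    half = boundary / 2.0
    while active.any():
        n = int(active.sum())
        x[active] += drift * dt + noise * np.sqrt(dt) * rng.standard_normal(n)
        t[active] += dt
        active &= np.abs(x) < half           # finished trials stay finished
    correct = x >= half
    return float(correct.mean()), float((t + ndt).mean())

# Hypothetical parameter sets: baseline, lower boundary, lower non-decision time.
acc_base, rt_base = simulate_ddm(drift=1.0, boundary=2.0, ndt=0.3)
acc_lowb, rt_lowb = simulate_ddm(drift=1.0, boundary=1.2, ndt=0.3)
acc_lown, rt_lown = simulate_ddm(drift=1.0, boundary=2.0, ndt=0.2)

print(f"baseline:       accuracy={acc_base:.3f}, mean RT={rt_base:.3f}s")
print(f"lower boundary: accuracy={acc_lowb:.3f}, mean RT={rt_lowb:.3f}s")
print(f"lower nDT:      accuracy={acc_lown:.3f}, mean RT={rt_lown:.3f}s")
```

      Because the same random seed is reused, the lower-nDT condition reproduces the baseline trajectories exactly, so its accuracy is identical and its mean RT is shifted by the nDT difference alone.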

      (1.3.2) Given that the value-task actually requires perceptual decision-making, is it not possible that SFS disruption impacted the speed with which the items could be identified, hence delaying the onset of the value-comparison choice?

      We agree there is a brief perceptual encoding phase at the start of both tasks. If cTBS impaired visual identification per se, we would expect longer nDT in both tasks or a decrease in drift rate. Instead, nDT decreases in the value task and is unchanged in the perceptual task; drift is unchanged in both. Thus, cTBS over SFS does not slow identification; rather, it lowers the criterion for perceptual accumulation (PDM) and, separately, we observe faster non-decision components in VDM (likely familiarity or motor preparation). We added a clarifying sentence noting that item identification was easy and highly overlearned (static, large food pictures), and we cite that nDT is the appropriate locus for identification effects in the DDM framework; our data do not show the pattern expected of impaired identification.

      (1.4) The sample size is relatively small. The authors state that 20 subjects is 'in the acceptable range' but it is not clear what is meant by this.

      We have clarified what we mean and provided citations. The sample (n = 20) matches or exceeds many prior causal TMS/fMRI studies targeting perceptual decision circuitry (e.g., Philiastides et al., 2011; Rahnev et al., 2016; Jackson et al., 2021; van der Plas et al., 2021; Murd et al., 2021). Importantly, we (i) use within-subject, pre/post cTBS differences-in-differences with matched tasks; (ii) estimate hierarchical models that borrow strength across participants; and (iii) converge across behavior, latent parameters, regional BOLD, and connectivity. We now replace the vague phrase with a concrete statement and references, and we report precision (HDIs/SEs) for all main effects.

      Reviewer #2 (Public Review):

      Summary:

      The authors set out to test whether a TMS-induced reduction in excitability of the left Superior Frontal Sulcus influenced evidence integration in perceptual and value-based decisions. They directly compared behaviour - including fits to a computational decision process model - and fMRI pre and post-TMS in one of each type of decision-making task. Their goal was to test domain-specific theories of the prefrontal cortex by examining whether the proposed role of the SFS in evidence integration was selective for perceptual but not value-based evidence.

      Strengths:

      The paper presents multiple credible sources of evidence for the role of the left SFS in perceptual decision-making, finding similar mechanisms to prior literature and a nuanced discussion of where they diverge from prior findings. The value-based and perceptual decision-making tasks were carefully matched in terms of stimulus display and motor response, making their comparison credible.

      Weaknesses:

      (2.1) More information on the task and details of the behavioural modelling would be helpful for interpreting the results.

      Thank you for this request for clarity. In the revision we explicitly state, up front, how the two tasks differ and how the modelling maps onto those differences.

      (1) Task separability and “evidence.” We now define task-relevant evidence as size difference (SD) for perceptual decisions (PDM) and value difference (VD) for value-based decisions (VDM). Stimuli and motor mappings are identical across tasks; only the evidence to be integrated changes.

      (2) Behavioural separability that mirrors task design. As reported, mixed-effects regressions show PDM accuracy increases with SD (β=0.560, p<0.001) but not VD (β=0.023, p=0.178), and PDM RTs shorten with SD (β=−0.057, p<0.001) but not VD (β=0.002, p=0.281). Conversely, VDM accuracy increases with VD (β=0.249, p<0.001) but not SD (β=0.005, p=0.826), and VDM RTs shorten with VD (β=−0.016, p=0.011) but not SD (β=−0.003, p=0.419).

      (3) How the HDDM reflects this. The hierarchical DDM fits the joint accuracy–RT distributions with task-specific evidence (SD or VD) as the predictor of drift. The model separates decision time from non-decision time (nDT), which is essential for interpreting the different RT patterns across tasks without assuming differences in the accumulation process when accuracy is unchanged.

      These clarifications are integrated in the Methods (Experimental paradigm; HDDM) and in Results (“Behaviour: validity of task-relevant pre-requisites” and “Modelling: faster RTs during value-based decisions is related to non-decision-related sensorimotor processes”).

      (2.2) The evidence for a choice and 'accuracy' of that choice in both tasks was determined by a rating task that was done in advance of the main testing blocks (twice for each stimulus). For the perceptual decisions, this involved asking participants to quantify a size metric for the stimuli, but the veracity of these ratings was not reported, nor was the consistency of the value-based ones. It is my understanding that the size ratings were used to define the amount of perceptual evidence in a trial, rather than the true size differences, and without seeing more data the reliability of this approach is unclear. More concerning was the effect of 'evidence level' on behaviour in the value-based task (Figure 3a). While the 'proportion correct' increases monotonically with the evidence level for the perceptual decisions, for the value-based task it increases from the lowest evidence level and then appears to plateau at just above 80%. This difference in behaviour between the two tasks brings into question the validity of the DDM which is used to fit the data, which assumes that the drift rate increases linearly in proportion to the level of evidence.

      We thank the reviewer for raising these concerns, and we address each of them point by point:

      (2.2.1) Comment: It is my understanding that the size ratings were used to define the amount of perceptual evidence in a trial, rather than the true size differences, and without seeing more data the reliability of this approach is unclear.

      That is correct—we used participants’ area/size ratings to construct perceptual evidence (SD).

      To validate this choice, we compared those ratings against an objective image-based size measure (proportion of non-black pixels within the bounding box). As shown in Author response image 2, perceptual size ratings are highly correlated with objective size across participants (Pearson r values predominantly ≈0.8 or higher; all p<0.001). Importantly, value ratings do not correlate with objective size (Author response image 1), confirming that the two rating scales capture distinct constructs. These checks support using participants’ size ratings as the participant-specific ground truth for defining SD in the PDM trials.
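      A minimal sketch of how such a validation could be computed is given below. It is not the authors' code: the helper names `objective_size` and `pearson_r`, the black threshold of 0, and the toy images and ratings are all assumptions made for illustration.

```python
import numpy as np

def objective_size(image, black_threshold=0):
    """Proportion of pixels in the item box that are not black.

    `image` is assumed to cover exactly the item's bounding box, either as a
    2-D grayscale array or a 3-D (height, width, channels) array. A pixel
    counts as non-black if any channel exceeds `black_threshold` (assumed 0).
    """
    img = np.asarray(image)
    if img.ndim == 3:
        img = img.max(axis=-1)
    return float((img > black_threshold).mean())

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Toy items: black boxes with a bright square of increasing side length.
sizes = []
for side in (2, 4, 6, 8):
    box = np.zeros((10, 10))
    box[:side, :side] = 255
    sizes.append(objective_size(box))

# Hypothetical 0-100 size ratings from one participant for the same four items.
ratings = [5, 20, 40, 60]
r_size = pearson_r(sizes, ratings)
print(sizes, round(r_size, 3))
```

      The same `pearson_r` call applied to value ratings instead of size ratings would give the per-participant check that value and objective size are unrelated.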

      Author response image 1.

      Objective size and value ratings are unrelated. Scatterplots show, for each participant, the correlation between objective image size (x-axis; proportion of non-black pixels within the item box) and value-based ratings (y-axis; 0–100 scale). Each dot is one food item (ratings averaged over the two value-rating repetitions). Across participants, value ratings do not track objective size, confirming that value and size are distinct constructs.

      Author response image 2.

      Perceptual size ratings closely track objective size. Scatterplots show, for each participant, the correlation between objective image size (x-axis) and perceptual area/size ratings (y-axis; 0–100 scale). Each dot is one food item (ratings averaged over the two perceptual ratings). Perceptual ratings are strongly correlated with objective size for nearly all participants (see main text), validating the use of these ratings to construct size-difference evidence (SD).

      (2.2.2) More concerning was the effect of 'evidence level' on behaviour in the value-based task (Figure 3a). While the 'proportion correct' increases monotonically with the evidence level for the perceptual decisions, for the value-based task it increases from the lowest evidence level and then appears to plateau at just above 80%. This difference in behaviour between the two tasks brings into question the validity of the DDM which is used to fit the data, which assumes that the drift rate increases linearly in proportion to the level of evidence.

      We agree that accuracy appears to asymptote in VDM, but the DDM fits indicate that the drift rate still increases monotonically with evidence in both tasks. In Supplementary figure 11, drift (δ) rises across the four evidence levels for PDM and for VDM (panels showing all data and pre/post-TMS). The apparent plateau in proportion correct during VDM reflects higher choice variability at stronger preference differences, not a failure of the drift–evidence mapping. Crucially, the model captures both the accuracy patterns and the RT distributions (see posterior predictive checks in Supplementary figures 11-16), indicating that a monotonic evidence–drift relation is sufficient to account for the data in each task.

      Author response image 3.

      HDDM parameters by evidence level. Group-level posterior means (± posterior SD) for drift (δ), boundary (α), and non-decision time (τ) across the four evidence levels, shown (a) collapsed across TMS sessions, (b) for PDM (blue) pre- vs post-TMS (light vs dark), and (c) for VDM (orange) pre- vs post-TMS. Crucially, drift increases monotonically with evidence in both tasks, while TMS selectively lowers α in PDM and reduces τ in VDM (see Supplementary Tables for numerical estimates).

      (2.3) The paper provides very little information on the model fits (no parameter estimates, goodness of fit values or simulated behavioural predictions). The paper finds that TMS reduced the decision bound for perceptual decisions but only affected non-decision time for value-based decisions. It would aid the interpretation of this finding if the relative reliability of the fits for the two tasks was presented.

      We appreciate the suggestion and have made the quantitative fit information explicit:

      (1) Parameter estimates. Group-level means/SDs for drift (δ), boundary (α), and nDT (τ) are reported for PDM and VDM overall, by evidence level, pre- vs post-TMS, and per subject (see Supplementary Tables 8-11).

      (2) Goodness of fit and predictive adequacy. DIC values accompany each fit in the tables. Posterior predictive checks demonstrate close correspondence between simulated and observed accuracy and RT distributions overall, by evidence level, and across subjects (Supplementary Figures 11-16).

      Together, these materials document that the HDDM provides reliable fits in both tasks and accurately recovers the qualitative and quantitative patterns that underlie our inferences (reduced α for PDM only; selective τ reduction in VDM).

      (2.4) Behaviourally, the perceptual task produced decreased response times and accuracy post-TMS, consistent with a reduced bound and consistent with some prior literature. Based on the results of the computational modelling, the authors conclude that RT differences in the value-based task are due to task-related learning, while those in the perceptual task are 'decision relevant'. It is not fully clear why there would be such significantly greater task-related learning in the value-based task relative to the perceptual one. And if such learning is occurring, could it potentially also tend to increase the consistency of choices, thereby counteracting any possible TMS-induced reduction of consistency?

      Thank you for pointing out the need for a clearer framing. We have removed the speculative label “task-related learning” and now describe the pattern strictly in terms of the HDDM decomposition and neural results already reported:

      (1) VDM: Post-TMS RTs are faster while accuracy is unchanged. The HDDM attributes this to a selective reduction in non-decision time (τ), with no change in decision-relevant parameters (α, δ) for VDM (see Supplementary Figure 11 and Supplementary Tables). Consistent with this, left SFS BOLD is not reduced for VDM, and trialwise SFS activity does not predict VDM accuracy—both observations argue against a change in VDM decision formation within left SFS.

      (2) PDM: Post-TMS accuracy decreases and RTs shorten, which the HDDM captures as a lower decision boundary (α) with no change in drift (δ). Here, left SFS BOLD scales with accumulated evidence and decreases post-TMS, and trialwise SFS activity predicts PDM accuracy, all consistent with a decision-relevant effect in PDM.

      Regarding the possibility that faster VDM RTs should increase choice consistency: empirically, consistency did not change in VDM, and the HDDM finds no decision-parameter shifts there. Thus, there is no hidden counteracting increase in VDM accuracy that could mask a TMS effect—the absence of a VDM accuracy change is itself informative and aligns with the modelling and fMRI.

      Reviewer #3 (Public Review):

      Summary:

      Garcia et al., investigated whether the human left superior frontal sulcus (SFS) is involved in integrating evidence for decisions across either perceptual and/or value-based decision-making. Specifically, they had 20 participants perform two decision-making tasks (with matched stimuli and motor responses) in an fMRI scanner both before and after they received continuous theta burst transcranial magnetic stimulation (TMS) of the left SFS. The stimulation, thought to decrease neural activity in the targeted region, led to reduced accuracy on the perceptual decision task only. The pattern of results across both model-free and model-based (Drift diffusion model) behavioural and fMRI analyses suggests that the left SFS plays a critical role in perceptual decisions only, with no equivalent effects found for value-based decisions. The DDM-based analyses revealed that the role of the left SFS in perceptual evidence accumulation is likely to be one of decision boundary setting. Hence the authors conclude that the left SFS plays a domain-specific causal role in the accumulation of evidence for perceptual decisions. These results are likely to add importance to the literature regarding the neural correlates of decision-making.

      Strengths:

      The use of TMS strengthens the evidence for the left SFS playing a causal role in the evidence accumulation process. By combining TMS with fMRI and advanced computational modelling of behaviour, the authors go beyond previous correlational studies in the field and provide converging behavioural, computational, and neural evidence of the specific role that the left SFS may play.

      Sophisticated and rigorous analysis approaches are used throughout.

      Weaknesses:

      (3.1) Though the stimuli and motor responses were equalised between the perception and value-based decision tasks, reaction times (according to Figure 1) and potential difficulty (Figure 2) were not matched. Hence, differences in task difficulty might represent an alternative explanation for the effects being specific to the perception task rather than domain-specificity per se.

      We agree that RTs cannot be matched a priori, and we did not intend them to be. Instead, we equated the inputs to the decision process and verified that each task relied exclusively on its task-relevant evidence. As reported in Results—Behaviour: validity of task-relevant pre-requisites (Fig. 1b–c), accuracy and RTs vary monotonically with the appropriate evidence regressor (SD for PDM; VD for VDM), with no effect of the task-irrelevant regressor. This separability check addresses differences in baseline RTs by showing that, for both tasks, behaviour tracks evidence as designed.

      To rule out a generic difficulty account of the TMS effect, we relied on the within-subject differences-in-differences (DID) framework described in Methods (Differences-in-differences). The key Task × TMS interaction compares the pre→post change in PDM with the pre→post change in VDM while controlling for trialwise evidence and RT covariates. Any time-on-task or unspecific difficulty drift shared by both tasks is subtracted out by this contrast. Using this specification, TMS selectively reduced accuracy for PDM but not VDM (Fig. 3a; Supplementary Fig. 2a,c; Supplementary Tables 5–7).
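      The logic of the differences-in-differences contrast can be sketched in a few lines. This is the unadjusted version only: the analysis described above additionally controls for trialwise evidence and RT covariates within a regression, which is not reproduced here. The function name `did_estimate` and the toy data are hypothetical.

```python
import numpy as np

def did_estimate(pdm_pre, pdm_post, vdm_pre, vdm_post):
    """Unadjusted differences-in-differences estimate.

    (post - pre change in PDM) minus (post - pre change in VDM).
    Any pre-to-post shift shared by both tasks cancels in the subtraction.
    """
    pdm_change = np.mean(pdm_post) - np.mean(pdm_pre)
    vdm_change = np.mean(vdm_post) - np.mean(vdm_pre)
    return float(pdm_change - vdm_change)

# Toy accuracy data (1 = correct, 0 = error); values are made up.
pdm_pre, pdm_post = [1, 1, 1, 0], [1, 0, 1, 0]
vdm_pre, vdm_post = [1, 1, 0, 1], [1, 1, 0, 1]
did_accuracy = did_estimate(pdm_pre, pdm_post, vdm_pre, vdm_post)
print(did_accuracy)

# Toy RTs (s): both tasks speed up, but PDM speeds up by an extra 0.1 s.
did_rt = did_estimate([0.9, 1.1], [0.7, 0.9], [1.0, 1.2], [0.9, 1.1])
print(round(did_rt, 3))
```

      In the RT example, the 0.1 s speed-up common to both tasks is removed by the contrast, leaving only the additional change in PDM.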

      Finally, the hierarchical DDM (already in the paper) dissociates latent mechanisms. The post-TMS boundary reduction appears only in PDM, whereas VDM shows a change in non-decision time without a decision-relevant parameter change (Fig. 3c; Supplementary Figs. 4–5). If unmatched difficulty were the sole driver, we would expect parallel effects across tasks, which we do not observe.

      (3.2) No within- or between-participants sham/control TMS condition was employed. This would have strengthened the inference that the apparent TMS effects on behavioural and neural measures can truly be attributed to the left SFS stimulation and not to non-specific peripheral stimulation and/or time-on-task effects.

      We agree that a sham/control condition would further strengthen causal attribution and note this as a limitation. In mitigation, our design incorporates several safeguards already reported in the manuscript:

      · Within-subject pre/post with alternating task blocks and DID modelling (Methods) to difference out non-specific time-on-task effects.

      · Task specificity across levels of analysis: behaviour (PDM accuracy reduction only), computational (boundary reduction only in PDM; no drift change), BOLD (reduced left-SFS accumulated-evidence signal for PDM but not VDM; Fig. 4a–c), and functional coupling (SFS–occipital PPI increase during PDM only; Fig. 5).

      · Matched stimuli and motor outputs across tasks, so any peripheral sensations or general arousal effects should have influenced both tasks similarly; they did not.

      Together, these converging task-selective effects reduce the likelihood that the results reflect non-specific stimulation or time-on-task. We will add an explicit statement in the Limitations noting the absence of sham/control and outlining it as a priority for future work.

      (3.3) No a priori power analysis is presented.

      We appreciate this point. Our sample size (n = 20) matched prior causal TMS and combined TMS–fMRI studies using similar paradigms and analyses (e.g., Philiastides et al., 2011; Rahnev et al., 2016; Jackson et al., 2021; van der Plas et al., 2021; Murd et al., 2021), and was chosen a priori on that basis and the practical constraints of cTBS + fMRI. The within-subject DID approach and hierarchical modelling further improve efficiency by leveraging all trials.

      To address the reviewer’s request for transparency, we will (i) state this rationale in Methods (Participants), and (ii) ensure that all primary effects are reported with 95% CIs or posterior probabilities (already provided for the HDDM as $p_{\mathrm{mcmc}}$). We also note that the design was sensitive enough to detect RT changes in both tasks and a selective accuracy change in PDM, arguing against a blanket lack of power as an explanation for null VDM accuracy effects. We will nevertheless flag the absence of a formal prospective power analysis in the Limitations.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations For The Authors):

      Some important elements of the methods are missing. How was the site for targeting the SFS with TMS identified? The methods described how M1 was located but not SFS.

      Thank you for catching this omission. In the revised Methods we explicitly describe how the left SFS target was localized. Briefly, we used each participant’s T1-weighted anatomical scan and frameless neuronavigation to place a 10-mm sphere at the a priori MNI coordinates (x = −24, y = 24, z = 36) derived from prior work (Heekeren et al., 2004; Philiastides et al., 2011). This sphere was transformed to native space for each participant. The coil was positioned tangentially with the handle pointing posterior-lateral, and coil placement was continuously monitored with neuronavigation throughout stimulation. (All of these procedures mirror what we already report for M1 and are now stated for SFS as well.)
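      For illustration, a spherical target of this kind can be built on a voxel grid as sketched below. This is not the neuronavigation or fMRI software pipeline used in the study. The function name `sphere_mask` is hypothetical, the grid is an assumed 2-mm isotropic MNI-like template (91 × 109 × 91 voxels), and "10-mm" is read as the radius. The transformation to each participant's native space is not shown.

```python
import numpy as np

def sphere_mask(shape, affine, center_mm, radius_mm):
    """Boolean mask of voxels whose centres lie within radius_mm of center_mm.

    `affine` is a 4x4 voxel-to-world matrix mapping voxel indices to mm.
    """
    i, j, k = np.meshgrid(*(np.arange(n) for n in shape), indexing="ij")
    vox = np.stack([i, j, k, np.ones_like(i)], axis=-1)   # homogeneous voxel coords
    world = vox @ np.asarray(affine, float).T             # voxel -> mm coordinates
    dist = np.linalg.norm(world[..., :3] - np.asarray(center_mm, float), axis=-1)
    return dist <= radius_mm

# Assumed 2-mm isotropic MNI-like grid (illustrative, not taken from the paper).
affine = np.array([
    [-2.0, 0.0, 0.0,   90.0],
    [ 0.0, 2.0, 0.0, -126.0],
    [ 0.0, 0.0, 2.0,  -72.0],
    [ 0.0, 0.0, 0.0,    1.0],
])
shape = (91, 109, 91)

# Left SFS coordinates quoted in the response: x = -24, y = 24, z = 36.
mask = sphere_mask(shape, affine, center_mm=(-24, 24, 36), radius_mm=10)
print(int(mask.sum()), "voxels in the sphere")
```

      On this assumed grid the target coordinate falls on voxel index (57, 75, 54), which lies at the centre of the mask.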

      Where to revise the manuscript:

      Methods → Stimulation protocol. After the first sentence naming cTBS, insert:

      “The left SFS target was localized on each participant’s T1-weighted anatomical image using frameless neuronavigation. A 10-mm radius sphere was centered at the a priori MNI coordinates x = −24, y = 24, z = 36 (Heekeren et al., 2004; Philiastides et al., 2011), then transformed to native space. The MR-compatible figure-of-eight coil was positioned tangentially over the target with the handle oriented posterior-laterally, and its position was tracked and maintained with neuronavigation during stimulation.”

      It is not clear how participants were instructed that they should perform the value-difference task. Were they told that they should choose based on their original item value ratings or was it left up to them?

      We agree the instruction should be explicit. Participants were told: “In value-based blocks, choose the item you would prefer to eat at the end of the experiment.” They were informed that one VDM trial would be randomly selected for actual consumption, ensuring incentive-compatibility. We did not ask them to recall or follow their earlier ratings; those ratings were used only to construct evidence (value difference) and to define choice consistency offline.

      Where to revise the manuscript:

      Methods → Experimental paradigm.

      Add a sentence to the VDM instruction paragraph:

      “In value-based (LIKE) blocks, participants were instructed to choose the item they would prefer to consume at the end of the experiment; one VDM trial was randomly selected and implemented, making choices incentive-compatible. Prior ratings were used solely to construct value-difference evidence and to score choice consistency; participants were not asked to recall or match their earlier ratings.”

      Line 86 Introduction, some previous studies were conducted on animals. Why it is problematic that the studies were conducted in animals is not stated. I assume the authors mean that we do not know if their findings will translate to the human brain? I think in fairness to those working with animals it might be worth an extra sentence to briefly expand on this point.

      We appreciate this and will clarify that animal work is invaluable for circuit-level causality, but species differences and putative non-homologous areas (e.g., human SFS vs. rodent FOF) limit direct translation. Our point is not that animal studies are problematic, but that establishing causal roles in humans remains necessary.

      Revision:

      Introduction (paragraph discussing prior animal work). Replace the current sentence beginning “However, prior studies were largely correlational” with:

      “Animal studies provide critical causal insights, yet direct translation to humans can be limited by species-specific anatomy and potential non-homologies (e.g., human SFS vs. frontal orienting fields in rodents). Therefore, establishing causal contributions in the human brain remains essential.”

      Line 100-101: "or whether its involvement is peripheral and merely functionally supporting a larger system" - it is not clear what you mean by 'supporting a larger system'

      We meant that observed SFS activity might reflect upstream/downstream support processes (e.g., attentional control or working-memory maintenance) rather than the computation of evidence accumulation itself. We have rephrased to avoid ambiguity.

      Revision:

      Introduction. Replace the phrase with:

      “or whether its observed activity reflects upstream or downstream support processes (e.g., attention or working-memory maintenance) rather than the accumulation computation per se.”

      The authors do have to make certain assumptions about the BOLD patterns that would be expected of an evidence accumulation region. These assumptions are reasonable and have been adopted in several previous neuroimaging studies. Nevertheless, it should be acknowledged that alternative possibilities exist and this is an inevitable limitation of using fMRI to study decision making. For example, if it turns out that participants collapse their boundaries as time elapses, then the assumption that trials with weaker evidence should have larger BOLD responses may not hold - the effect of more prolonged activity could be cancelled out by the lower boundaries. Again, I think this is just a limitation that could be acknowledged in the Discussion, my opinion is that this is the best effort yet to identify choice-relevant regions with fMRI and the authors deserve much credit for their rigorous approach.

      Agreed. We already ground our BOLD regressors in the DDM literature, but acknowledge that alternative mechanisms (e.g., time-dependent boundaries) can alter expected BOLD–evidence relations. We now add a short limitation paragraph stating this explicitly.

      Revision:

      Discussion (limitations paragraph). Add:

      “Our fMRI inferences rest on model-based assumptions linking accumulated evidence to BOLD amplitude. Alternative mechanisms—such as time-dependent (collapsing) boundaries—could attenuate the prediction that weaker-evidence trials yield longer accumulation and larger BOLD signals. While our behavioural and neural results converge under the DDM framework, we acknowledge this as a general limitation of model-based fMRI.”

      Reviewer #2 (Recommendations For The Authors):

      Minor points

      I suggest the proportion of missed trials should be reported.

      Thank you for the suggestion. In our preprocessing we excluded trials with no response within the task’s response window and any trials failing a priori validity checks. Because non-response trials contain neither a choice nor an RT, they are not entered into the DDM fits or the fMRI GLMs and, by design, carry no weight in the reported results. To keep the focus on the data that informed all analyses, we now (i) state the trial-inclusion criteria explicitly and (ii) report the number of analysed (valid) trials per task and run. This conveys the effective sample size contributing to each condition without altering the analysis set.

      Revision:

      Methods → (at the end of “Experimental paradigm”): “Analyses were conducted on valid trials only, defined as trials with a registered response within the task’s response window and passing pre-specified validity checks; trials without a response were excluded and not analysed.”

      Results → “Behaviour: validity of task-relevant pre-requisites” (add one sentence at the end of the first paragraph): “All behavioural and fMRI analyses were performed on valid trials only (see Methods for inclusion criteria).”

      Figure 4 c is very confusing. Is the legend or caption backwards?

      Thanks for flagging. We corrected the Figure 4c caption to match the colouring and contrasts used in the panel (perceptual = blue/green overlays; value-based = orange/red; ‘post–pre’ contrasts explicitly labeled). No data or analyses were changed, just the wording to remove ambiguity.

      Revision:

      Figure 4 caption (panel c sentence). Replace with:

      “(c) Post–pre contrasts for the trialwise accumulated-evidence regressor show reduced left-SFS BOLD during perceptual decisions (green overlay), with a significantly stronger reduction for perceptual vs value-based decisions (blue overlay). No reduction is observed for value-based decisions.”

      Even if not statistically significant it may be of interest to add the results for Value-based decision making on SFS in Supplementary Table 3.

      Done. We now include the SFS small-volume results for VDM (trialwise accumulated-evidence regressor) alongside the PDM values in the same table, with exact peak, cluster size, and statistics.

      Revision:

      Supplementary Table 3 (title):

      “Regions encoding trialwise accumulated evidence (parametric modulation) during perceptual and value-based decisions, including SFS SVC results for both tasks.”

      Model comparisons: please explain how model complexity is accounted for.

      We clarify that model evidence was compared using the Deviance Information Criterion (DIC), which penalizes model fit by an effective number of parameters (pD). Lower DIC indicates better out-of-sample predictive performance after accounting for model complexity.

      Revision:

      Methods → Hierarchical Bayesian neural-DDM (last paragraph). Add:

      “Model comparison used the Deviance Information Criterion (DIC = D̄ + pD), where pD is the effective number of parameters; thus DIC penalizes model complexity. Lower DIC denotes better predictive accuracy after accounting for complexity.”
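The complexity penalty in the sentence above can be illustrated with a short sketch. It assumes the standard definition of the effective number of parameters, pD = D̄ − D(θ̄), which the response does not spell out; the function name and all deviance values are hypothetical and are not taken from the manuscript.

```python
from statistics import fmean

def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, pD) from posterior deviance samples.

    Assumes pD = D_bar - D(theta_bar), the usual definition of the
    effective number of parameters; DIC = D_bar + pD as in the text.
    """
    d_bar = fmean(deviance_samples)            # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean   # complexity penalty
    return d_bar + p_d, p_d

# Hypothetical deviances for two candidate models (illustration only).
dic_a, pd_a = dic([1010.0, 1012.0, 1008.0, 1010.0], 1004.0)
dic_b, pd_b = dic([1006.0, 1009.0, 1007.0, 1006.0], 990.0)
```

In this toy case model B has the lower mean deviance but the larger penalty, so model A ends up with the lower DIC and would be preferred.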

      Reviewer #3 (Recommendations For The Authors):

      The following issues would benefit from clarification in the manuscript:

      - It is stated that "Our sample size is well within acceptable range, similar to that of previous TMS studies." The sample size being similar to previous studies does not mean it is within an acceptable range. Whether the sample size is acceptable or not depends on the expected effect size. It is perfectly possible that the previous studies cited were all underpowered. What implications might the lack of an a priori power analysis have for the interpretation of the results?

      We agree and have revised our wording. We did not conduct an a priori power analysis. Instead, we relied on a within-participant design that typically yields higher sensitivity in TMS–fMRI settings and on convergence across behavioural, computational, and neural measures. We now acknowledge that the absence of formal power calculations limits claims about small effects (particularly for null findings in VDM), and we frame those null results cautiously.

      Revision:

      Discussion (limitations). Add:

      “The within-participant design enhances statistical sensitivity, yet the absence of an a priori power analysis constrains our ability to rule out small effects, particularly for null results in VDM.”

      - I was confused when trying to match the results described in the 'Behaviour: validity of task-relevant pre-requisites' section on page 6 to what is presented in Figure 1. Specifically, Figure 1C is cited 4 times but I believe two of these should be citing Figure 1B?

      Thank you—this was a citation mix-up. The two places that referenced “Fig. 1C” but described accuracy should in fact point to Fig. 1B. We corrected both citations.

      Revision:

      Results → Behaviour: validity… Change the two incorrect “Fig. 1C” references (when describing accuracy) to “Fig. 1B”.

      - Also, where is the 'SD' coefficient of -0.254 (p-value = 0.123) coming from in line 211? I can't match this to the figure.

      This was a typographical error in an earlier draft. The correct coefficients are those shown in the figure and reported elsewhere in the text (evidence-specific effects: for PDM RTs, SD β = −0.057, p < 0.001; for VDM RTs, VD β = −0.016, p = 0.011; non-relevant evidence terms are n.s.). We removed the erroneous value.

      Revision:

      Results → Behaviour: validity… (sentence with −0.254). Delete the incorrect value and retain the evidence-specific coefficients consistent with Fig. 1B–C.

      - It is reported that reaction times were significantly faster for the perceptual relative to the value-based decision task. Was overall accuracy also significantly different between the two tasks? It appears from Figure 3 that it might be, But I couldn't find this reported in the text.

      To avoid conflating task with evidence composition, we did not emphasize between-task accuracy averages. Our primary tests examine evidence-specific effects and TMS-induced changes within task. For completeness, we now report descriptive mean accuracies by task and point readers to the figure panels that display accuracy as a function of evidence (which is the meaningful comparison in our matched-evidence design). We refrain from additional hypothesis testing here to keep the analyses aligned with our preregistered focus.

      Revision:

      Results → Behaviour: validity… Add:

      “For completeness, group-mean accuracies by task are provided descriptively in Fig. 3a; inferential tests in the manuscript focus on evidence-specific effects and TMS-induced changes within task.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Lack of Sensitivity Analyses for some Key Methodological Decisions: Certain methodological choices in this manuscript diverge from approaches used in previous works. In these cases, I recommend the following: (i) The authors could provide a clear and detailed justification for these deviations from established methods, and (ii) supplementary sensitivity analyses could be included to ensure the robustness of the findings, demonstrating that the results are not driven primarily by these methodological changes. Below, I outline the main areas where such evaluations are needed:

      This detailed guidance is incredibly valuable, and we are grateful. Work of this kind is in its relative infancy, and there are so many design choices depending on the data available, questions being addressed, and so on. Help navigating these choices has been extremely useful. In our revised manuscript we are very happy to add additional justification for design choices made, and wherever possible test the impact of those choices. It is certainly the case that different approaches have been used across the handful of papers published in this space, and, unlike in other areas of systems neuroscience, we have yet to reach the point where any of these approaches are established. We agree with the reviewer that wherever possible these design choices should be tested.

      Use of Communicability Matrices for Structural Connectivity Gradients: The authors chose to construct structural connectivity gradients using communicability matrices, arguing that diffusion map embedding "requires a smooth, fully connected matrix." However, by definition, the creation of the affinity matrix already involves smoothing and ensures full connectedness. I recommend that the authors include an analysis of what happens when the communicability matrix step is omitted. This sensitivity test is crucial, as it would help determine whether the main findings hold under a simpler construction of the affinity matrix. If the results significantly change, it could indicate that the observations are sensitive to this design choice, thereby raising concerns about the robustness of the conclusions. Additionally, if the concern is related to the large range of weights in the raw structural connectivity (SC) matrix, a more conventional approach is to apply a log-transformation to the SC weights (e.g., log(1+𝑆𝐶<sub>𝑖𝑗</sub>)), which may yield a more reliable affinity matrix without the need for communicability measures.

      The reason we used communicability is indeed partly because we wanted to guarantee a smooth fully connected matrix, but also because our end goal for this project was to explore structure-function coupling in these low-dimensional manifolds.  Structural communicability – like standard metrics of functional connectivity – includes both direct and indirect pathways, whereas streamline counts only capture direct communication. In essence we wanted to capture not only how information might be routed from one location to another, but also the more likely situation in which information propagates through the system. 

      In the revised manuscript we have given a clearer justification for why we wanted to use communicability as our structural measure (Page 4, Line 179):

      “To capture both direct and indirect paths of connectivity and communication, we generated weighted communicability matrices using SIFT2-weighted fibre bundle capacity (FBC). These communicability matrices reflect a graph theory measure of information transfer previously shown to maximally predict functional connectivity (Esfahlani et al., 2022; Seguin et al., 2022). This also foreshadowed our structure-function coupling analyses, whereby network communication models have been shown to increase coupling strength relative to streamline counts (Seguin et al., 2020)”.

      We have also referred the reader to a new section of the Results that includes the structural gradients based on the streamline counts (Page 7, line 316):

      “Finally, as a sensitivity analysis, to determine the effect of communicability on the gradients, we derived affinity matrices for both datasets using a simpler measure: the log of raw streamline counts. The first 3 components derived from streamline counts compared to communicability were highly consistent across both NKI  (r<sub>s</sub> = 0.791, r<sub>s</sub> = 0.866, r<sub>s</sub> = 0.761) and the referred subset of CALM (r<sub>s</sub> = 0.951, r<sub>s</sub> = 0.809, r<sub>s</sub> = 0.861), suggesting that in practice the organisational gradients are highly similar regardless of the SC metric used to construct the affinity matrices”. 
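The contrast between the two structural measures can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes one common weighted definition of communicability (the matrix exponential of the strength-normalised weight matrix), which the response does not specify, and it uses a hypothetical four-node toy network.

```python
import numpy as np

def weighted_communicability(weights):
    """exp(D^-1/2 W D^-1/2) for a symmetric, non-negative weight matrix W.

    Assumption: strength normalisation as in a common weighted definition;
    every node must have at least one connection.
    """
    w = np.asarray(weights, dtype=float)
    inv_sqrt_strength = 1.0 / np.sqrt(w.sum(axis=1))
    normalised = w * np.outer(inv_sqrt_strength, inv_sqrt_strength)
    # The matrix is symmetric, so the exponential follows from eigh.
    evals, evecs = np.linalg.eigh(normalised)
    return (evecs * np.exp(evals)) @ evecs.T

def log_streamlines(weights):
    """log(1 + SC): the simpler measure used in the sensitivity analysis."""
    return np.log1p(np.asarray(weights, dtype=float))

# Hypothetical chain 0-1-2-3 with no direct connection between nodes 0 and 3.
toy_sc = np.array([[0, 5, 0, 0],
                   [5, 0, 2, 0],
                   [0, 2, 0, 8],
                   [0, 0, 8, 0]], dtype=float)
comm = weighted_communicability(toy_sc)
log_sc = log_streamlines(toy_sc)
```

Nodes 0 and 3 share no streamlines, so their log-count entry is zero, whereas communicability assigns them a positive value through the indirect path via nodes 1 and 2.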

      Methodological ambiguity/lack of clarity in the description of certain evaluation steps: Some aspects of the manuscript’s methodological description are ambiguous, making it challenging for future readers to fully reproduce the analyses based on the information provided. I believe the following sections would benefit from additional detail and clarification:

      Computation of Manifold Eccentricity: The description of how eccentricity was computed (both in the results and methods sections) is unclear and may be problematic. The main ambiguity lies in how the group manifold origin was defined or computed. (1) In the results section, it appears that separate manifold origins were calculated for the NKI and CALM groups, suggesting a dataset-specific approach. (2) Conversely, the methods section implies that a single manifold origin was obtained by somehow combining the group origins across the three datasets, which seems contradictory. Moreover, including neurodivergent individuals in defining the central group manifold origin is conceptually problematic. Given that neurodivergent participants might exhibit atypical brain organization, as suggested by Figure 1, this inclusion could skew the definition of what should represent a typical or normative brain manifold. A more appropriate approach might involve constructing the group manifold origin using only the neurotypical participants from both the NKI and CALM datasets. Given the reported similarity between group-level manifolds of neurotypical individuals in CALM and NKI, it would be reasonable to expect that this combined origin should be close to the origin computed within neurotypical samples of either NKI or CALM. As a sanity check, I recommend reporting the distance of the combined neurotypical manifold origin to the centres of the neurotypical manifolds in each dataset. Moreover, if the manifold origin was constructed while utilizing all samples (including neurodivergent samples), I think this needs to be reconsidered.

      This is a great point, and we are very happy to clarify. Separate manifolds were calculated for the NKI and CALM participants, hence a dataset-specific approach. Indeed, in the long-run our goal was to explore individual differences in these manifolds, relative to the respective group-level origins, and their intersection across modalities, so manifold eccentricity was calculated at an individual level for subsequent analyses. At the group level, for each modality, we computed 3 manifold origins: one for NKI, one for the referred subset of CALM, and another for the neurotypical portion of CALM. Crucially, because the manifolds are always normal, in each case the manifold origin point is near-zero (extremely near-zero, to the 6<sup>th</sup> or 7<sup>th</sup> decimal place). In other words, we do indeed calculate the origin separately each time we calculate the gradients, but the origin is zero in every case. As a result, differences in the origin point cannot be the source of any differences we observe in manifold eccentricity between groups or individuals. We have updated the Methods section with the manifold origin points for each dataset and clarified our rationale (Page 16, Line 1296):

      “Note that we used a dataset-specific approach when we computed manifold eccentricity for each of the three groups relative to their group-level origin: neurotypical CALM (SC origin = -7.698 x 10<sup>-7</sup>, FC origin = 6.724 x 10<sup>-7</sup>), neurodivergent CALM (SC origin = -6.422 x 10 , FC origin = 1.363 x 10 ), and NKI (SC origin = -7.434 x 10 , FC origin = 4.308 x 10<sup>-6</sup>). Eccentricity is a relative measure and thus normalised relative to the origin. Because of this normalisation, each time gradients are constructed the manifold origin is necessarily near-zero, meaning that differences in manifold eccentricity of individual nodes, either between groups or individuals, stem from the eccentricity of that node rather than a difference in origin point”. 

      We clarified the computation of the respective manifold origins within the Results section, and referred the reader to the relevant Methods section (Page 9, line 446):

      “For each modality (2 levels: SC and FC) and dataset (3 levels: neurotypical CALM, neurodivergent CALM, and NKI), we computed the group manifold origin as the mean of their respective first three gradients. Because of the normal nature of the manifolds this necessarily means that these origin points will be very near-zero, but we include the exact values in the ‘Manifold Eccentricity’ methodology sub-section”. 
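A minimal sketch of the computation described above, under two assumptions that the response does not state explicitly: the origin is the per-gradient mean across nodes of the first three group-level gradients, and eccentricity is the Euclidean distance of each node from that origin. The function name and toy coordinates are hypothetical.

```python
import numpy as np

def manifold_eccentricity(gradients, origin):
    """Euclidean distance of each node from the manifold origin.

    gradients: array of shape (n_nodes, 3), the first three gradients.
    origin:    array of shape (3,), the group-level manifold origin.
    """
    g = np.asarray(gradients, dtype=float)
    return np.linalg.norm(g - np.asarray(origin, dtype=float), axis=1)

# Hypothetical zero-centred group gradients for four nodes.
group_gradients = np.array([[ 1.0,  0.0, 0.0],
                            [-1.0,  0.0, 0.0],
                            [ 0.0,  2.0, 0.0],
                            [ 0.0, -2.0, 0.0]])
group_origin = group_gradients.mean(axis=0)   # zero for centred gradients
eccentricity = manifold_eccentricity(group_gradients, group_origin)
```

Because the toy gradients are centred, the origin is zero and each node's eccentricity reduces to the length of its coordinate vector, which mirrors the point that differences in eccentricity cannot arise from differences in origin.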

      Individual-Level Gradients vs. Group-Level Gradients: Unlike previous studies that examined alterations in principal gradients (e.g., Xia et al., 2022; Dong et al., 2021), this manuscript focuses on gradients derived directly from individual-level data. In contrast, earlier works have typically computed gradients based on grouped data, such as using a moving window of individuals based on age (Xia et al.) or evaluating two distinct age groups (Dong et al.). I believe it is essential to assess the sensitivity of the findings to this methodological choice. Such an evaluation could clarify whether the observed discrepancies with previous reports are due to true biological differences or simply a result of different analytical strategies.

      This is a brilliant point. The central purpose of our project was to test how individual differences in these gradients, and their intersection across modalities, related to differences in phenotype (e.g. cognitive difficulties). This necessitated calculating gradients at the level of individuals and building a pipeline to do so, given that we could find no other examples. Nonetheless, despite this different goal and thus approach, we had expected to replicate a couple of other key findings, most prominently the ‘swapping’ of gradients shown by Dong et al. (2021). We were also surprised that we did not find this change in order. The reviewer is right that there could be several design features that produce the difference, and in the revised manuscript we test several of them. We have added the following text to the manuscript as a sensitivity analysis for the Results sub-section titled “Stability of individual-level gradients across developmental time” (Page 7, Line 344 onwards):

      “One possibility is that our observation of gradient stability – rather than a swapping of the order for the first two gradients (Dong et al., 2021) – is because we calculated them at an individual level. To test this, we created subgroups and contrasted the first two group-level structural and functional gradients derived from children (younger than 12 years old) versus those from adolescents (12 years old and above), using the same age groupings as prior work (Dong et al., 2021). If our use of individually calculated gradients produces the stability, then we should observe the swapping of gradients in this sensitivity analysis. Using baseline scans from NKI, the primary structural gradient in childhood (N = 99), as shown in Figure 1f, was highly correlated (r<sub>s</sub> = 0.995) with that derived from adolescents (N = 123). Likewise, the secondary structural gradient in childhood was highly consistent in adolescence (r<sub>s</sub> = 0.988). In terms of functional connectivity, the principal gradient in childhood (N = 88) was highly consistent in adolescence (r<sub>s</sub> = 0.990, N = 125). The secondary gradient in childhood was again highly similar in adolescence (r<sub>s</sub> = 0.984). The same result occurred in the CALM dataset: In the baseline referred subset of CALM, the primary and secondary communicability gradients derived from children (N = 258) and adolescents (N = 53) were near-identical (r<sub>s</sub> = 0.991 and r<sub>s</sub> = 0.967, respectively). The primary and secondary functional gradients derived from children (N = 130) and adolescents (N = 43) were also near-identical (r<sub>s</sub> = 0.972 and r<sub>s</sub> = 0.983, respectively). These consistencies across development suggest that gradients of communicability and functional connectivity established in childhood are the same as those in adolescence, irrespective of group-level or individual-level analysis. 
Put simply, our failure to replicate the swapping of gradient order in Dong et al. (2021) is not the result of calculating gradients at the level of individual participants.”

      Procrustes Transformation: It is unclear why the authors opted to include a Procrustes transformation in this analysis, especially given that previous related studies (e.g., Dong et al.) did not apply this step. I believe it is crucial to evaluate whether this methodological choice influences the results, particularly in the context of developmental changes in organizational gradients. Specifically, the Procrustes transformation may maximize alignment to the group-level gradients, potentially masking individual-level differences. This could result in a reordering of the gradients (e.g., swapping the first and second gradients), which might obscure true developmental alterations. It would be informative to include an analysis showing the impact of performing vs. omitting the Procrustes transformation, as this could help clarify whether the observed effects are robust or an artifact of the alignment procedure. (Please also refer to my comment on adding a subplot to Figure 1). Additionally, clarifying how exactly the transformation was applied to align gradients across hemispheres, individuals, and/or datasets would help resolve ambiguity. 

      The current study investigated individual differences in connectome organisation, rather than group-level trends (Dong et al., 2021). This necessitates aligning individual gradients to the corresponding group-level template using a Procrustes rotation. Without a rotation, there is no way of knowing if you are comparing ‘like with like’: the manifold eccentricity of a given node may appear to change across individuals simply due to subtle differences in the arbitrary orientation of the underlying manifolds. We also note that prior work examining individual differences in principal alignment has used Procrustes (Xia et al., 2022), demonstrating emergence of the principal gradient across development, albeit with much smaller effects than Dong and colleagues (2021). Nonetheless, we agree, the Procrustes rotation could be another source of the differences we observed with the previous paper (Dong et al. 2021). We explored the impact of the Procrustes rotation on individual gradients as our next sensitivity analysis. We recalculated everyone’s gradients without Procrustes rotation. We then tested the alignment of each participant with the group-level gradients using Spearman’s correlations, followed by a series of generalised linear models to predict principal gradient alignment using head motion, age, and sex. The expected swapping of the first and second functional gradient (Dong et al., 2021) would be represented by a decrease in the spatial similarity of each child’s principal functional gradient to the principal childhood group-level gradient, at the onset of adolescence (~age 12). However, there is no age effect on this unrotated alignment, suggesting that the lack of gradient swapping in our data does not appear to be the result of the Procrustes rotation. When you use unrotated individual gradients the alignment is remarkably consistent across childhood and adolescence. Alignment is, however, related to head motion, which is often related to age. 
To emphasise the importance of motion, particularly in relation to development, we conducted a mediation analysis of the relationship between age and principal alignment (without correcting for motion), with motion as a mediator, within the NKI dataset. Before accounting for motion, the relationship between age and principal alignment is significant, but this can be entirely accounted for by motion. In our revised manuscript we have included this additional analysis in the Results sub-section titled “Stability of individual-level gradients across developmental time”, following on from the above point about the effect of group-level versus individual-level analysis (Page 8, Line 400):

      “A second possible source of discrepancy between our results and those of prior work examining developmental change in group-level functional gradients (Dong et al., 2021) was the use of Procrustes alignment. Such alignment of individual-level gradients to group-level templates is a necessary step to ensure valid comparisons between corresponding gradients across individuals, and has been implemented in sliding-window developmental work tracking functional gradient development (Xia et al., 2022). Nonetheless, we tested whether our observation of stable principal functional and communicability gradients may be an artefact of the Procrustes rotation. We did this by modelling how individual-level alignment without Procrustes rotation to the group-level templates varies with age, head motion, and sex, as a series of generalised linear models. We included head motion as the magnitude of the Procrustes rotation has been shown to be positively correlated with mean framewise displacement (Sasse et al., 2024), and prior group-level work (Dong et al., 2021) included an absolute motion threshold rather than continuous motion estimates. Using the baseline referred CALM sample, there was no significant relationship between alignment and age (β = -0.044, 95% CI = [-0.154, 0.066], p = 0.432) after accounting for head motion and sex. Interestingly, however, head motion was significantly associated with alignment (β = -0.318, 95% CI = [-0.428, -0.207], p = 1.731 x 10<sup>-8</sup>), such that greater head motion was linked to weaker alignment. Note that older children tended to exhibit less motion for their structural scans (r<sub>s</sub> = 0.335, p < 0.001). We observed similar trends in functional alignment, whereby tighter alignment was significantly predicted by lower head motion (β = -0.370, 95% CI = [-0.509, -0.231], p = 1.857 x 10<sup>-7</sup>), but not by age (β = 0.049, 95% CI = [-0.090, 0.187], p = 0.490). 
Note that age and head motion for functional scans were not significantly related (r<sub>s</sub> = -0.112, p = 0.137). When repeated for the baseline scans of NKI, alignment with the principal structural gradient was not significantly predicted by either scan age (β = 0.019, 95% CI = [-0.124, 0.163], p = 0.792) or head motion (β = -0.133, 95% CI = [-0.175, 0.009], p = 0.067) together in a single model, where age and motion were negatively correlated (r<sub>s</sub> = -0.355, p < 0.001). Alignment with the principal functional gradient was significantly predicted by head motion (β = -0.183, 95% CI = [-0.329, -0.036], p = 0.014) but not by age (β= 0.066, 95% CI = [-0.081, 0.213], p = 0.377), where age and motion were also negatively correlated (r<sub>s</sub> = -0.412, p < 0.001). Across modalities and datasets, alignment with the principal functional gradient in NKI was the only example in which there was a significant correlation between alignment and age (r<sub>s</sub> = 0.164, p = 0.017) before accounting for head motion and sex. This suggests that apparent developmental effects on alignment are minimal, and where they do exist they are removed after accounting for head motion. Put together this suggests that the lack of order swapping for the first two gradients is not the result of the Procrustes rotation – even without the rotation there is no evidence for swapping”.

      “To emphasise the importance of head motion in the appearance of developmental change in alignment, we examined whether accounting for head motion removes any apparent developmental change within NKI. Specifically, we tested whether head motion mediates the relationship between age and alignment (Figure 1X), controlling for sex, given that higher motion is associated with younger children (β = -0.429, 95% CI = [-0.552, -0.305], p = 7.957 x 10<sup>-11</sup>), and stronger alignment is associated with reduced motion (β = -0.211, 95% CI = [-0.344, -0.078], p = 2.017 x 10<sup>-3</sup>). Motion mediated the relationship between age and alignment (β = 0.078, 95% CI = [0.006, 0.146], p = 1.200 x 10<sup>-2</sup>), accounting for 38.5% of the variance in the age-alignment relationship, such that the link between age and alignment became non-significant after accounting for motion (β = 0.066, 95% CI = [-0.081, 0.214], p = 0.378). This firstly confirms our GLM analyses, where we control for motion and find no age associations. Moreover, this suggests that caution is required when associations between age and gradients are observed. In our analyses, because we calculate individual gradients, we can correct for individual differences in head motion in all our analyses. However, other than using an absolute motion threshold and motion-matched child and adolescent groups, individual differences in motion were not accounted for by prior work which demonstrated a flipping of the principal functional gradients with age (Dong et al., 2021)”. 
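The logic of the mediation analysis can be sketched with ordinary least squares. This is an illustration under stated assumptions rather than the authors' procedure: it uses the product-of-coefficients decomposition (total effect = direct effect + indirect effect), omits the significance testing reported above, and runs on simulated data with hypothetical effect sizes and variable names.

```python
import numpy as np

def ols(y, *predictors):
    """Least-squares coefficients for y ~ 1 + predictors (intercept first)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def mediation(x, mediator, y, covariate):
    """Product-of-coefficients mediation, adjusting every model for covariate."""
    a = ols(mediator, x, covariate)[1]        # path x -> mediator
    coefs = ols(y, x, mediator, covariate)
    direct, b = coefs[1], coefs[2]            # x -> y given mediator; mediator -> y
    total = ols(y, x, covariate)[1]           # x -> y without the mediator
    return {"indirect": a * b, "direct": direct, "total": total}

# Simulated data (hypothetical): age affects alignment only through motion.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
sex = rng.integers(0, 2, size=n).astype(float)
motion = -0.4 * age + rng.normal(size=n)
alignment = -0.2 * motion + 0.5 * rng.normal(size=n)
result = mediation(age, motion, alignment, sex)
```

In the simulation the total age effect is carried by the indirect path through motion, so the direct effect shrinks towards zero once motion enters the model, which is the pattern the response describes for NKI.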

      We further clarify the use of Procrustes rotation as a separate sub-section within the Methods (Page 25, Line 1273):

      “Procrustes Rotation

      For group-level analysis, for each hemisphere we constructed an affinity matrix using a normalized angle kernel and applied diffusion-map embedding. The left hemisphere was then aligned to the right using a Procrustes rotation. For individual-level analysis, eigenvectors for the left hemisphere were aligned with the corresponding group-level rotated eigenvectors. No alignment was applied across datasets. The only exception to this was for structural gradients derived from the referred CALM cohort. Specifically, we aligned the principal gradient of the left hemisphere to the secondary gradient of the right hemisphere: this was due to the first and second gradients explaining a very similar amount of variance, and hence their order was switched”. 
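
      The rotation step described above can be illustrated with a minimal sketch. It assumes an orthogonal Procrustes solution obtained from a singular value decomposition; the function name, the toy data, and the choice of 100 nodes per hemisphere are hypothetical and are not taken from the manuscript's code.

```python
import numpy as np

def procrustes_rotate(source, target):
    """Rotate `source` (nodes x gradients) onto `target` with an orthogonal matrix."""
    # Orthogonal Procrustes: find R minimising ||source @ R - target||_F
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation

rng = np.random.default_rng(0)
# Hypothetical right-hemisphere gradients (100 nodes, 3 gradients)
right = rng.standard_normal((100, 3))
# Toy left hemisphere: same gradients with the first two swapped and one sign flipped
left = right[:, [1, 0, 2]] * np.array([1.0, -1.0, 1.0])

aligned, rotation = procrustes_rotate(left, right)
```

      In this toy case the rotation undoes both the order swap and the sign flip, which is why comparing rotated and unrotated gradients is a useful check on whether the rotation itself could mask order swapping.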

      SC-FC Coupling Metric: The approach used to quantify nodal SC-FC coupling in this study appears to deviate from previously established methods in the field. The manuscript describes coupling as the "Spearman-rank correlation between Euclidean distances between each node and all others within structural and functional manifolds," but this description is unclear and lacks sufficient detail. Furthermore, this differs from what is typically referred to as SC-FC coupling in the literature. For instance, the cited study by Park et al. (2022) utilizes a multiple linear regression framework, where communicability, Euclidean distance, and shortest path length are independent variables predicting functional connectivity (FC), with the adjusted R-squared score serving as the coupling index for each node. On the other hand, the Baum et al. (2020) study, also cited, uses Spearman correlation, but between raw structural connectivity (SC) and FC values. If the authors opt to introduce a novel coupling metric, it is essential to demonstrate its similarity to these previous indices. I recommend providing an analysis (supplementary) showing the correlation between their chosen metric and those used in previous studies (e.g., the adjusted R-squared scores from Park et al. or the SC-FC correlation from Baum et al.). Furthermore, if the metrics are not similar and results are sensitive to this alternative metric, it raises concerns about the robustness of the findings. A sensitivity analysis would therefore be helpful (in case the novel coupling metric is not like previous ones) to determine whether the reported effects hold true across different coupling indices.

      This is a great point, and we are happy to take the reviewer’s recommendation. There are multiple different ways of calculating structure-function coupling. For our set of questions, it was important that our metric incorporated information about the structural and functional manifolds, rather than being a separate approach that is unrelated to these low-dimensional embeddings. Put simply, we wanted our coupling measure to be about the manifolds and gradients outlined in the early sections of the results. We note that the multiple linear regression framework was developed by Vázquez-Rodríguez and colleagues (2019), whilst the structure-function coupling computed in manifold space by Park and colleagues (2022) was operationalised as a linear correlation between z-transformed functional connectomes and structural differentiation eigenvectors. To clarify how this coupling was calculated, and to justify why we developed a new coupling method based on manifolds rather than borrow an existing approach from the literature, we have revised the manuscript to make this far clearer for readers (Page 13, line 604):

      “To examine the relationship between each node’s relative position in structural and functional manifold space, we turned our attention to structure-function coupling. Whilst prior work typically computed coupling using raw streamline counts and functional connectivity matrices, either as a correlation (Baum et al., 2020) or through a multiple linear regression framework (Vázquez-Rodríguez et al., 2019), we opted to directly incorporate low-dimensional embeddings within our coupling framework. Specifically, as opposed to correlating row-wise raw functional connectivity with structural connectivity eigenvectors (Park et al., 2022), our metric directly incorporates the relative position of each node in low-dimensional structural and functional manifold spaces. Each node was situated in a low-dimensional 3D space, the axes of which were each participant’s gradients, specific to each modality. For each participant and each node, we computed the Euclidean distance with all other nodes within structural and functional manifolds separately, producing a vector of size 200 x 1 per modality. The nodal coupling coefficient was the Spearman correlation between each node’s Euclidean distance to all other nodes in structural manifold space, and that in functional manifold space. Put simply, a strong nodal coupling coefficient suggests that that node occupies a similar location in structural space, relative to all other nodes, as it does in functional space”. 
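
      A minimal sketch of the nodal coupling coefficient described above, under stated assumptions: each participant's manifold coordinates are held in a nodes x 3 array per modality, the zero self-distance is dropped before ranking (the text does not say whether it is kept), and all function names and the toy data are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Tie-aware ranks (ties receive their average rank)."""
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ranks) / counts)[inverse]

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def nodal_coupling(structural, functional):
    """structural, functional: (nodes x 3) manifold coordinates for one participant."""
    n_nodes = structural.shape[0]
    # Pairwise Euclidean distances between nodes within each manifold
    d_s = np.linalg.norm(structural[:, None, :] - structural[None, :, :], axis=-1)
    d_f = np.linalg.norm(functional[:, None, :] - functional[None, :, :], axis=-1)
    coupling = np.empty(n_nodes)
    for i in range(n_nodes):
        others = np.arange(n_nodes) != i  # assumption: exclude the self-distance
        coupling[i] = spearman(d_s[i, others], d_f[i, others])
    return coupling

rng = np.random.default_rng(0)
structural = rng.standard_normal((200, 3))                     # toy structural manifold
functional = structural + 0.5 * rng.standard_normal((200, 3))  # toy functional manifold
coupling = nodal_coupling(structural, functional)
```

      Because the coefficient is rank-based, uniformly rescaling one manifold leaves it unchanged; only the relative ordering of each node's distances matters.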

      We also agree with the reviewer’s recommendation to compare this to some of the more standard ways of calculating coupling. We compare our metric with 3 others (Baum et al., 2020; Park et al., 2022; Vázquez-Rodríguez et al., 2019), and find that all metrics capture the core developmental sensorimotor-to-association axis (Sydnor et al., 2021). Interestingly, manifold-based coupling measures captured this axis more strongly than non-manifold measures. We have updated the Results accordingly (Page 14, Line 638):

      “To evaluate our novel coupling metric, we compared its cortical spatial distribution to three others (Baum et al., 2020; Park et al., 2022; Vázquez-Rodríguez et al., 2019), using the group-level thresholded structural and functional connectomes from the referred CALM cohort. As shown in Figure 4c, our novel metric was moderately positively correlated to that of a multi-linear regression framework (r<sub>s</sub> = 0.494, p<sub>spin</sub> = 0.004; Vázquez-Rodríguez et al., 2019) and nodal correlations of streamline counts and functional connectivity (r<sub>s</sub> = 0.470, p<sub>spin</sub> = 0.005; Baum et al., 2020). As expected, our novel metric was strongly positively correlated to the manifold-derived coupling measure (r<sub>s</sub> = 0.661, p<sub>spin</sub> < 0.001; Park et al., 2022), more so than the first (Z(198) = 3.669, p < 0.001) and second measure (Z(198) = 4.012, p < 0.001). Structure-function coupling is thought to be patterned along a sensorimotor-association axis (Sydnor et al., 2021): all four metrics displayed weak-to-moderate alignment (Figure 4c). Interestingly, the manifold-based measures appeared most strongly aligned with the sensorimotor-association axis: the novel metric was more strongly aligned than the multi-linear regression framework (Z(198) = -11.564, p < 0.001) and the raw connectomic nodal correlation approach (Z(198) = -10.724, p < 0.001), but the previously-implemented structural manifold approach was more strongly aligned than the novel metric (Z(198) = -12.242, p < 0.001). This suggests that our novel metric exhibits the expected spatial distribution of structure-function coupling, and the manifold approach more accurately recapitulates the sensorimotor-association axis than approaches based on raw connectomic measures”.

      We also added the following to the legend of Figure 4 on page 15:

      “d. The inset Spearman correlation plot of the 4 coupling measures shows moderate-to-strong correlations (p<sub>spin</sub> < 0.005 for all spatial correlations). The accompanying lollipop plot shows the alignment between the sensorimotor-to-association axis and each of the 4 coupling measures, with the novel measure coloured in light purple (p<sub>spin</sub> < 0.007 for all spatial correlations)”.

      Prediction vs. Association Analysis: The term “prediction” is used throughout the manuscript to describe what appear to be in-sample association tests. This terminology may be misleading, as prediction generally implies an out-of-sample evaluation where models trained on a subset of data are tested on a separate, unseen dataset. If the goal of the analyses is to assess associations rather than make true predictions, I recommend refraining from the term “prediction” and instead clarifying the nature of the analysis. Alternatively, if prediction is indeed the intended aim (which would be more compelling), I suggest conducting the evaluations using a k-fold cross-validation framework. This would involve training the Generalized Additive Mixed Models (GAMMs) on a portion of the data and testing their predictive accuracy on a held-out sample (i.e. different individuals). Additionally, the current design appears to focus on predicting SC-FC coupling using cognitive or pathological dimensions. This is contrary to the more conventional approach of predicting behavioural or pathological outcomes from brain markers like coupling. Could the authors clarify why this reverse direction of analysis was chosen? Understanding this choice is crucial, as it impacts the interpretation and potential implications of the findings.

      We have replaced “prediction” with “association” across the manuscript. However, for analyses corresponding to Figure 5, which we believe to be the most compelling, we conducted a stratified 5-fold cross-validation procedure, outlined below, repeated 100 times to account for random variation in the train-test splits. To assess whether prediction accuracy in the test splits was significantly greater than chance, we compared our results to those derived from a null dataset in which cognitive factor 2 scores had been permuted across participants. To account for the time-series element and block design of our data, in that some participants had 2 or more observations, we permuted entire participant blocks of cognitive factor 2 scores, keeping all other variables, including covariates, the same. Included in our manuscript are methodological details and results pertaining to this procedure. Specifically, the following has been added to the Results (Page 16, Line 758):

      “To examine the predictive value of the second cognitive factor for global and network-level structure-function coupling, operationalised as a Spearman rank correlation coefficient, we implemented a stratified 5-fold cross-validation framework and compared predictive accuracy with that of a null data frame in which cognitive factor 2 scores were permuted across participant blocks (see ‘GAMM cross-validation’ in the Methods). This procedure was repeated 100 times to account for randomness in the train-test splits, using the same model specification as above. Therefore, for each of the 5 network partitions in which an interaction between the second cognitive factor and age was a significant predictor of structure-function coupling (global, visual, somato-motor, dorsal attention, and default-mode), we conducted a Welch’s independent-sample t-test to compare 500 empirical prediction accuracies with 500 null prediction accuracies. Across all 5 network partitions, predictive accuracy of coupling was significantly higher than that of models trained on permuted cognitive factor 2 scores (all p < 0.001). We observed the largest difference between empirical (M = 0.029, SD = 0.076) and null (M = -0.052, SD = 0.087) prediction accuracy in the somato-motor network [t (980.791) = 15.748, p < 0.001, Cohen’s d = 0.996], and the smallest difference between empirical (M = 0.080, SD = 0.082) and null (M = 0.047, SD = 0.081) prediction accuracy in the dorsal attention network [t (997.720) = 6.378, p < 0.001, Cohen’s d = 0.403]. To compare relative prediction accuracies, we ordered networks by descending mean accuracy and conducted a series of Welch’s independent sample t-tests, followed by FDR correction (Figure 5X). Prediction accuracy was highest in the default-mode network (M = 0.265, SD = 0.085), two-fold that of global coupling (t(992.824) = 25.777, p<sub>FDR</sub> = 5.457 x 10<sup>-112</sup>, Cohen’s d = 1.630, M = 0.131, SD = 0.079). Global prediction accuracy was significantly higher than the visual network (t (992.644) = 9.273, p<sub>FDR</sub> = 1.462 x 10<sup>-19</sup>, Cohen’s d = 0.586, M = 0.083, SD = 0.085), but visual prediction accuracy was not significantly higher than within the dorsal attention network (t (997.064) = 0.554, p<sub>FDR</sub> = 0.580, Cohen’s d = 0.035, M = 0.080, SD = 0.082). Finally, prediction accuracy within the dorsal attention network was significantly stronger than that of the somato-motor network [t (991.566) = 10.158, p<sub>FDR</sub> = 7.879 x 10<sup>-23</sup>, Cohen’s d = 0.642, M = 0.029, SD = 0.076]. Together, this suggests that out-of-sample developmental predictive accuracy for structure-function coupling, using the second cognitive factor, is strongest in the higher-order default-mode network, and lowest in the lower-order somato-motor network”.

      We have added a separate section for GAMM cross-validation in the Methods (Page 27, Line 1361):

      GAMM cross-validation

      “We implemented a 5-fold cross validation procedure, stratified by dataset (2 levels: CALM or NKI). All observations from any given participant were assigned to either the testing or training fold, to prevent data leakage, and the cross-validation procedure was repeated 100 times, to account for randomness in data splits. The outcome was predicted global or network-level structure-function coupling across all test splits, operationalised as the Spearman rank correlation coefficient. To assess whether prediction accuracy exceeded chance, we compared empirical prediction accuracy with that of GAMMs trained and tested on null data in which cognitive factor 2 scores were permuted across subjects. The number of observations formed 3 exchangeability blocks (N = 320 with one observation, N = 105 with two observations, and N = 33 with three observations), whereby scores from a participant with two observations were replaced by scores from another participant with two observations, with participant-level scores kept together, and so on for all numbers of observations. We compared empirical and null prediction accuracies using independent sample t-tests as, although the same participants were examined, the shuffling meant that the relative ordering of participants within both distributions was not preserved. For parallelisation and better stability when estimating models fit on permuted data, we used the bam function from the mgcv R package (Wood, 2017)”. 
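
      The two bookkeeping steps described above, keeping each participant's observations in a single fold and permuting scores within exchangeability blocks, can be sketched as follows. This is an illustrative Python sketch rather than the R/mgcv implementation used in the manuscript. It assumes each participant belongs to one dataset and that rows are ordered consistently by session; all names and the toy data are hypothetical.

```python
import numpy as np

def grouped_stratified_folds(participant, dataset, n_folds=5, seed=0):
    """Assign whole participants (not single observations) to folds, stratified by dataset."""
    rng = np.random.default_rng(seed)
    participant, dataset = np.asarray(participant), np.asarray(dataset)
    fold_of = {}
    for level in np.unique(dataset):
        ids = np.unique(participant[dataset == level])
        rng.shuffle(ids)
        for k, pid in enumerate(ids):
            fold_of[pid] = k % n_folds  # deal participants round-robin across folds
    return np.array([fold_of[p] for p in participant])

def permute_within_blocks(participant, score, seed=0):
    """Swap participant-level scores only among participants with equal numbers of observations."""
    rng = np.random.default_rng(seed)
    participant = np.asarray(participant)
    score = np.asarray(score, dtype=float)
    ids, counts = np.unique(participant, return_counts=True)
    permuted = score.copy()
    for n_obs in np.unique(counts):
        block = ids[counts == n_obs]      # one exchangeability block
        donors = rng.permutation(block)
        for receiver, donor in zip(block, donors):
            permuted[participant == receiver] = score[participant == donor]
    return permuted

# Toy data: ten participants, four of whom have two observations
participant = np.array([0, 0, 1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 9])
dataset = np.where(participant < 5, "CALM", "NKI")
score = np.arange(14.0)

folds = grouped_stratified_folds(participant, dataset)
null_score = permute_within_blocks(participant, score)
```

      Assigning folds at the participant level is what prevents leakage between training and test splits, while block-wise permutation keeps each participant's scores together in the null data.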

      We also added a justification for why we predicted coupling using behaviour or psychopathology, rather than vice versa (Page 27, Line 1349):

      “When using our GAMMs to test for the relationship between cognition and psychopathology and our coupling metrics, we opted to predict structure-function coupling using cognitive or psychopathological dimensions, rather than vice versa, to minimise multiple comparisons. In the current framework, we corrected for 8 multiple comparisons within each domain. This would have increased to 16 multiple comparison corrections for predicting two cognitive dimensions using network-level coupling, and 24 multiple comparison corrections for predicting three psychopathology dimensions. Incorporating multiple networks as predictors within the same regression framework introduces collinearity, whilst the behavioural dimensions were orthogonal: for example, coupling is strongly correlated between the somato-motor and ventral attention networks (r<sub>s</sub> = 0.721), between the default-mode and frontoparietal networks (r<sub>s</sub> = 0.670), and between the dorsal attention and fronto-parietal networks (r<sub>s</sub> = 0.650)”. 

      Finally, we noticed a rounding error in the ages of the data frame containing the structure-function coupling values and the cognitive/psychopathology dimensions. We rectified this and replaced the GAMM results, which largely remained the same. 

      In typical applications of diffusion map embedding, sparsification (e.g., retaining only the top 10% of the strongest connections) is often employed at the vertex-level resolution to ensure computational feasibility. However, since the present study performs the embedding at the level of 200 brain regions (a considerably coarser resolution), this step may not be necessary or justifiable. Specifically, for FC, it might be more appropriate to retain all positive connections rather than applying sparsification, which could inadvertently eliminate valuable information about lower-strength connections. Whereas for SC, as the values are strictly non-negative, retaining all connections should be feasible and would provide a more complete representation of the structural connectivity patterns. Given this, it would be helpful if the authors could clarify why they chose to include sparsification despite the coarser regional resolution, and whether they considered this alternative approach (using all available positive connections for FC and all non-zero values for SC). It would be interesting if the authors could provide their thoughts on whether the decision to run evaluations at the resolution of brain regions could itself impact the functional and structural manifolds, their alteration with age, and/or their stability (in contrast to Dong et al. which tested alterations in high-resolution gradients).

      This is another great point. We could retain all connections, but we usually implement some form of sparsification to reduce noise, particularly in the case of functional connectivity. But we nonetheless agree with the reviewer’s point. We should check what impact this is having on the analysis. In brief, we found minimal effects of thresholding, suggesting that the strongest connections are driving the gradient (Page 7, Line 304):

      “To assess the effect of sparsity on the derived gradients, we examined group-level structural (N = 222) and functional (N = 213) connectomes from the baseline session of NKI. The first three functional connectivity gradients derived using the full connectivity matrix (density = 92%) were highly consistent with those obtained from retaining the strongest 10% of connections in each row (r<sub>1</sub> = 0.999, r<sub>2</sub> = 0.998, r<sub>3</sub> < 0.999, all p < 0.001). Likewise, the first three communicability gradients derived from retaining all streamline counts (density = 83%) were almost identical to those obtained from 10% row-wise thresholding (r<sub>1</sub> = 0.994, r<sub>2</sub> = 0.963, r<sub>3</sub> = 0.955, all p < 0.001). This suggests that the reported gradients are driven by the strongest or most consistent connections within the connectomes, with minimal additional information provided by weaker connections. In terms of functional connectivity, such consistency reinforces past work demonstrating that the sensorimotor-to-association axis, the major axis within the principal functional connectivity gradient, emerges across both the top- and bottom-ranked functional connections (Nenning et al., 2023)”.
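
      Row-wise thresholding of the kind compared above can be sketched as below. This is a minimal illustration, assuming a dense nodes x nodes connectome and a fixed fraction of entries retained per row; the function name and toy matrix are hypothetical.

```python
import numpy as np

def threshold_rows(connectome, keep=0.10):
    """Keep the strongest `keep` fraction of entries in each row and zero the rest."""
    connectome = np.asarray(connectome, dtype=float)
    n_keep = max(1, int(round(keep * connectome.shape[1])))
    # Column indices of the n_keep largest entries in every row
    top = np.argpartition(connectome, -n_keep, axis=1)[:, -n_keep:]
    sparse = np.zeros_like(connectome)
    rows = np.arange(connectome.shape[0])[:, None]
    sparse[rows, top] = connectome[rows, top]
    return sparse

rng = np.random.default_rng(0)
fc = rng.standard_normal((200, 200))
fc = (fc + fc.T) / 2            # toy symmetric "functional connectome"
sparse_fc = threshold_rows(fc)  # retains 20 of 200 entries per row
```

      Note that thresholding each row independently generally yields an asymmetric matrix, which is one reason an affinity matrix is computed from the thresholded rows before embedding.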

      Furthermore, we appreciate the nudge to share our thoughts on whether the difference between vertex versus nodal metrics could be important here, particularly regarding thresholds. To combine this point with R2’s recommendation to expand the Discussion, we have added the following paragraph (Page 19, Line 861): 

      “We consider the role of thresholding, cortical resolution, and head motion as avenues to reconcile the present results with select reports in the literature (Dong et al., 2021; Xia et al., 2022). We would suggest that thresholding has a greater effect on vertex-level data than on parcel-level data. For example, a recent study revealed that the emergence of principal vertex-level functional connectivity gradients in childhood and adolescence is indeed threshold-dependent (Dong et al., 2024). Specifically, the characteristic unimodal organisation for children and transmodal organisation for adolescents only emerged at the 90% threshold: a 95% threshold produced a unimodal organisation in both groups, whilst an 85% threshold produced a transmodal organisation in both groups. Put simply, the ‘swapping’ of gradient orders only occurs at certain thresholds. Furthermore, our results are not necessarily contradictory to this prior report (Dong et al., 2021): developmental changes in high-resolution gradients may be supported by a stable low-dimensional coarse manifold. Indeed, our decision to use parcellated connectomes was partly driven by recent work which demonstrated that vertex-level functional gradients may be derived using biologically-plausible but random data with sufficient spatial smoothing, whilst this effect is minimal at coarser resolutions (Watson & Andrews, 2023). We observed a gradual increase in the variance of individual connectomes accounted for by the principal functional connectivity gradient in the referred subset of CALM, in line with prior vertex-level work demonstrating a gradual emergence of the sensorimotor-association axis as the principal axis of connectivity (Xia et al., 2022), as opposed to a sudden shift. It is also possible that vertex-level data is more prone to motion artefacts in the context of developmental work. Transitioning from vertex-level to parcel-level data involves smoothing over short-range connectivity, thus greater variability in short-range connectivity can be observed in vertex-level data. However, motion artefacts are known to increase short-range connectivity and decrease long-range connectivity, mimicking developmental changes (Satterthwaite et al., 2013). Thus, whilst vertex-level data offers greater spatial resolution in representation of short-range connectivity relative to parcel-level data, it is possible that this may come at the cost of making our estimates of the gradients more prone to motion”.

      Evaluating the consistency of gradients across development: the results shown in Figure 1e are used as evidence suggesting that gradients are consistent across ages. However, I believe additional analyses are required to identify potential sources of the observed inconsistency compared to previous works. The claim that the principal gradient explains a similar degree of variance across ages does not necessarily imply that the spatial structure remains the same. The observed variance explanation is hence not enough to ascertain inconsistency with findings from Dong et al., as the spatial configuration of gradients may still change over time. I suggest the following additional analyses to strengthen this claim. Alignment to group-level gradients: Assess how much of the variance in individual FC matrices is explained by each of the group-level gradients (G1, G2, and G3, for both FC and SC). This analysis could be visualized similarly to Figure 1e, with age on the x-axis and variance explained on the y-axis. If the explained variance varies as a function of age, it may indicate that the gradients are not as consistent as currently suggested. 

      This is another great suggestion. In the additional analyses above (new group-level analyses and unrotated gradient analyses) we rule out a couple of the potential causes of the different developmental trends we observe in our data – namely the stability of the gradients over time. The suggested additional analysis is a great idea, and we have implemented it as follows (Page 8, Line 363):

      “To evaluate the consistency of gradients across development, across baseline participants with functional connectomes from the referred CALM cohort (N = 177), we calculated the proportion of variance in individual-level connectomes accounted for by group-level functional gradients. Specifically, we calculated the proportion of variance in an adjacency matrix A accounted for by the vector v<sub>i</sub> as the fraction of the square of the scalar projection of v<sub>i</sub> onto A, over the Frobenius norm of A. Using a generalised linear model, we then tested whether the proportion of variance explained varies systematically with age, controlling for sex and head motion. The variance in individual-level functional connectomes accounted for by the group-level principal functional gradient gradually increased with development (β = 0.111, 95% CI = [0.022, 0.199], p = 1.452 x 10<sup>-2</sup>, Cohen’s d = 0.367), as shown in Figure 1g, and decreased with higher head motion (β = -10.041, 95% CI = [-12.379, -7.702], p = 3.900 x 10<sup>-17</sup>), with no effect of sex (β = 0.071, 95% CI = [-0.380, 0.523], p = 0.757). We observed no developmental effects on the variance explained by the second (r<sub>s</sub> = 0.112, p = 0.139) or third (r<sub>s</sub> = 0.053, p = 0.482) group-level functional gradient. When repeated with the baseline functional connectivity for NKI (N = 213), we observed no developmental effects (β = 0.097, 95% CI = [-0.035, 0.228], p = 0.150) on the variance explained by the principal functional gradient after accounting for motion (β = -3.376, 95% CI = [-8.281, 1.528], p = 0.177) and sex (β = -0.368, 95% CI = [-1.078, 0.342], p = 0.309). However, before accounting for head motion and sex, we observed a significant correlation between age and variance explained (r<sub>s</sub> = 0.137, p = 0.046). We observed no developmental effects on the variance explained by the second functional gradient (r<sub>s</sub> = -0.066, p = 0.338), but a weak negative developmental effect on the variance explained by the third functional gradient (r<sub>s</sub> = -0.189, p = 0.006). Note, however, the magnitude of the variance accounted for by the third functional gradient was very small (all < 1%). When applied to communicability matrices in CALM, the proportion of variance accounted for by the group-level communicability gradient was negligible (all < 1%), precluding analysis of developmental change”.
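
      One possible reading of the variance-explained calculation described above is sketched below. It is an interpretation, not the manuscript's code: the scalar projection is taken to be v<sup>T</sup>Av for a unit-norm v, and the denominator is taken to be the squared Frobenius norm, so that the proportions across a full eigenbasis of a symmetric A sum to one. The function name and toy matrix are hypothetical.

```python
import numpy as np

def variance_explained(adjacency, vector):
    """Proportion of variance in a symmetric matrix captured by one direction."""
    v = vector / np.linalg.norm(vector)           # unit-normalise the gradient
    projection = v @ adjacency @ v                # assumed scalar projection of v onto A
    return projection**2 / np.linalg.norm(adjacency, "fro")**2

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 50))
a = (a + a.T) / 2                                 # toy symmetric connectome
_, eigenvectors = np.linalg.eigh(a)
proportions = np.array([variance_explained(a, eigenvectors[:, i]) for i in range(50)])
```

      Under this reading, a group-level gradient that is not an eigenvector of an individual's connectome captures less than the corresponding individual-level eigenvector would, which is the quantity regressed on age in the analysis above.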

      “To further probe the consistency of gradients across development, we examined developmental changes in the standard deviation of gradient values, corresponding to heterogeneity, following prior work examining morphological (He et al., 2025) and functional connectivity gradients (Xia et al., 2022). Using a series of generalised linear models within the baseline referred subset of CALM, correcting for head motion and sex, we found that gradient variation for the principal functional gradient increased across development (β = 0.219, 95% CI = [0.091, 0.347], p = 0.001, Cohen’s d = 0.504), indicating greater heterogeneity (Figure 1h), whilst gradient variation for the principal communicability gradient decreased across development (β = -0.154, 95% CI = [-0.267, -0.040], p = 0.008, Cohen’s d = -0.301), indicating greater homogeneity (Figure 1h). Note, a paired t-test on the 173 common participants demonstrated a significant effect of modality on gradient variability (t(172) = -56.639, p = 3.663 x 10<sup>-113</sup>), such that the mean variability of communicability gradients (M = 0.033, SD = 0.001) was less than half that of functional connectivity (M = 0.076, SD = 0.010). Together, this suggests that principal functional connectivity and communicability gradients are established early in childhood and display age-related refinement, but not replacement”.

      The Issue of Abstraction and Benefits of the Gradient-Based View: The manuscript interprets the eccentricity findings as reflecting changes along the segregation-integration spectrum. Given this, it is unclear why a more straightforward analysis using established graph-theory metrics of segregation-integration was not pursued instead. Mapping gradients and computing eccentricity adds layers of abstraction and complexity. If similar interpretations can be derived directly from simpler graph metrics, what additional insights does the gradient-based framework offer? While the manuscript argues that this approach provides “a more unifying account of cortical reorganization”, it is not evident why this abstraction is necessary or advantageous over traditional graph metrics. Clarifying these benefits would strengthen the rationale for using this method.

      This is a great point, and something we spent quite a bit of time considering when designing the analysis. The central goal of our project was to identify gradients of brain organisation across different datasets and modalities and then test how the organisational principles of those modalities align. In other words, how do structural and functional ‘spaces’ intersect, and does this vary across the cortex? That for us was the primary motivation for operationalising organisation as nodal location within a low-dimensional manifold space (Bethlehem et al., 2020; Gale et al., 2022; Park et al., 2021), using a simple composite measure to achieve compression, rather than as a series of graph metrics. The reason we subsequently calculated those graph metrics and tested for their association was simply to help us interpret what eccentricity within that low-dimensional space means. Manifold eccentricity was moderately positively correlated with graph-theory metrics of integration, leaving a substantial portion of variance unaccounted for, but we think that association is nonetheless helpful for readers trying to interpret eccentricity. However, since manifold eccentricity (ME) tells us about the relative position of a node in that low-dimensional space, it is also likely capturing elements of multiple graph theory measures. Following the Reviewer’s question, this is something we decided to test. Specifically, using 4 measures of segregation, including two new metrics requested by the Reviewer in a minor point (weighted clustering coefficient and normalized degree centrality), we conducted a dominance analysis (Budescu, 1993) with normalized manifold eccentricity of the group-level referred CALM structural connectome. We also detail the use of gradient measures in developmental contexts, and how they can be complementary to traditional graph theory metrics.

      We have added the following to the Results section (Page 10, Lines 472 onwards): 

      “To further contextualise manifold eccentricity in terms of integration and segregation beyond simple correlations, we conducted a multivariate dominance analysis (Budescu, 1993) of four graph theory metrics of segregation as predictors of nodal normalized manifold eccentricity within the group-level referred CALM structural and functional connectomes (Figure 2c). A dominance analysis assesses the relative importance of each predictor in a multilinear regression framework by fitting 2<sup>n</sup> – 1 models (where n is the number of predictors) and calculating the relative increase in adjusted R2 caused by adding each predictor to the model across both main effects and interactions. A multilinear regression model including weighted clustering coefficient, within-module degree Z-score, participation coefficient and normalized degree centrality accounted for 59% of variance in nodal manifold eccentricity in the group-level CALM structural connectome. Withinmodule degree Z score was the most important predictor (40.31% dominance), almost twice that of the participation coefficient (24.03% dominance) and normalized degree centrality (24.05% dominance) which made roughly equal contributions. The least important predictor was the weighted clustering coefficient (11.62% dominance). When the same approach was applied for the group-level referred CALM functional connectome, the 4 predictors accounted for 52% variability. However, in contrast to the structural connectome, functional manifold eccentricity seemed to incorporate the same graph theory metrics in different proportions. Normalized degree centrality was the most important predictor (47.41% dominance), followed by withinmodule degree Z-score (24.27%), and then the participation coefficient (15.57%) and weighted clustering coefficient (12.76%) which made approximately equal contributions. 
Thus, whilst structural manifold eccentricity was dominated most by within-module degree Z-score and least by the weighted clustering coefficient, functional manifold eccentricity was dominated most by normalized degree centrality and least by the weighted clustering coefficient. This suggests that manifold mapping techniques incorporate different aspects of integration dependent on modality. Together, manifold eccentricity acts as a composite measure of segregation, being differentially sensitive to different aspects of segregation, without necessitating a priori specification of graph theory metrics. Further discussion of the value of gradient-based metrics in developmental contexts and as a supplement to traditional graph theory analyses is provided in the ‘Manifold Eccentricity’ methodology sub-section”. 
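The dominance analysis described above can be illustrated with a minimal sketch. It is not the code used for the manuscript: the data are synthetic, the function names (`solve`, `adjusted_r2`, `general_dominance`) are hypothetical, and only main-effect models are fitted. Under those assumptions, it fits all 2<sup>n</sup> – 1 predictor subsets by ordinary least squares and averages each predictor's gain in adjusted R<sup>2</sup>, first within and then across subset sizes (general dominance).

```python
import random
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def adjusted_r2(X, y, cols):
    """Adjusted R^2 of an OLS fit of y on the chosen columns plus an intercept."""
    n = len(y)
    if not cols:
        return 0.0  # intercept-only model explains no variance
    design = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(design[0])
    XtX = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    Xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    mean_y = sum(y) / n
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
                 for d, yi in zip(design, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - len(cols) - 1)


def general_dominance(X, y):
    """Return (per-predictor general dominance, adjusted R^2 of the full model)."""
    p = len(X[0])
    fit = {(): 0.0}
    for size in range(1, p + 1):  # 2^p - 1 fitted models in total
        for subset in combinations(range(p), size):
            fit[subset] = adjusted_r2(X, y, subset)
    dominance = []
    for i in range(p):
        others = [j for j in range(p) if j != i]
        per_size = []
        for size in range(p):
            gains = [fit[tuple(sorted(s + (i,)))] - fit[s]
                     for s in combinations(others, size)]
            per_size.append(sum(gains) / len(gains))
        dominance.append(sum(per_size) / p)
    return dominance, fit[tuple(range(p))]


# Synthetic illustration only: four independent predictors, the last is pure noise.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [2.0 * r[0] + 1.0 * r[1] + 0.5 * r[2] + random.gauss(0, 1) for r in X]
dominance, full_fit = general_dominance(X, y)
percent_dominance = [100.0 * d / full_fit for d in dominance]
```

By construction the general dominance values sum to the adjusted R<sup>2</sup> of the full model, which is why they can be reported as percentages of the variance explained.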

      We added further justification to the manifold eccentricity Methods subsection (Page 26, line 1283):

      “Gradient-based measures hold value in developmental contexts, above and beyond traditional graph theory metrics: within a sample of over 600 cognitively-healthy adults aged between 18 and 88 years old, the sensitivity of gradient-based within-network functional dispersion to age was stronger and more consistent across networks than that of segregation (Bethlehem et al., 2020). In the context of microstructural profile covariance, modules resolved by Louvain community detection occupied distinct positions across the principal two gradients, suggesting that gradients offer a way to meaningfully order discrete graph theory analyses (Paquola et al., 2019)”. 

      We added the following to the Introduction section outlining the application of gradients as cortex-wide coordinate systems (Page 3, Line 121):

      “Using the gradient-based approach as a compression tool, thus forgoing the need to specify singular graph theory metrics a priori, we operationalised individual variability in low-dimensional manifolds as eccentricity (Gale et al., 2022; Park et al., 2021). Crucially, such gradients appear to be useful predictors of phenotypic variation, exceeding edge-level connectomics. For example, in the case of functional connectivity gradients, their predictive ability for externalizing symptoms and general cognition in neurotypical adults surpassed that of edge-level connectome-based predictive modelling (Hong et al., 2020), suggesting that low-dimensional manifolds may be particularly powerful biomarkers of psychopathology and cognition”. 

      We also added the following to the Discussion section (Page 18, Line 839):

      “By capitalising on manifold eccentricity as a composite measure of segregation across development, we build upon an emerging literature pioneering gradients as a method to establish underlying principles of structural (Paquola et al., 2020; Park et al., 2021) and functional (Dong et al., 2021; Margulies et al., 2016; Xia et al., 2022) brain development without a priori specification of specific graph theory metrics of interest”. 

      It is unclear whether the statistical tests finding significant dataset effects are capturing effects of neurotypical vs. neurodivergent, or simply different scanners/sites. Could the neurotypical portion of CALM also be added to distinguish between these two sources of variability affecting dataset effects (i.e. ideally, separating the effect of site from that of neurotypicality would better isolate the effect of neurodivergence)?

      At a group level, differences in the gradients between the two cohorts are very minor. Indeed, in the manuscript we describe these gradients as being seemingly ‘universal’. But we agree that we should test whether any simple main effects of ‘dataset’ result from the different site or from the phenotype of the participants. The neurotypical portion of CALM (collected at the same site on the same scanner) helped us show that any minor differences in the gradient alignments are likely due to the site/scanner differences rather than the phenotype of the participants. We took the same approach for testing the simple main effects of dataset on manifold eccentricity. To better parse neurotypicality and site effects at an individual level, we conducted a series of sensitivity analyses. First, in response to the reviewer’s earlier comment, we conducted a series of nodal generalized linear models for communicability and FC gradients derived from neurotypical and neurodivergent portions of CALM, alongside NKI, and tested for an effect of neurotypicality above and beyond scanner. As at the group level, having those additional scans on a ‘comparison’ sample for CALM is very helpful in teasing apart these effects. We find that neurotypicality affects communicability gradient expression to a greater degree than functional connectivity. We visualised these results and added them to Figure 1. Second, we used the same approach but for manifold eccentricity. Again, we demonstrate greater sensitivity of communicability to neurotypicality at a global level, but we cannot pin these effects down to specific networks because the effects do not survive the necessary multiple comparison correction. We have added these analyses to the manuscript (Page 13, Line 583): 

      “Much as with the gradients themselves, we suspected that much of the simple main effect of dataset could reflect the scanner / site, rather than the difference in phenotype. Again, we drew upon the CALM comparison children to help us disentangle these two explanations. As a sensitivity analysis to parse effects of neurotypicality and dataset on manifold eccentricity, we conducted a series of generalized linear models predicting mean global and network-level manifold eccentricity, for each modality. We did this across all the baseline data (i.e. including the neurotypical comparison sample for CALM) using neurotypicality (2 levels: neurodivergent or neurotypical), site (2 levels: CALM or NKI), sex, head motion, and age at scan (Figure 3X). We restricted our analysis to baseline scans to create more equally-balanced groups. In terms of structural manifold eccentricity (N = 313 neurotypical, N = 311 neurodivergent), we observed higher manifold eccentricity in the neurodivergent participants at a global level (β = 0.090, p = 0.019, Cohen’s d = 0.188) but the individual network level effects did not survive the multiple comparison correction necessary for looking across all seven networks, with the default-mode network being the strongest (β = 0.135, p = 0.027, p<sub>FDR</sub> = 0.109, Cohen’s d = 0.177). There was no significant effect of neurodiversity on functional manifold eccentricity (N = 292 neurotypical and N = 177 neurodivergent). This suggests that neurodiversity is significantly associated with structural manifold eccentricity, over and above differences in site, but we cannot distinguish these effects reliably in the functional manifold data”. 

      Third, we removed the Scheirer-Ray-Hare test from the results for two reasons. First, its initial implementation did not account for repeated measures, and therefore non-independence between observations, as the same participants may have contributed both structural and functional data. Second, if we wanted to repeat this analysis in CALM using the referred and control portions, a significant difference in group size existed, which may affect the measures of variability. Specifically, for baseline CALM, 311 referred and 91 control participants contributed SC data, whilst 177 referred and 79 control participants contributed FC data. We believe that the ‘cleanest’ parsing of dataset and site for effects of eccentricity is achieved using the GLMs in Figure 3. 

      We observed no significant effect of neurodivergence on the magnitude of structure-function coupling across development, and have added the following text (Page 14, Line 632):

      “To parse effects of neurotypicality and dataset on structure-function coupling, we conducted a series of generalized linear models predicting mean global and network-level coupling using neurotypicality, site, sex, head motion, and age at scan, at baseline (N = 77 CALM neurotypical, N = 173 CALM neurodivergent, and N = 170 NKI). However, we found no significant effects of neurotypicality on structure-function coupling across development”. 

      Since we demonstrated no significant effects of neurotypicality on structure-function coupling magnitude across development, but found differential dataset-specific effects of age on coupling development, we added the following sentence at the end of the coupling trajectory results sub-section (Page 14, line 664):

      “Together, these effects demonstrate that whilst the magnitude of structure-function coupling appears not to be sensitive to neurodevelopmental phenotype, its development with age is, particularly in higher-order association networks, with developmental change being reduced in the neurodivergent sample”.  

      Figure 1.c: A non-parametric permutation test (e.g. Mann-Whitney U test) could quantitatively identify regions with significant group differences in nodal gradient values, providing additional support for the qualitative findings.

      This is a great idea. To examine the effect of referral status on nodal gradient values, whilst controlling for covariates (head motion and sex), we conducted a series of generalised linear models (GLMs). We opted for this instead of a Mann-Whitney U test because the Mann-Whitney U test assesses differences in distributions, whilst the t-statistic for referral status from the GLM allows us to specify the magnitude and direction of differences in nodal gradient values between the two groups. Again, we conducted this in CALM (referred vs control), at an individual level, as downstream analyses suggested a main effect of dataset (which is reflected in the highly-similar group-level referred and control CALM gradients). We have updated the Results section with the following text (Page 6, Line 283):

      “To examine the effect of referral status on participant-level nodal gradient values in CALM, we conducted a series of generalized linear models controlling for head motion, sex and age at scan (Figure 1d). We restricted our analyses to baseline scans to reduce the difference in sample size for the referred (311 communicability and 177 functional gradients, respectively) and control participants (91 communicability and 79 functional gradients, respectively), and to the principal gradients. For communicability, 42 regions showed a significant effect (p < 0.05) of neurodivergence before FDR correction, with 9 post FDR correction. 8 of these 9 regions had negative t-statistics, suggesting a reduced nodal gradient value and representation in the neurodivergent children, encompassing both lower-order somatosensory cortices alongside higher-order fronto-parietal and default-mode networks. The largest reductions were observed within the prefrontal cortices of the default-mode network (t = -3.992, p = 6.600 x 10<sup>-5</sup>, p<sub>FDR</sub> = 0.013, Cohen’s d = -0.476), the left orbitofrontal cortex of the limbic network (t = -3.710, p = 2.070 x 10<sup>-4</sup>, p<sub>FDR</sub> = 0.020, Cohen’s d = -0.442) and right somato-motor cortex (t = -3.612, p = 3.040 x 10<sup>-4</sup>, p<sub>FDR</sub> = 0.020, Cohen’s d = -0.431). The right visual cortex was the only exception, with stronger gradient representation within the neurotypical cohort (t = 3.071, p = 0.002, p<sub>FDR</sub> = 0.048, Cohen’s d = 0.366). For functional connectivity, comparatively fewer regions exhibited a significant effect (p < 0.05) of neurotypicality, with 34 regions prior to FDR correction and 1 post. Significantly stronger gradient representation was observed in neurotypical children within the right precentral ventral division of the default-mode network (t = 3.930, p = 8.500 x 10<sup>-5</sup>, p<sub>FDR</sub> = 0.017, Cohen’s d = 0.532). 
Together, this suggests that the strongest and most robust effects of neurodivergence are observed within gradients of communicability, rather than functional connectivity, where alterations in both affect higher-order associative regions”. 
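For readers less familiar with the FDR correction referred to above, the following is a minimal sketch of one widely used procedure, the Benjamini-Hochberg step-up adjustment. Whether this exact variant was used for the nodal models is an assumption on our part for the purpose of illustration; the function name and the p-values in the example are hypothetical and are not study data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for illustration only (not taken from the manuscript).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.6]
toy_adjusted = benjamini_hochberg(toy_p)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that adjusted values never decrease as the raw p-values shrink.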

      In the harmonization methodology, it is mentioned that “if harmonisation was successful, we’d expect any significant effects of scanner type before harmonisation to be non-significant after harmonisation”. However, given that there were no significant effects before harmonization, the results reported do not help in evaluating the quality of harmonization.

      We agree with the Reviewer, and have removed the post-harmonisation GLMs. We now instead state that there were no significant effects of scanner type before harmonization. 

      Figure 3: It would be helpful to include a plot showing the GAMM predictions versus real observations of eccentricity (x-axis: predictions, y-axis: actual values). 

      To plot the GAMM-predicted smooth effects of age, which we used for visualisation purposes only, we used the get_predictions function from the itsadug R package. This creates model predictions using the median value of nuisance covariates. Thus, whilst we specified the entire age range, the function automatically chooses the median of head motion, alongside controlling for sex (default level: male), for each dataset-specific trajectory. Since the gamm4 package separates the fitted model into a gam and a linear mixed effects model (which accounts for participant ID as a random effect), and the get_predictions function only uses the gam, random effects are not modelled in the predicted smooths. Therefore, any discrepancy between the observed and predicted manifold eccentricity values is likely due to sensitivity to default choices of covariates other than age, or to random effects. To prevent Figure 3 from becoming over-crowded, we opted not to include the predictions: these were strongly correlated with real structural manifold data, but less so for functional manifold data, especially where significant developmental change was absent.

      The 30mm threshold for filtering short streamlines in tractography is uncommon. What is the rationale for using such a large threshold, given the potential exclusion of many short-range association fibres?

      A minimum length of 30mm was the default for the MRtrix3 reconstruction workflow, and something we have previously used. In a previous project, we systematically varied the minimum fibre length and found that this had minimal impact on network organisation (e.g. Mousley et al. 2025). However, we accept that short-range association fibres may have been excluded and have included this in the Discussion as a methodological limitation, alongside our predictions for how the gradients and structure-function coupling may have been altered had we included such fibres (Page 20, Line 955):

      “A potential methodological limitation in the construction of structural connectomes was the 30mm tract length threshold which, despite being the QSIprep reconstruction default (Cieslak et al., 2021), may have excluded short-range association fibres. This is pertinent as tracts of different lengths exhibit unique distributions across the cortex and functional roles (Bajada et al., 2019): short-range connections occur throughout the cortex but peak within primary areas, including the primary visual, somato-motor, auditory, and para-hippocampal cortices, and are thought to dominate lower-order sensorimotor functional resting-state networks, whilst long-range connections are most abundant in tertiary association areas and are recruited alongside tracts of varying lengths within higher-order functional resting-state networks. Therefore, inclusion of short-range association fibres may have resulted in a relative increase in representation of lower-order primary areas and functional networks. On the other hand, we also note the potential misinterpretation of short-range fibres: they may be unreliably distinguished from null models in which tractography is restricted by cortical gyri only (Bajada et al., 2019). Further, prior (neonatal) work has demonstrated that the order of connectivity of regions and topological fingerprints are consistent across varying streamline thresholds (Mousley et al., 2025), suggesting minimal impact”. 

      Given the spatial smoothing of fMRI data (6mm FWHM), it would be beneficial to apply connectome spatial smoothing to structural connectivity measures for consistent spatial smoothness.

      This is an interesting suggestion, but given we are looking at structural communicability within a parcellated network, we are not sure that it would make any difference. The structural data are already very smooth. Nonetheless, we have added the following text to the Discussion (Page 20, Line 968): 

      “Given the spatial smoothing applied to the functional connectivity data, and examining its correspondence to streamline-count connectomes through structure-function coupling, applying the equivalent smoothing to structural connectomes may improve the reliability of inference, and subsequent sensitivity to cognition and psychopathology. Connectome spatial smoothing involves applying a smoothing kernel to the two streamline endpoints, whereby variations in smoothing kernels are selected to optimise the trade-off between subject-level reliability and identifiability, thus increasing the signal-to-noise ratio and the reliability of statistical inferences of brain-behaviour relationships (Mansour et al., 2022). However, we note that such smoothing is more effective for high-resolution connectomes than for parcel-level connectomes, and so may have yielded only a modest improvement here (Mansour et al., 2022)”.

      Why was harmonization performed only within the CALM dataset and not across both CALM and NKI datasets? What was the rationale for this decision?

      We thought about this very carefully. Harmonization aims to remove scanner or site effects, whilst retaining the crucial characteristics of interest. Our capacity to retain those characteristics is entirely dependent on them being *fully* captured by covariates, which are then incorporated into the harmonization process. Even with the best set of measures, the idea that we can fully capture ‘neurodivergence’ and thus preserve it in the harmonisation process is dubious. Indeed, across CALM and NKI there is a limited number of common measures (i.e. not the best set of common measures), and thus we are limited in our ability to fully capture neurodivergence with covariates. So, we worried that if we put these two very different datasets into the harmonisation process we would essentially eliminate the interesting differences between the datasets. We have added this text to the harmonization section of the Methods (Page 24, Line 1225):

      “Harmonization aims to retain key characteristics of interest whilst removing scanner or site effects. However, the site effects in the current study are confounded with neurodivergence, and it is unlikely that neurodivergence can be captured fully using common covariates across CALM and NKI. Therefore, to preserve variation in neurodivergence, whilst reducing scanner effects, we harmonized within the CALM dataset only”. 

      The exclusion of subcortical areas from connectivity analyses is not justified. 

      This is a good point. We used the Schaefer atlas because we had previously used this to derive both functional and structural connectomes, but we agree that it would have been good to include subcortical areas. We have added the following text (Page 20, Line 977): 

      “A potential limitation of our study was the exclusion of subcortical regions. However, prior work has shed light on the role of subcortical connectivity in structural and functional gradients, respectively, of neurotypical populations of children and adolescents (Park et al., 2021; Xia et al., 2022). For example, in the context of the primary-to-transmodal and sensorimotor-to-visual functional connectivity gradients, the mean gradient scores within subcortical networks were demonstrated to be relatively stable across childhood and adolescence (Xia et al., 2022). In the context of structural connectivity gradients derived from streamline counts, which we demonstrated were highly consistent with those derived from communicability, subcortical structural manifolds weighted by their cortical connectivity were anchored by the caudate and thalamus at one pole, and by the hippocampus and nucleus accumbens at the opposite pole, with significant age-related manifold expansion within the caudate and thalamus (Park et al., 2021)”. 

      In the KNN imputation method, were uniform weights used, or was an inverse distance weighting applied?

      Uniform weights were used, and we have updated the manuscript appropriately.

      The manuscript should clarify from the outset that the reported sample size (N) includes multiple longitudinal observations from the same individuals and does not reflect the number of unique participants.

      We have rectified the Abstract (Page 2, Line 64) and Introduction (Page 3, Line 138):

      “We charted the organisational variability of structural (610 participants, N = 390 with one observation, N = 163 with two observations, and N = 57 with three) and functional (512 participants, N = 340 with one observation, N = 128 with two observations, and N = 44 with three)”.

      The term “structural gradients” is ambiguous in the introduction. Clarify that these gradients were computed from structural and functional connectivity matrices, not from other structural features (e.g. cortical thickness).

      We have clarified this in the Introduction (Page 3, Line 134):

      “Applying diffusion-map embedding as an unsupervised machine-learning technique onto matrices of communicability (from streamline SIFT2-weighted fibre bundle capacity) and functional connectivity, we derived gradients of structural and functional brain organisation in children and adolescents…”

      Page 5: The sentence, “we calculated the normalized angle of each structural and functional connectome to derive symmetric affinity matrices” is unclear and needs clarification.

      We have clarified this within the second paragraph of the Results section (Page 4, Line 185):

      “To capture inter-nodal similarity in connectivity, using a normalised angle kernel, we derived individual symmetric affinity matrices from the left and right hemispheres of each communicability and functional connectivity matrix. Varying kernels capture different but highly-related aspects of inter-nodal similarity, such as correlation coefficients, Gaussian kernels, and cosine similarity. Diffusion-map embedding is then applied on the affinity matrices to derive gradients of cortical organisation”. 
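As a minimal sketch of the normalised angle kernel mentioned above, the snippet below converts a connectivity matrix into a symmetric affinity matrix. It assumes one common definition of the kernel, namely one minus the angle between two nodes' connectivity profiles divided by π; the function name and the toy matrix are hypothetical, and this is not the manuscript's pipeline code.

```python
import math


def normalised_angle_affinity(conn):
    """Affinity between rows i and j: 1 - arccos(cosine similarity) / pi."""
    n = len(conn)
    norms = [math.sqrt(sum(v * v for v in row)) for row in conn]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(conn[i], conn[j]))
            cos_sim = dot / (norms[i] * norms[j])
            # Guard against floating-point values just outside [-1, 1].
            cos_sim = max(-1.0, min(1.0, cos_sim))
            affinity[i][j] = 1.0 - math.acos(cos_sim) / math.pi
    return affinity


# Toy 3-node connectivity profiles (illustrative values only).
toy_conn = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_affinity = normalised_angle_affinity(toy_conn)
```

Identical profiles receive an affinity of 1, orthogonal profiles 0.5, and opposite profiles 0, so the resulting matrix is symmetric and non-negative, as required before diffusion-map embedding.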

      Figure 1.a: “Affine A” likely refers to the affinity matrix. The term “affine” may be confusing; consider using a clearer label. It would also help to add descriptive labels for rows and columns (e.g. region x region).

      Thank you for this suggestion! We have replaced each of the labels with “pairwise similarity”. We also labelled the rows and columns as regions.

      Figure 1.d: Are the cross-group differences statistically significant? If so, please indicate this in the figure.

      We have added the results of a series of linear mixed effects models to the legend of Figure 1 (Page 6, line 252):

      “indicates a significant effect of dataset (p < 0.05) on variance explained within a linear mixed effects model controlling for head motion, sex, and age at scan”.

      The sentence “whose connectomes were successfully thresholded” in the methods is unclear. What does “successfully thresholded” mean? Additionally, this seems to be the first mention of the Schaefer 100 and Brainnetome atlas; clarify where these parcellations are used. 

      We have amended the Methodology section (Page 23, Line 1138):

      “For each participant, we retained the strongest 10% of connections per row, thus creating fully connected networks required for building affinity matrices. We excluded any connectomes in which such thresholding was not possible due to insufficient non-zero row values. To further ensure accuracy in connectome reconstruction, we excluded any participants whose connectomes failed thresholding in two alternative parcellations: the 100-node Schaefer 7-network (Schaefer et al., 2018) and Brainnetome 246-node (Fan et al., 2016) parcellations, respectively”. 
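The row-wise thresholding rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline code: the function name is hypothetical, the number of retained connections per row is taken as the rounded 10% of the row length (with a minimum of one), and a connectome is flagged for exclusion by returning `None` when any row has too few non-zero values.

```python
def threshold_rows(conn, keep_fraction=0.10):
    """Keep the strongest connections in each row; zero the rest.

    Returns None when any row has fewer non-zero entries than the number
    to be kept, mirroring the exclusion rule described in the text.
    """
    n = len(conn)
    k = max(1, int(round(keep_fraction * n)))  # assumed rounding rule
    thresholded = []
    for row in conn:
        nonzero = [j for j, v in enumerate(row) if v != 0]
        if len(nonzero) < k:
            return None  # thresholding not possible for this connectome
        strongest = sorted(nonzero, key=lambda j: row[j], reverse=True)[:k]
        keep = set(strongest)
        thresholded.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return thresholded


# Toy 10-node matrix (illustrative values only): weight grows with |i - j|.
toy_conn = [[float(abs(i - j)) for j in range(10)] for i in range(10)]
toy_thresholded = threshold_rows(toy_conn)

# A matrix containing an all-zero row cannot be thresholded.
failed = threshold_rows([[0.0] * 10 for _ in range(10)])
```

Note that thresholding each row independently generally yields an asymmetric matrix, which is one reason a symmetric affinity matrix is computed afterwards.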

      We have also specified the use of the Schaefer 200-node parcellation in the first sentence of the second Results paragraph.

      The use of “streamline counts” is misleading, as the method uses SIFT2-weighted fibre bundle capacity rather than raw streamline counts. It would be better to refer to this measure as “SIFT2-weighted fibre bundle capacity” or “FBC”.

      We replaced all instances of “streamline counts” with “SIFT2-weighted fibre bundle capacity” as appropriate.

      Figure 2.c: Consider adding plots showing changes in eccentricity against (1) degree centrality, and (2) weighted local clustering coefficient. Additionally, a plot showing the relationship between age and mean eccentricity (averaged across nodes) at the individual level would be informative.

      We added the correlation between eccentricity and both degree centrality and the weighted local clustering coefficient and included them in our dominance analysis in Figure 2. In terms of the relationship between age and mean (global) eccentricity, these are plotted in Figure 3. 

      Figure 2.b: Considering the results of the following sections, it would be interesting to include additional KDE/violin plots to show group differences in the distribution of eccentricity within 7 different functional networks.

      As part of our analysis to parse neurotypicality and dataset effects, we tested for group differences in the distribution of structural and functional manifold eccentricity within each of the 7 functional networks in the referred and control portions of CALM and have included instances of significant differences with a coloured arrow to represent the direction of the difference within Figure 3. 

      Figure 3: Several panels lack axis labels for x and y axes. Adding these would improve clarity.

      To minimise the amount of text in Figure 3, we opted to include labels only for the global-level structural and functional results. However, to aid interpretation, we added a small schematic at the bottom of Figure 3 to represent all axis labels. 

      The statement that “differences between datasets only emerged when taking development into account” seems inaccurate. Differences in eccentricity are evident across datasets even before accounting for development (see Fig 2.b and the significance in the Scheirer-Ray-Hare test).

      We agree – differences in eccentricity across development and datasets are evident in structural and functional manifold eccentricity, as well as within structure-function coupling. However, effects of neurotypicality were particularly strong for the maturation of structure-function coupling, rather than magnitude. Therefore, we have rephrased this sentence in the Discussion (page 18, line 832):

      “Furthermore, group-level structural and functional gradients were highly consistent across datasets, whilst differences between datasets were emphasised when taking development into account, through differing rates of structural and functional manifold expansion, respectively, alongside maturation of structure-function coupling”.

      The handling of longitudinal data by adding a random effect for individuals is not clear in the main text. Mentioning this earlier could be helpful. 

      We have included this detail in the second sentence of the “developmental trajectories of structural manifold contraction and functional manifold expansion” results sub-section (page 11, line 503):

      “We included a random effect for each participant to account for longitudinal data”. 

      Figure 4.b: Why were ranks shown instead of actual coefficient of variation values? Consider including a cortical map visualization of the coefficients in the supplementary material.

      We visualised the ranks, instead of the actual coefficient of variation (CV) values, due to considerable variability and skew in the magnitude of the CV, ranging from 28.54 (in the right visual network) to 12865.68 (in the parietal portion of the left default-mode network), with a mean of 306.15. If we had visualised the raw CV values, these larger values would’ve been over-represented. We’ve also noticed and rectified an error in the labelling of the colour bar for Figure 4b: the minimum should be most variable (i.e. a rank of 1). To aid contextualisation of the ranks, we have added the following to the Results (page 14, line 626):

      “The distribution of cortical coefficients of variation (CV) varied considerably, with the largest CV (in the parietal division of the left default-mode network) being over 400 times that of the smallest (in the right visual network). The distribution of absolute CVs was positively skewed, with a Fisher skewness coefficient g<sub>1</sub> of 7.172, meaning relatively few regions had particularly high inter-individual variability, and highly peaked, with a kurtosis of 54.883, where a normal distribution has a skewness coefficient of 0 and a kurtosis of 3”. 
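The shape statistics quoted above can be made concrete with a short sketch. It assumes the moment-based definitions implied by the text, namely a Fisher skewness g<sub>1</sub> equal to the third central moment over the second raised to the power 3/2, and a kurtosis for which a normal distribution scores 3. The function names and toy values are hypothetical; only the two CV extremes are taken from the response.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def fisher_skewness(values):
    """g1 = m3 / m2^(3/2); zero for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def pearson_kurtosis(values):
    """m4 / m2^2; equals 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Toy values for illustration only (not study data).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]
skew_symmetric = fisher_skewness(symmetric)
skew_right = fisher_skewness(right_skewed)
kurt_symmetric = pearson_kurtosis(symmetric)

# Ratio of the largest to the smallest CV reported in the response.
cv_ratio = 12865.68 / 28.54
```

The ratio of the two reported extremes is roughly 451, consistent with the statement that the largest CV is over 400 times the smallest.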

      Reviewer #2 (Public review):

      Some differences in developmental trajectories between CALM and NKI (e.g. Figure 4d) are not explained. Are these differences expected, or do they suggest underlying factors that require further investigation?

      This is a great point, and we appreciate the push to give a fuller explanation. It is very hard to know whether these effects are expected or not. We certainly don’t know of any other papers that have taken this approach. In response to the reviewer’s point, we decided to run some more analyses to better understand the differences. Having observed stronger age effects on structure-function coupling within the neurotypical NKI dataset, compared to the absent effects in the neurodivergent portion of CALM, we wanted to follow up and test whether coupling really is more sensitive to the neurodivergent versus neurotypical difference between CALM and NKI (rather than, say, scanner or site effects). In short, we find stronger developmental effects of coupling within the neurotypical portion of CALM than within the neurodivergent portion, and have added this to the Results (page 15, line 701):

      “To further examine whether a closer correspondence of structure-function coupling with age is associated with neurotypicality, we conducted a follow-up analysis using the additional age-matched neurotypical portion of CALM (N = 77). Given the widespread developmental effects on coupling within the neurotypical NKI sample, compared to the absent effects in the neurodivergent portion of CALM, we would expect strong relationships between age and structure-function coupling within the neurotypical portion of CALM. This is indeed what we found: structure-function coupling showed a linear negative relationship with age globally (F = 16.76, p<sub>FDR</sub> < 0.001, adjusted R<sup>2</sup> = 26.44%), alongside fronto-parietal (F = 9.24, p<sub>FDR</sub> = 0.004, adjusted R<sup>2</sup> = 19.24%), dorsal-attention (F = 13.162, p<sub>FDR</sub> = 0.001, adjusted R<sup>2</sup> = 18.14%), ventral attention (F = 11.47, p<sub>FDR</sub> = 0.002, adjusted R<sup>2</sup> = 22.78%), somato-motor (F = 17.37, p<sub>FDR</sub> < 0.001, adjusted R<sup>2</sup> = 21.92%) and visual (F = 11.79, p<sub>FDR</sub> = 0.002, adjusted R<sup>2</sup> = 20.81%) networks. Together, this supports our hypothesis that within neurotypical children and adolescents, structure-function coupling decreases with age, showing a stronger effect compared to their neurodivergent counterparts, in tandem with the emergence of higher-order cognition. Thus, whilst the magnitude of structure-function coupling across development appeared insensitive to neurotypicality, its maturation is sensitive. Tentatively, this suggests that neurotypicality is linked to stronger and more consistent maturational development of structure-function coupling, whereby the tethering of functional connectivity to structure across development is adaptive”. 

      In conjunction with the Reviewer’s later request to deepen the Discussion, we have included an additional paragraph attempting to explain the differences in neurodevelopmental trajectories of structure-function coupling (Page 19, Line 924):

      “Whilst the spatial patterning of structure-function coupling across the cortex has been extensively documented, as explained above, less is known about developmental trajectories of structure-function coupling, or how such trajectories may be altered in those with neurodevelopmental conditions. To our knowledge, only one prior study has examined differences in developmental trajectories of (non-manifold) structure-function coupling in typically-developing children and those with attention-deficit hyperactivity disorder (Soman et al., 2023), one of the most common conditions in the neurodivergent portion of CALM. Namely, using cross-sectional and longitudinal data from children aged between 9 and 14 years old, they demonstrated increased coupling across development in higher-order regions overlapping with the default-mode, salience, and dorsal attention networks, in children with ADHD, with no significant developmental change in controls, thus encompassing an ectopic developmental trajectory (Di Martino et al., 2014; Soman et al., 2023). Whilst the current work does not focus on any condition, rather the broad mixed population of young people with neurodevelopmental symptoms (including those with and without diagnoses), there are meaningful individual and developmental differences in structure-function coupling. Crucially, it is not the case that simply having stronger coupling is desirable. The current work reveals that there are important developmental trajectories in structure-function coupling, suggesting that it undergoes considerable refinement with age. Note that whilst the magnitude of structure-function coupling across development did not differ significantly as a function of neurodivergence, its relationship to age did. Our working hypothesis is that structural connections allow for the ordered integration of functional areas, and the gradual functional modularisation of the developing brain. For instance, those with higher cognitive ability show a stronger refinement of structure-function coupling across development. Future work in this space needs to better understand not just how structural or functional organisation change with time, but rather how one supports the other”.

      The use of COMBAT may have excluded extreme participants from both datasets, which could explain the lack of correlations found with psychopathology.

      COMBAT does not exclude participants from datasets but simply adjusts connectivity estimates. So, the use of COMBAT will not be impacting the links with psychopathology by removing participants. But this did get us thinking. Excluding participants based on high motion may have systematically removed those with high psychopathology scores, meaning incomplete coverage. In other words, we may be under-representing those at the more extreme end of the range, simply because their head-motion levels are higher and thus are more likely to be excluded. We found that despite certain high-motion participants being removed, we still had good coverage of those with high scores and were therefore sensitive within this range. We have added the following to the revised Methods section (Page 26, Line 1338):

      “As we removed participants with high motion, this may have overlapped with those with higher psychopathology scores, resulting in incomplete coverage. To examine coverage and sensitivity to broad-range psychopathology following quality control, we calculated the Fisher-Pearson skewness statistic g<sub>1</sub> for each of the 6 Conners t-statistic measures and the proportion of youth with a t-statistic equal to or greater than 65, indicating an elevated or very elevated score. Measures of inattention (g<sub>1</sub> = 0.11, 44.20% elevated), hyperactivity/impulsivity (g<sub>1</sub> = 0.48, 36.41% elevated), learning problems (g<sub>1</sub> = 0.45, 37.36% elevated), executive functioning (g<sub>1</sub> = 0.27, 38.16% elevated), aggression (g<sub>1</sub> = 1.65, 15.58% elevated), and peer relations (g<sub>1</sub> = 0.49, 38% elevated) were positively skewed and comprised at least 15% of children with elevated or very elevated scores, suggesting sufficient coverage of those with extreme scores”.
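
As a rough illustration of the coverage check described in the passage above, the two quantities (the Fisher-Pearson skewness g1 and the share of scores at or above 65) can be computed in a few lines. This is a minimal sketch and not the code used in the study: the function names and the toy scores are hypothetical, and g1 is taken to be the usual moment-based coefficient m3 / m2^(3/2).

```python
def fisher_pearson_g1(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def proportion_elevated(scores, cutoff=65):
    """Fraction of scores at or above the cutoff (65 is the threshold quoted above)."""
    return sum(score >= cutoff for score in scores) / len(scores)


# Hypothetical scores for one scale, for illustration only.
toy_scores = [50, 52, 55, 58, 60, 66, 70, 90]
toy_g1 = fisher_pearson_g1(toy_scores)
toy_elevated = proportion_elevated(toy_scores)
```

A positive g1 together with a non-trivial elevated fraction is the pattern the passage interprets as adequate coverage of the upper range.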

      There is no discussion of whether the stable patterns of brain organization could result from preprocessing choices or summarizing data to the mean. This should be addressed to rule out methodological artifacts. 

      This is a brilliant point. We are necessarily using a very lengthy pipeline, with many design choices to explore structural and functional gradients and their intersection. In conjunction with the Reviewer’s later suggestion to deepen the Discussion, we have added the following paragraph which details the sensitivity analyses we carried out to confirm the observed stable patterns of brain organization (Page 18, Line 863):

      “That is, whilst we observed developmental refinement of gradients, in terms of manifold eccentricity, standard deviation, and variance explained, we did not observe replacement. Note, as opposed to calculating gradients based on group data, such as a sliding window approach, which may artificially smooth developmental trends and summarise them to the mean, we used participant-level data throughout. Given the growing application of gradient-based analyses in modelling structural (He et al., 2025; Li et al., 2024) and functional (Dong et al., 2021; Xia et al., 2022) brain development, we hope to provide a blueprint of factors which may affect developmental conclusions drawn from gradient-based frameworks”.

      Although imputing missing data was necessary, it would be useful to compare results without imputed data to assess the impact of imputation on findings. 

      It is very hard to know the impact of imputation without simply removing those participants with some imputed data. Using a simulation experiment, we expressed the imputation accuracy as the root mean squared error normalized by the range of observable data in each scale. This produced a percentage error margin. We demonstrate that imputation accuracy across all measures is at worst within approximately 11% of the observed data, and at best within approximately 4% of the observed data, and have included the following in the revised Methods section (Page 27, Line 1348):

      “Missing data

      To avoid a loss of statistical power, we imputed missing data. 27.50% of the sample had one or more missing psychopathology or cognitive measures (equal to 7% of all values), and the data was not missing at random: using a Welch’s t-test, we observed a significant effect of missingness on age [t (264.479) = 3.029, p = 0.003, Cohen’s d = 0.296], whereby children with missing data (M = 12.055 years, SD = 3.272) were younger than those with complete data (M = 12.902 years, SD = 2.685). Using a subset with complete data (N = 456), we randomly sampled 10% of the values in each column with replacement and assigned those as missing, thereby mimicking the proportion of missingness in the entire dataset. We conducted KNN imputation (uniform weights) on the subset with complete data and calculated the imputation accuracy as the root mean squared error normalized by the observed range of each measure. Thus, each measure was assigned a percentage which described the imputation margin of error. Across cognitive measures, imputation was within a 5.40% mean margin of error, with the lowest imputation error in the Trail motor speed task (4.43%) and highest in the Trails number-letter switching task (7.19%). Across psychopathology measures, imputation exhibited a mean 7.81% error margin, with the lowest imputation error in the Conners executive function scale (5.75%) and the highest in the Conners peer relations scale (11.04%). Together, this suggests that imputation was accurate”.
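
The simulation described above (mask a fraction of complete data, impute, then express the error as a range-normalised RMSE) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, all function names are hypothetical, and the nearest-neighbour rule is a simplified pure-Python stand-in with uniform weights rather than the authors' KNN implementation.

```python
import math
import random


def mask_values(rows, fraction, rng):
    """Set about `fraction` of each column to None; indices are drawn with replacement."""
    masked = [list(row) for row in rows]
    n = len(rows)
    for col in range(len(rows[0])):
        for i in rng.choices(range(n), k=max(1, round(fraction * n))):
            masked[i][col] = None
    return masked


def knn_impute(rows, k=5):
    """Fill each None with the unweighted mean of the k nearest rows observing that column."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                # Distance uses only the coordinates observed in both rows.
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((math.sqrt(sum(shared) / len(shared)), other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if not nearest:  # no comparable neighbour: fall back to the column mean
                nearest = [r[j] for r in rows if r[j] is not None]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


def nrmse_percent(complete, masked, imputed, col):
    """RMSE over the masked entries of one column, as a percentage of its observed range."""
    sq = [(imputed[i][col] - complete[i][col]) ** 2
          for i in range(len(complete)) if masked[i][col] is None]
    column = [row[col] for row in complete]
    return 100 * math.sqrt(sum(sq) / len(sq)) / (max(column) - min(column))


# Synthetic, correlated "measures" standing in for the complete-data subset.
rng = random.Random(0)
complete = []
for _ in range(60):
    latent = rng.gauss(0, 1)
    complete.append([latent + rng.gauss(0, 0.3),
                     2 * latent + rng.gauss(0, 0.3),
                     -latent + rng.gauss(0, 0.3)])
masked = mask_values(complete, 0.10, rng)
imputed = knn_impute(masked, k=5)
errors = [nrmse_percent(complete, masked, imputed, col) for col in range(3)]
```

Each entry of `errors` plays the role of the per-measure "margin of error" reported in the passage; normalising by the range makes scales with different units comparable.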

      The results section is extensive, with many reports, while the discussion is relatively short and lacks in-depth analysis of the findings. Moving some results into the discussion could help balance the sections and provide a deeper interpretation.

      We agree with the Reviewer and appreciate the nudge to expand the Discussion section. We have added 4 sections to the Discussion. The first explores the importance of the default-mode network as a region whose coupling is most consistently predicted by working memory across development and phenotypes, in terms of its underlying anatomy (Paquola et al., 2025) (Page 20, Line 977):

      “An emerging theme from our work is the importance of the default-mode network as a region in which structure-function coupling is reliably predicted by working memory across neurodevelopmental phenotypes and datasets during childhood and adolescence. Recent neurotypical adult investigations combining high-resolution post-mortem histology, in vivo neuroimaging, and graph-theory analyses have revealed how the underlying neuroanatomy of the default-mode network may support diverse functions (Paquola et al., 2025), and thus exhibit lower structure-function coupling compared to unimodal regions. The default-mode network has distinct neuroanatomy compared to the remaining 6 intrinsic resting-state functional networks (Yeo et al., 2011), containing a distinctive combination of 5 of the 6 von Economo and Koskinas cell types (von Economo & Koskinas, 1925), with an over-representation of heteromodal cortex, and uniquely balancing output across all cortical types. A primary cytoarchitectural axis emerges, beyond which are mosaic-like spatial topographies. The duality of the default-mode network, in terms of its ability to both integrate and be insulated from sensory information, is facilitated by two microarchitecturally distinct subunits anchored at either end of the cytoarchitectural axis (Paquola et al., 2025). Whilst beyond the scope of the current work, structure-function coupling and their predictive value for cognition may also differ across divisions within the default-mode network, particularly given variability in the smoothness and compressibility of cytoarchitectural landscapes across subregions (Paquola et al., 2025)”.

      The second provides a deeper interpretation and contextualisation of the greater sensitivity of communicability, rather than functional connectivity, to neurodivergence (Page 19, Line 907):

      “We consider two possible factors to explain the greater sensitivity of neurodivergence to gradients of communicability, rather than functional connectivity. First, functional connectivity is likely more sensitive to head motion than structural-based communicability and suffers from reduced statistical power due to stricter head motion thresholds, alongside greater inter-individual variability. Second, whilst prior work contrasting functional connectivity gradients from neurotypical adults with those with confirmed ASD diagnoses demonstrated vertex-level reductions in the default-mode network in ASD and marginal increases in sensory-motor communities (Hong et al., 2019), indicating a sensitivity of functional connectivity to neurodivergence, important differences remain. Specifically, whilst the vertex-level group-level differences were modest, in line with our work, greater differences emerged when considering step-wise functional connectivity (SFC); in other words, when considering the dynamic transitions of or information flow through the functional hierarchy underlying the static functional connectomes, such that ASD was characterised by initial faster SFC within the unimodal cortices followed by a lack of convergence within the default-mode network (Hong et al., 2019). This emphasis on information flow and dynamic underlying states may point towards greater sensitivity of neurodivergence to structural communicability – a measure directly capturing information flow – than static functional connectivity”.

      The third paragraph situates our work within a broader landscape of reliable brain-behaviour relationships, focusing on the strengths of combining clinical and normative samples to refine our interpretation of the relationship between gradients and cognition, as well as the importance of equifinality in developmental predictive work (Page 20, line 994):

      “In an effort to establish more reliable brain-behaviour relationships despite not having the statistical power afforded by large-scale, typically normative, consortia (Rosenberg & Finn, 2022), we demonstrated the development-dependent link between default-mode structure-function coupling and working memory generalised across clinical (CALM) and normative (NKI) samples, across varying MRI acquisition parameters, and harnessing within- and across-participant variation. Such multivariate associations are likely more reliable than their univariate counterparts (Marek et al., 2022), but can be further optimised using task-related fMRI (Rosenberg & Finn, 2022). The consistency, or lack thereof, of developmental effects across datasets emphasises the importance of validating brain-behaviour relationships in highly diverse samples. Particularly evident in the case of structure-function coupling development, through our use of contrasting samples, is equifinality (Cicchetti & Rogosch, 1996), a key concept in developmental neuroscience: namely, similar ‘endpoints’ of structure-function coupling may be achieved through different initialisations dependent on working memory”.

      The fourth paragraph details methodological limitations in response to Reviewer 1’s suggestions to justify the exclusion of subcortical regions and consider the role of spatial smoothing in structural connectome construction as well as the threshold for filtering short streamlines.

      While the methods are thorough, it is not always clear whether the optimal approaches were chosen for each step, considering the available data. 

      In response to Reviewer 1’s concerns, we conducted several sensitivity analyses to evaluate the robustness of our results in terms of procedure. Specifically, we evaluated the impact of thresholding (full or sparse), level of analysis (individual or group gradients), construction of the structural connectome (communicability or fibre bundle capacity), Procrustes rotation (alignment to group-level gradients before Procrustes), tracking the variance explained in individual connectomes by group-level gradients, impact of head motion, and distinguishing between site and neurotypicality effects. All these analyses converged on the same conclusion: whilst we observe some developmental refinement in gradients, we do not observe replacement. We refer the reviewer to their third point, about whether stable patterns of brain organization were artefactual. 

      The introduction is overly long and includes numerous examples that can distract readers unfamiliar with the topic from the main research questions. 

      We have removed the following from the Introduction, reducing it to just under 900 words:

      “At a molecular level, early developmental patterning of the cortex arises through interacting gradients of morphogens and transcription factors (see Cadwell et al., 2019). The resultant areal and progenitor specialisation produces a diverse pool of neurones, glia, and astrocytes (Hawrylycz et al., 2015). Across childhood, an initial burst in neuronal proliferation is met with later protracted synaptic pruning (Bethlehem et al., 2022), the dynamics of which are governed by an interplay between experience-dependent synaptic plasticity and genomic control (Gottlieb, 2007)”.

      “The trends described above reflect group-level developmental trends, but how do we capture these broad anatomical and functional organisational principles at the level of an individual?”

      We’ve also trimmed the second Introduction paragraph so that it includes fewer examples, such as removal of the wiring-cost optimisation that underlies structural brain development, as well as removing specific instances of network segregation and integration that occur throughout childhood.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Strengths: 

      The work uses a simple and straightforward approach to address the question at hand: is dynein a processive motor in cells? Using a combination of TIRF and spinning disc confocal microscopy, the authors provide a clear and unambiguous answer to this question. 

      Thank you for recognizing the strength of our work.

      Weaknesses: 

      My only significant concern (which is quite minor) is that the authors focus their analysis on dynein movement in cells treated with docetaxel, which could potentially affect the observed behavior. However, this is likely necessary, as without it, motility would not have been observed due to the 'messiness' of dynein localization in a typical cell (e.g., plus end-tracking in addition to cargo transport).

      You are exactly correct that this treatment was required to provide us with a clear view of motile dynein and p50 puncta. One concern about the treatment that we had noted in our original submission was that the docetaxel derivative SiR-tubulin could increase microtubule detyrosination, which has been implicated in affecting the initiation of dynein-dynactin motility but not motility rates (doi: 10.15252/embj.201593071). In response to a comment from reviewer 2, we investigated whether there was a significant increase in alpha-tubulin detyrosination in our treatment conditions and found that there was not. We have removed the discussion of this possibility from the revised version. Please also see the response to comments raised by reviewer 2.

      Reviewer 1 (Recommendations for the authors):

      Major points: 

      (1) The authors measured kinesin-1-GFP intensities in a different cell line (Drosophila S2 cells) than what was used for the DHC and p50 measurements (HeLa cells). It is unclear if this provides a fair comparison given the cells provide different environments for the GFP. Although the differences may in fact be trivial, without somehow showing this is indeed a fair comparison, it should at least be noted as a caveat when interpreting relative intensity differences. Alternatively, the authors could compare DHC and p50 intensities to those measured from HeLa cells treated with taxol.

      Thank you for this suggestion. We conducted new rounds of imaging with the DHC-EGFP and p50-EGFP clones in conjunction with HeLa cells transiently expressing the human kinesin-1-EGFP and now present the datasets from the new experiments. Importantly, our new data was entirely consistent with the prior analyses: there was not a significant difference between the kinesin-1-EGFP dimer intensities and the DHC-EGFP puncta intensities, and there was a statistically significant difference in the intensity of p50 puncta, which were approximately half the intensity of the kinesin-1 and DHC. We have moved the old data comparing the intensities in S2 cells expressing kinesin-1-EGFP to Figure 3 - figure supplement 2 A-D and the new HeLa cell data is now shown in Figure 3 D-G.

      (2) Given the low number of observations (41-100 puncta), I think a scatter plot showing all data points would offer readers a more transparent means of viewing the single-molecule data presented in Figures 3A, B, C, and G. I also didn't see 'n' values for plots shown in Figure 3. 

      The box and whisker plots have now been replaced with scatter plots showing all data points. The accompanying ‘n’ values have been included in the figure 3 legend as well as the histograms in figures 1 and 2 that are represented in the comparative scatter plots.  

      (3) Given the authors have produced a body of work that challenges conclusions from another pre-print (Tirumala et al., 2022 bioRxiv) - specifically, that dynein is not processive in cells - I think it would be useful to include a short discussion about how their work challenges theirs. For example, one significant difference between the two experimental systems that may account for the different observations could simply be that the authors of the Tirumala study used a mouse DHC (in HeLa cells), which may not have the ability to assemble into active and processive dynein-dynactin-adaptor complexes. 

      Thank you for pointing this out! At the time we submitted our manuscript we were conflicted about citing a pre-print that had not been peer reviewed simply to point out the discrepancy. If we had done so at that time we would have proposed the exact potential technical issue that you have proposed here. However, at the time we felt it would be better for these issues to be addressed through the review process. Needless to say, we agree with your interpretation and now that the work is published (Tirumala et al. JCB, 2024) it is entirely appropriate to add a discussion on Tirumala et al. where contradictory observations were reported. 

      The following statement has been added to the manuscript: 

      “In contrast, a separate study (Tirumala et al., 2024) reported that dynein is not highly processive, typically exhibiting runs of very short duration (~0.6 s) in HeLa cells. A notable technical difference that may account for this discrepancy is that our study visualizes endogenously tagged human DHC, whereas Tirumala et al. characterized over-expressed mouse DHC in HeLa cells. Over-expression of the DHC may result in an imbalance of the subunits that comprise the active motor complex, leading to inactive, or less active complexes. Similarly, mouse DHC may not have the ability to efficiently assemble into active and processive dynein-dynactin-adaptor complexes to the same extent as human DHC.”

      Minor points: 

      (1) "Specifically, the adaptor BICD2 recruited a single dynein to dynactin while BICDR1 and HOOK3 supported assembly of a "double dynein" complex." It would be more accurate to say that dynein-dynactin complexes assembled with Bicd2 "tend to favor single dynein, and the Bicdr1 and Hook3 tend to favor two dyneins" since even Bicd2 can support assembly of 2 dynein-1 dynactin complexes (see Urnavicius et al, Nature 2018). 

      Thank you, the manuscript has been edited to reflect this point. 

      (2) "Human HeLa cells were engineered using CRISPR/Cas9 to insert a cassette encoding FKBP and EGFP tags in the frame at the 3' end of the dynein heavy chain (DYNC1H1) gene (SF1)." It is unclear to what "SF1" is referring. 

      SF1 is supplementary figure 1, which we have now clarified as being Figure 1 – figure supplement 1A.

      (3) "The SiR-Tubulin-treated cells were subjected to two-color TIRFM to determine if the DHC puncta exhibited motility and; indeed, puncta were observed streaming along MTs..." This sentence is strangely punctuated (the ";" is likely a typo?). 

      Thank you for pointing this out, the typo has been corrected and the sentence now reads:

      “The SiR-Tubulin-treated cells were subjected to two-color TIRFM and DHC-EGFP puncta were clearly observed streaming on SiR-Tubulin labeled MTs, which was especially evident on MTs that were pinned between the nucleus and the plasma membrane (Video 3).”

      (4) I am unfamiliar with the "MK" acronym shown above the molecular weight ladders in Figure 3H and I. Did the authors mean to use "MW" for molecular weight? 

      We intended this to mean MW and the typo has been corrected.

      (5) "This suggests that the cargos, which we presume motile dynein-dynactin puncta are bound to, any kinesins..." This sentence is confusing as written. Did the authors mean "and kinesins"? 

      Agreed. We have changed this sentence to now read: 

      “The velocity and low switching frequency of motile puncta suggest that any kinesin motors associated with cargos being transported by the dynein-dynactin visualized here are inactive and/or cannot effectively bind the MT lattice during dynein-dynactin-mediated transport in interphase HeLa cells.”

      Reviewer 2 (Recommendations for the authors):

      (1) I am confused as to why the authors introduced an FKBP tag to the DHC and no explanation is given. Is it possible this tag induces artificial dimerization of the DHC? 

      FKBP was tagged to DHC for potential knock-sideways experiments. Since the current cell line does not express the FKBP counterpart FRB, having FKBP alone in the cell line would not lead to artificial dimerization of DHC.

      (2) The authors use a high concentration of SiR-tubulin (1 µM) before washing it out. However, they observe strong effects on MT dynamics. The manufacturer states that concentrations below 100 nM don't affect MT dynamics, so I am wondering why the authors are using such a high amount that leads to cellular phenotypes.

      We would like to note that in our hands even 100 nM SiR-tubulin impacted MT dynamics if it was incubated for enough time to get a bright signal for imaging, which makes sense since drugs like docetaxel and taxol become enriched in cells over time. Thus, it was a trade-off between the extent/brightness of labeling and the effects on MT dynamics. We opted for shorter incubation with a higher concentration of SiR-Tubulin to achieve rapid MT labeling and efficient suppression of plus-end MT polymerization. This approach proved useful for our needs since the loss of the tip-tracking pool of DHC provided a clearer view of the motile population of MT-associated DHC.

      (3) The individual channels should be labeled in the supplemental movies. 

      They have now been labelled.

      (4) I would like to see example images and kymographs of the GFP-Kinesin-1 control used for fluorescent intensity analysis. Further, the authors use the mean of the intensity distribution, but I wonder why they don't fit the distribution to a Gaussian instead, as that seems more common in the field to me. Do the data fit well to a Gaussian distribution? 

      Example images and kymographs of the kinesin-1-EGFP control HeLa cells used for the updated fluorescent intensity analysis have now been added to the manuscript in Figure 3 - figure supplement 1. The kinesin-1-EGFP transiently expressed in HeLa cells exhibited a slower mean velocity and run length than the endogenously tagged HeLa dynein-dynactin. Regarding the distribution, we applied 6 normality tests to the new datasets acquired with DHC and p50 in comparison to human kinesin-EGFP in HeLa cells. While we are confident concluding that the data for p50 was normally distributed (p > 0.05 in 6/6), it was more difficult to reach conclusions about the normality of the datasets for kinesin-1 (p > 0.05 in 4/6) and DHC (p > 0.05 in 1/6). We have decided to report the data as scatter plots (per the suggestion in major point 2 by reviewer 1) in the new Figure 3G since it could be misleading to fit a non-normal distribution with a single Gaussian. We note that the likely non-normal distribution of the DHC data (since it “passed” only 1/6 normality tests) could reflect the presence of other populations (e.g. 1 DHC-EGFP in a motile punctum), but we could also not confidently conclude this since attempting to fit the data with a double Gaussian did not pass statistical muster. Indeed, as stated in the text, on lines 197-198 we do not exclude that the range of DHC intensities measured here may include sub-populations of complexes containing a single dynein dimer with one DHC-EGFP molecule.
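
The response does not name the six normality tests that were applied, so the following is only an illustration of what one such check could look like. It sketches a Jarque-Bera test, chosen here because it needs nothing beyond the standard library; the function name and the toy data are hypothetical and are not puncta intensities from the study. The p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2), so it is rough for small samples.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a test of normality."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square survival function with 2 d.o.f.
    return jb, p_value


# Toy data: one symmetric sample and one with a single extreme outlier.
symmetric = [float(v) for v in range(-10, 11)]
skewed = [1.0] * 50 + [100.0]

_, p_symmetric = jarque_bera(symmetric)
_, p_skewed = jarque_bera(skewed)
```

Under the convention used in the response, a sample "passes" a test when p > 0.05, that is, when normality is not rejected; a mixed population of puncta would tend to push such tests towards rejection.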

      Ultimately, we feel the safest conclusion is that there was not a statistically significant difference between the DHC and kinesin-1 dimers (p = 0.32) but there was a statistically significant difference between both the DHC and kinesin-1 dimers compared to the p50 (p values < 0.001), which was ~50% the intensity of DHC and kinesin-1. Altogether this leads us to the fairly conservative conclusion that DHC puncta contain at least one dimer while the p50 puncta likely contain a single p50-EGFP molecule.

      (5) The authors suggest the microtubules in the cells treated with SiR-tubulin may be more detyrosinated due to the treatment. Why don't they measure this using well-characterized antibodies that distinguish tyrosinated/detyrosinated microtubules in cells treated or not with SiR-tubulin? 

      At your suggestion, we carried out the experiment and found that under our labeling conditions there was not a notable difference in microtubule detyrosination between DMSO- and SiR-Tubulin-treated cells. Thus, we have removed this caveat from the revised manuscript.

      (6) "While we were unable to assess the relative expression levels of tagged versus untagged DHC for technical reasons." Please describe the technical reasons for the inability to measure DHC expression levels for the reader.

      We made several attempts to quantify the relative amounts of untagged and tagged protein by Western blotting. The high molecular weight of DHC (~500kDa) makes it difficult to resolve it on a conventional mini gel. We attempted running a gradient mini gel (4%-15%), and doing a western blot; however, we were still unable to detect DHC. To troubleshoot, the experiments were repeated with different dilutions of a commercially available antibody and varying concentrations of cell lysate; however, we were unable to obtain a satisfactory result. 

      We hold the view that even if it had worked, it would have been difficult to detect a relatively small difference between the untagged (MW = 500 kDa) and tagged DHC (MW = 527 kDa) by western blot. We have added language to this effect in the revised manuscript.

      Reviewer #3 (Public Review):

      (1). CRISPR-edited HeLa clones: 

      (i) The authors indicate that both the DHC-EGFP and p50-EGFP lines are heterozygous and that the level of DHC-EGFP was not measured due to technical difficulties. However, quantification of the relative amounts of untagged and tagged DHC needs to be performed - either using Western blot, immunofluorescence or qPCR comparing the parent cell line and the cell lines used in this work. 

      See response to reviewer 2 above. 

      (ii) The localization of DHC predominantly at the plus tips (Fig. 1A) is at odds with other work where endogenous or close-to-endogenous levels of DHC were visualized in HeLa cells and other non-polarized cells like HEK293, A-431 and U-251MG (e.g., OpenCell (https://opencell.czbiohub.org/target/CID001880), Human Protein Atlas, and https://www.biorxiv.org/content/10.1101/2021.04.05.438428v3). The authors should perform immunofluorescence of DHC in the parental cells and DHC-EGFP cells to confirm there are no expression artifacts in the latter. Additionally, a comparison of the colocalization of DHC with EB1 in the parental and DHC-EGFP and p50-EGFP lines would be good to confirm MT plus-tip localisation of DHC in both lines.

      The microtubule (MT) plus-tip localization of DHC was already observed in the 1990s, as evidenced by publications such as (PMID:10212138) and (PMID:12119357), which were further confirmed by Kobayashi and Murayama  in 2009 (PMID:19915671). We hold the view that further investigation into this localization is not worthwhile since the tip-tracking behavior of DHC-dynactin has been long-established in the field.

      (iii) It would also be useful to see entire fields of view of cells expressing DHC-EGFP and p50-EGFP (e.g. in Spinning Disk microscopy) to understand if there is heterogeneity in expression. Similarly, it would be useful to report the relative levels of expression of EGFP (by measuring the total intensity of EGFP fluorescence per cell) in those cells employed for the analysis in the manuscript.

      Representative images of fields have been added as Figure 1 - figure supplement 1B and Figure 2 - figure supplement 1 in the revised manuscript. We did not see drastic cell-to-cell variation of expression within the clonal cell lines.

      (iv) Given that the authors suspect there is differential gene regulation in their CRISPR-edited lines, it cannot be concluded that the DHC-EGFP and p50-EGFP punctae tracked are functional and not piggybacking on untagged proteins. The authors could use the FKBP part of the FKBP-EGFP tag to perform knock-sideways of the DHC and p50 to the plasma membrane and confirm abrogation of dynein activity by visualizing known dynein targets such as the Golgi (Golgi should disperse following recruitment of EGFP-tagged DHC-EGFP or p50-EGFP to the PM), or EGF (movement towards the cell center should cease). 

      Despite trying different concentrations and extensive troubleshooting, we were not able to replicate the reported observations of Ciliobrevin D or Dynarrestin during mitosis. We would like to emphasize that the velocity (1.2 μm/s) of dynein-dynactin complexes that we measured in HeLa cells was comparable to those measured in iNeurons by Fellows et al. (PMID: 38407313) and for unopposed dynein under in vitro conditions. 

      (2) TIRFM and analysis: 

      (i) What was the rationale for using TIRFM given its limitation of visualization at/near the plasma membrane? Are the authors confident they are in TIRF mode and not HILO, which would fit with the representative images shown in the manuscript? 

      To avoid overcrowding, it was important to image the MT tracks that were pinned between the nucleus and the plasma membrane. It is unclear to us why the reviewer feels that true TIRFM could not be used to visualize the movement of dynein-dynactin on this population of MTs, since the plasma membrane is ~3-5 nm thick and a MT is ~25-27 nm in diameter, all of which would fall well within the 100-200 nm excitable range of the evanescent wave produced by TIRF. While we feel TIRF can effectively visualize dynein-dynactin motility in cells, we have mentioned the possibility that some imaging may be HILO microscopy in the materials and methods.

      (ii) At what depth are the authors imaging DHC-EGFP and p50-EGFP? 

      The imaging depth of traditional TIRFM is limited to around 100-200 nm. In adherent interphase HeLa cells the nucleus is in very close proximity (nanometer not micron scale) to the plasma membrane with some cytoskeletal filaments (actin) and microtubules positioned between the plasma membrane and the nuclear membrane. The fact that we were often visualizing MTs positioned between the nucleus and the membrane makes us confident that we were imaging at a depth (100 - 200nm) consistent with TIRFM. 
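
      The 100-200 nm range quoted above can be related to the standard expression for the penetration depth of an evanescent field, d = λ / (4π · sqrt(n1² sin²θ − n2²)), where n1 and n2 are the refractive indices of the coverslip and the sample and θ is the incidence angle. The following is a minimal Python sketch of that formula; the function name and every numerical value (488 nm excitation, n1 = 1.518, n2 = 1.37, θ = 70°) are illustrative assumptions and are not taken from the manuscript.

```python
import math


def evanescent_penetration_depth_nm(wavelength_nm, n_glass, n_sample, incidence_deg):
    """Return the 1/e intensity decay depth (nm) of a TIRF evanescent field.

    Uses d = wavelength / (4 * pi * sqrt(n_glass^2 * sin^2(theta) - n_sample^2)).
    Raises ValueError when the angle is at or below the critical angle, because
    no total internal reflection (and hence no evanescent field) occurs there.
    """
    theta = math.radians(incidence_deg)
    term = (n_glass * math.sin(theta)) ** 2 - n_sample ** 2
    if term <= 0:
        raise ValueError("incidence angle is not above the critical angle")
    return wavelength_nm / (4 * math.pi * math.sqrt(term))


def relative_excitation(z_nm, depth_nm):
    """Fraction of the interface intensity remaining at height z above the glass."""
    return math.exp(-z_nm / depth_nm)


# Hypothetical settings, chosen only to illustrate the order of magnitude.
depth = evanescent_penetration_depth_nm(488, 1.518, 1.37, 70)
print(f"penetration depth: {depth:.0f} nm")
print(f"relative excitation 30 nm above the glass: {relative_excitation(30, depth):.2f}")
```

      With these assumed values the depth comes out on the order of 100 nm, and it grows as θ approaches the critical angle, which is one reason the boundary between TIRF and HILO illumination is not sharp in practice.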

      (iii) The authors rely on manual inspection of tracks before analyzing them in kymographs - this is not rigorous and is prone to bias. They should instead track the molecules using single particle tracking tools (eg. TrackMate/uTrack), and use these traces to then quantify the displacement, velocity, and run-time. 

      Although automated single particle tracking tools offer several benefits, including reduced human effort and scalability for large datasets, they often rely on specialized training datasets and do not generalize well to every dataset. We contend that in complex cellular environments human intervention is often necessary to achieve a reliable dataset. Considering the nature of our data, we felt it was necessary to manually process the time-lapses. 

      (iv) It is unclear how the tracks that were eventually used in the quantification were chosen. Are they representative of the kind of movements seen? Kymographs of dynein movement along an entire MT/cell needs to be shown and all punctae that appear on MTs need to be tracked, and their movement quantified. 

      Considering the densely populated environment of a cell, it would be nearly impossible to quantify all the datasets. We selected tracks for quantification, focusing on areas where MTs were pinned between the nucleus and plasma membrane, where we could track the movement of a single dynein molecule and where the surroundings were relatively less crowded. 

      (v) What is the directionality of the moving punctae? 

      In our experience, cells rarely organized their MTs in the textbook radial MT array meaning that one could not confidently conclude that “inward” movements were minus-end directed. Microtubule polarity was also not able to be determined for the MTs positioned between the plasma membrane and the nucleus on which many of the puncta we quantified were moving. It was clear that motile puncta moving on the same MT moved in the same direction with the exception of rare and brief directional switching events. What was more common than directional switching on the same MT were motile puncta exhibiting changes in direction at sharp (sometimes perpendicular) angles indicative of MT track switching, which is a well-characterized behavior of dynein-dynactin (See DOI: 10.1529/biophysj.107.120014).

      (vi) Since all the quantification was performed on SiR tubulin-treated cells, it is unclear if the behavior of dynein observed here reflects the behavior of dynein in untreated cells. Analysis of untreated cells is required. 

      It was important to quantify SiR tubulin-treated cells because SiR-Tubulin is a docetaxel derivative, and its addition suppressed plus-end MT polymerization resulting in a significant reduction in the DHC tip-tracking population and a clearer view of the motile population of MT-associated DHC puncta. Otherwise, it was challenging to reliably identify motile puncta given the abundance of DHC tip-tracking populations in untreated cells.  

      (3) Estimation of stoichiometry of DHC and p50 

      Given that the punctae of DHC-EGFP and p50 seemingly bleach on MT before the end of the movie, the authors should use photobleaching to estimate the number of molecules in their punctae, either by simple counting the number of bleaching steps or by measuring single-step sizes and estimating the number of molecules from the intensity of punctae in the first frame. 

      Comparing the fluorescence intensity of a known molecule (in our case a kinesin-1-EGFP dimer) to calculate the numbers of an unknown protein molecule (in our case Dynein or p50) is a widely accepted technique in the field. For example, refer to PMID: 29899040. To accurately estimate the stoichiometry of DHC and p50 and address the concerns raised by other reviewers, we expressed the human kinesin-EGFP in HeLa cells and analyzed the datasets from new experiments. We did not observe any significant differences between our old and new datasets.
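
      The arithmetic behind this kind of intensity calibration can be sketched in a few lines of Python. This is a simplified illustration, not the analysis pipeline used in the manuscript: the function name is hypothetical, the intensities are made-up numbers, and the sketch assumes background-subtracted intensities acquired under identical imaging conditions with a standard that carries two EGFP molecules per particle.

```python
from statistics import mean


def estimate_fluorophores(punctum_intensities, standard_intensities,
                          fluorophores_per_standard=2):
    """Estimate EGFP copies per punctum by ratio to a calibration standard.

    punctum_intensities: background-subtracted intensities of the unknown puncta.
    standard_intensities: background-subtracted intensities of the standard
        (assumed here to be a dimer, i.e. two fluorophores per particle).
    """
    if not punctum_intensities or not standard_intensities:
        raise ValueError("both intensity lists must be non-empty")
    # Mean intensity contributed by a single fluorophore.
    per_fluorophore = mean(standard_intensities) / fluorophores_per_standard
    return [value / per_fluorophore for value in punctum_intensities]


# Made-up example values in arbitrary units.
standard = [200.0, 220.0, 180.0]
puncta = [100.0, 210.0, 400.0]
print(estimate_fluorophores(puncta, standard))
```

      In practice the distribution of the standard's intensities matters as much as its mean, so reporting the spread alongside the estimated copy number would be a natural extension.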

      (4) Discussion of prior literature 

      Recent work visualizing the behavior of dyneins in HeLa cells (DOI:  10.1101/2021.04.05.438428), which shows results that do not align with observations in this manuscript, has not been discussed. These contradictory findings need to be discussed, and a more objective assessment of the literature in general needs to be undertaken.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study focuses on the bacterial metabolite TMA, generated from dietary choline. These authors and others have previously generated foundational knowledge about the TMA metabolite TMAO, and its role in metabolic disease. This study extends those findings to test whether TMAO's precursor, TMA, and its receptor TAAR5 are also involved and necessary for some of these metabolic phenotypes. They find that mice lacking the host TMA receptor (Taar5-/-) have altered circadian rhythms in gene expression, metabolic hormones, gut microbiome composition, and olfactory and innate behavior. In parallel, mice lacking bacterial TMA production or host TMA oxidation have altered circadian rhythms.

      Strengths:

      These authors use state-of-the-art bacterial and murine genetics to dissect the roles of TMA, TMAO, and their receptor in various metabolic outcomes (primarily measuring plasma and tissue cytokine/gene expression). They also follow a unique and unexpected behavioral/olfactory phenotype. Statistics are impeccable.

      Weaknesses:

      Enthusiasm for the manuscript is dampened by some ambiguous writing and the presentation of ideas in the introduction, both of which could easily be improved upon revision.

      We apologize for the abbreviated and ambiguous writing style in our original submission. Given Reviewer 2 also suggested reorganizing and rewriting certain parts, we have spent time to remove ambiguity by adding additional points of clarification and adding more historical context to justify studying TMA-TAAR5 signaling in regulating host circadian rhythms. We have also reorganized the presentation of data aligned with this.

      Reviewer #2 (Public review):

      Summary:

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. However, several sections would benefit from improved clarity, organization, and mechanistic depth to fully support the authors' conclusions.

      Strengths:

      (1) The manuscript addresses an important and timely topic in host-microbe communication and circadian biology.

      (2) The studies employ multiple complementary models, e.g., Taar5 knockout mice, microbial mutants, which enhance the depth of the investigation.

      (3) The integration of behavioral, hormonal, microbial, and transcript-level data provides a multifaceted view of the observed phenotype.

      (4) The identification of olfactory-linked circadian changes in the context of gut microbes adds a novel perspective to the field.

      Weaknesses:

      While the manuscript presents compelling data, several weaknesses limit the clarity and strength of the conclusions.

      (1) The presentation of hormonal, cytokine, behavioral, and microbiome data would benefit from clearer organization, more detailed descriptions, and functional grouping to aid interpretation.

      We appreciate this comment and have reorganized the data to improve functional grouping and readability. We have also added additional detail to descriptions of the data in the revised figure legends and results.

      (2) Some transitions, particularly from behavioral to microbiome data, are abrupt and would benefit from better contextual framing.

      We agree with this comment, and have added additional language to provide smoother transitions. This in many cases brings in historical context of why we focused on both behavioral and microbiome alterations in this body of work.

      (3) The microbial rhythmicity analyses lack detail on methods and visualization, and the sequencing metadata (e.g., sample type, sex, method) are not clearly stated.

      We apologize for this, and have now added more detail in our methods, figures, and figure legends to ensure the reader can easily understand sample type, sex, and the methods used. 

      (4) Several figures are difficult to interpret due to dense layouts or vague legends, and key metabolites and gene expression comparisons are either underexplained or not consistently assessed across models.

      In line with the previous comment, we have now added more detail in our methods, figures, and figure legends to provide clear information. We have also provided additional data showing the same key metabolites, hormones, and gene expression alterations in each model wherever the same endpoints were measured.

      (5) Finally, while the authors suggest a causal role for TAAR5 and its ligand in circadian regulation, the current data remain correlative; mechanistic experiments or stronger disclaimers are needed to support these claims.

      We agree with this comment, and as a result have removed any language causally linking TMA and TAAR5 together in circadian regulation. Instead, we only state the findings in each model and refrain from overinterpreting.

      Reviewer #3 (Public review):

      Summary:

      Deletion of the TMA-sensor TAAR5 results in circadian alterations in gene expression, particularly in the olfactory bulb, plasma hormones, and neurobehaviors.

      Strengths:

      Genetic background was rigorously controlled.

      Comprehensive characterization.

      Weaknesses:

      The weaknesses identified by this reviewer are minor.

      Overall, the studies are very nicely done. However, despite careful experimentation, I note that even the controls vary considerably in their gene expression, etc, across time (eg, compare control graphs for Cry1 in 1B, 4B). It makes me wonder how inherently noisy these measurements are. While I think that the overall point that the Taar5 KO shows circadian changes is robust, future studies to dissect which changes are reproducible over the noise would be helpful.

      We thank the reviewer for this insightful comment. We completely agree that there are clear differences in the circadian data in experiments from Taar5<sup>-/-</sup> mice and those from gnotobiotic mice where we have genetically deleted CutC. Although the data from Taar5<sup>-/-</sup> mice show robust circadian rhythms, the data from mice where microbial CutC is altered have inherently more “noise”. We attribute some of this to the fact that the mice in the Taar5<sup>-/-</sup> experiment have a fully intact and diverse gut microbiome, whereas the gnotobiotic study with CutC manipulation includes only a 6-member microbiome community that does not represent the normal microbiome diversity in the gut. This defined synthetic community was used as a rigorous reductionist approach, but likely affected the normal interactions between a complex intact gut microbiome and host circadian rhythms. We have added some additional discussion to indicate this in the limitations section of the manuscript.

      Impact:

      These data add to the growing literature pointing to a role for the TMA/TMAO pathway in olfaction and neurobehavior.

      Reviewer #1 (Recommendations for the authors):

      I suggest a revision of the writing and organization. The potential impact of the study after reading the introduction is unclear. One example, in the intro, " TMAO levels are associated with many human diseases including diverse forms of CVD5-12, obesity13,14, type 2 diabetes15,16, chronic kidney disease (CKD)17,18, neurodegenerative conditions including Parkinson's and Alzheimer's disease19,20, and several cancers21,22" It would be helpful to explain how the previous literature has distinguished that the driver of these phenotypes is TMA/TMAO and not increased choline intake. Basically, for a TMA/O novice reader, a more detailed intro would be helpful.

      We appreciate this insightful comment and have now provided a more expansive historical context for the reader regarding the effects of choline consumption (which impacts many things, including choline, acetylcholine, phosphatidylcholine, TMA, TMAO, etc) versus the primary effects of TMA and TMAO.

      There were also many uses of vague language (regulation/impact/etc). Directionality would be super helpful.

      We thank the reviewer for this recommendation and have improved the language as suggested to show the directionality of our findings. The terms regulation, impact, shape, etc. are used only when we describe multiple variables changing at the same time over the course of a 24-hour circadian period (some increased and some decreased).

      Reviewer #2 (Recommendations for the authors):

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. However, several sections would benefit from improved clarity, organization, and mechanistic depth to fully support the authors' conclusions. Below are specific major and minor suggestions intended to enhance the presentation and interpretation of the data.

      Major suggestions:

      (1) Consider adding a schematic/model figure as Panel A early in the manuscript to help readers understand the experimental conditions and major comparisons being made.

      We thank the reviewer for this recommendation and have added a graphical abstract figure to help the reader understand the major comparisons being made. 

      (2) Could the authors present body weight and food intake characteristics in Taar5 KO vs. WT animals?

      We have added body weight data as requested in Figure 1, Figure supplement 1. Although we have not stressed these mice with a high fat diet for these behavioral studies, under chow-fed conditions studied here we did not find any significant differences in body weight. Given no difference in body weight, we did not collect data on food consumption and have mentioned this as a limitation in the discussion.  

      (3) Several figures, especially Figures 3 and 4, and Supplemental Figures, would benefit from more structured organization and expanded legends. Grouping related data into thematic panels (e.g., satiety vs. appetite hormones, behavioral domains) may help improve readability.

      We appreciate the reviewer’s thoughtful comments and agree that reorganization would improve clarity. We have reorganized figures to improve clarity and have expanded the figure legends to provide more detail on experimental methods. 

      (4) Clarify and expand the description of hormonal and cytokine changes. For instance, the phrase "altered rhythmic levels" is vague - do the authors mean dampened, phase-shifted, enhanced, etc., relative to WT controls?

      Given a similar suggestion was made by Reviewer 1, we have provided more precise language focused on directionality and which specific endpoints we are referring to. For anything looking at circadian rhythms, the revised manuscript includes specific indications when we are discussing mesor, amplitude, and acrophase alterations. The terms regulation, impact, shape etc. are used only when we describe multiple complex variables changing at the same time over the time course of a 24-hour circadian period (some increased and some decreased).

      (5) Consider grouping hormones and cytokines functionally (e.g., satiety vs. appetite-stimulating, pro- vs. anti-inflammatory) to better interpret how these changes relate to the KO phenotype.

      We thank the reviewer for this recommendation, and have re-organized figure panels to reflect this.

      (6) Please provide a more detailed description of the behavioral results, particularly those in Supplemental Figure 2.

      We have expanded the methods description in the revised figure legends and have also added a more detailed description of the behavioral results.

      (7) As with hormonal data, behavioral outcomes would be easier to follow if organized thematically (e.g., locomotor activity, anxiety-like behavior, circadian-related behavior), especially for readers less familiar with behavioral assays.

      We appreciate this reviewer’s comment and agree that we can better group our data to show how each test is associated with the type of behavior it assesses. As a result, we have reorganized the behavioral data into broad categories such as olfactory-related, innate, cognitive, depressive/anxiety-like, or social behaviors. We have also added new data in each of these behavioral categories to provide a more comprehensive understanding of the behavioral alterations seen in Taar5<sup>-/-</sup> mice.

      (8) The following statement needs clarification: "Also, it is important to note that many behavioral phenotypes examined, including tests not shown, were unaltered in Taar5-/- mice (Figures S2G, S2H, and S2I)." Consider rephrasing to explicitly state the intended message: are the authors emphasizing a lack of behavioral phenotype, or highlighting specific unaltered aspects?

      We apologize for this confusing statement, and have changed the verbiage to improve readability. To expand the comprehensive nature of this study, we also now include the tests that were “not shown” in the original submission to provide a more comprehensive understanding of behavioral alterations seen in Taar5<sup>-/-</sup> mice. These new data are included as 6 different figure supplements to main Figure 2.

      (9) The transition from behavior to microbiome data feels abrupt. Can the authors better explain whether the behavioral changes are thought to result from gut microbial function, independent of TMA-Taar5 signaling?

      We apologize for the poor transitions in our writing style. We have spent time explaining the previous findings linking the TMA pathway to circadian reorganization of the gut microbiome (mostly coming from our original paper Schugar R, et al. 2022, eLife) and how this correlates with behavioral phenotypes. At this point it is difficult to know whether the microbiome changes are driving the behavioral changes or, vice versa, whether central TAAR5 signaling is altering oscillations in the gut microbiome; we therefore present our findings here as a framework for follow-up studies to address these questions more precisely. It is important to note that our experiment using defined-community gnotobiotic mice with or without the capacity to produce TMA (i.e. CutC-null community) clearly shows that microbial TMA production can impact host circadian rhythms in the olfactory bulb. Additional experiments beyond the scope of this work will be required to test which phenotypes originate from TMA-TAAR5 signaling versus broader effects of the restructured gut microbiome.

      (10) For Figure 3A, please expand the microbiome results with more granularity:

      (a) Indicate in the Results section whether the sequencing method was 16S amplicon or metagenomic.

      Sequencing was done using 16S rRNA amplicon sequencing using methods published by our group (PMID: 36417437, PMID: 35448550).

      (b) State whether samples were from males, females, or a mix. 

      We have indicated that all mice from Figure 1 were male mice in the revised figure legend.

      (c) Clarify whether beta diversity is based on phylogenetic or non-phylogenetic metrics. Consider using both types if not already done.

      Beta diversity was analyzed using the Bray-Curtis dissimilarity index as the metric. Details have been included in the methods section.
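
      For readers less familiar with the metric, Bray-Curtis is an abundance-based (non-phylogenetic) dissimilarity: for two samples it is the sum of absolute differences in taxon counts divided by the sum of all counts, ranging from 0 (identical) to 1 (no shared taxa). The short Python sketch below shows the calculation on made-up counts; the function name and example vectors are hypothetical and do not reflect the authors' pipeline.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples.

    counts_a and counts_b are abundance vectors over the same ordered set of
    taxa. Returns sum(|a_i - b_i|) / sum(a_i + b_i).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must be described over the same taxa")
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        raise ValueError("both samples are empty")
    difference = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    return difference / total


# Made-up genus-level counts for two cecal samples.
sample_1 = [10, 0, 5]
sample_2 = [5, 5, 5]
print(bray_curtis(sample_1, sample_2))
```

      Because the metric ignores phylogenetic relatedness between taxa, it answers the reviewer's question about metric type directly; a phylogenetic alternative such as UniFrac would require a tree in addition to the count table.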

      (d) Make lines partially transparent in the Beta-diversity plot so that individual points are visible.

      We have now updated the Beta-diversity plot with individual points visualized.

      (e) Clarify what percentage of variation in the Beta-diversity plot is explained by CCA1, and whether this low percentage suggests minimal community-level differences.

      We have updated the Beta-diversity plot to include the R<sup>2</sup> and p-values associated with these data.

      (f) Confirm if the y-axis on the Beta-diversity plot should be labeled CCA2 rather than "CCAA 1".

      We appreciate this comment, given it identified a typographical error in the plot. The revised figure now includes the proper label of CCA2 instead of CCAA 1.

      (11) For Figure 3B:

      (a) Provide a description of the taxonomy plot in the results.

      We have added a description of the taxonomy plot in the revised results section.

      (b) Add phylum-level labels and enlarge the legend to improve the readability of genus-level data.

      We agree this is a good suggestion so have enlarged the legend for the genus-level data and have also added phylum-level plots as well in the revised manuscript in Figure 3, figure supplement 1.

      (12) Rhythmicity of the microbiome is central to the manuscript. The current approach of comparing relative abundance at discrete time points is limiting.

      We thank the reviewer for this comment. We agree that discrete time points are not enough to describe circadian rhythmicity. In addition to comparing genotypes at discrete time points, we also used a rigorous cosinor analysis to plot the data over a 24-hour time period, and those differences are shown in the figure itself as well as in Table 1. 

      (a) Please describe how rhythmicity was determined, e.g., what data or statistical method supports the statement: "Taar5-/- mice showed loss of the normal rhythmicity for Dubosiella and Odoribacter genera yet gained in amplitude of rhythmicity for Bacteroides genera (Figure 3 and S3)."

      We appreciate this reviewer comment. Rhythmicity was determined using a cosinor analysis implemented in an R program. Cosinor analysis is a statistical method used to model and analyze rhythmic patterns in time-series data, typically assuming a sinusoidal (cosine) shape. It estimates key parameters like mesor (mean level), amplitude (height of oscillation), and acrophase (timing of the peak), making it especially useful in fields like chronobiology and circadian rhythm research. We have used this in previous research to describe circadian rhythms. We do plan to improve the language regarding the directionality of these circadian changes. 
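
      To make the three parameters concrete, a single-component cosinor model can be written as y(t) = M + β cos(ωt) + γ sin(ωt) with ω = 2π/period, which is linear in M, β and γ and can therefore be fitted by ordinary least squares; the amplitude is sqrt(β² + γ²) and the peak time is atan2(γ, β)/ω. The Python sketch below illustrates this on synthetic data. It is not the R program used in the study: the function names are hypothetical, the period is assumed fixed at 24 hours, and the zero-amplitude significance test that a full cosinor analysis reports is omitted.

```python
import math


def _solve_linear(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design; sample more distinct time points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (m[i][n] - tail) / m[i][i]
    return solution


def fit_cosinor(times_h, values, period_h=24.0):
    """Least-squares single-component cosinor fit.

    Returns (mesor, amplitude, peak_time_h), where peak_time_h is the time of
    the fitted maximum within one period (the acrophase expressed in hours).
    """
    omega = 2 * math.pi / period_h
    rows = [(1.0, math.cos(omega * t), math.sin(omega * t)) for t in times_h]
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = _solve_linear(xtx, xty)
    amplitude = math.hypot(beta, gamma)
    peak_time_h = (math.atan2(gamma, beta) / omega) % period_h
    return mesor, amplitude, peak_time_h


# Synthetic series sampled every 4 h: mesor 10, amplitude 3, peak at hour 6.
times = [0, 4, 8, 12, 16, 20]
series = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
mesor, amplitude, peak = fit_cosinor(times, series)
print(f"mesor={mesor:.2f} amplitude={amplitude:.2f} peak at {peak:.2f} h")
```

      Comparing genotypes then amounts to comparing the fitted mesor, amplitude and peak time (with their uncertainties) between groups, which is what terms such as "dampened" or "phase-shifted" refer to.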

      (b) Supplemental Figure S3 needs reorganization to highlight key findings. It's not currently clear how taxa are arranged or what trends are being shown.

      The data in Figure S3 show the entire 24-hour time course of the cecal taxa that were significantly altered for at least one time point between Taar5<sup>+/+</sup> and Taar5<sup>-/-</sup> mice. Given we showed time point-specific alterations in the Main Figure 3, we thought these more expansive plots would be important to depict how the circadian rhythms were altered.

      (c) Supplemental Table 1, which includes 16S features, should be referenced and discussed in the microbiome section.

      We have now referenced and discussed Supplemental Table 1 which includes all cosinor statistics for microbiome and other data presented in circadian time point studies.

      (13) Did the authors quantify the 16S rRNA gene via RT-PCR to determine if this was similar between KO and WT over the 24-hour period?

      We did not quantify 16S rRNA gene via RT-PCR, but do not think adding this will change our overall interpretations.

      (14) Reorganize Figure 4 to align with the order of results discussed, starting with TMA and TMAO, followed by related metabolites like choline, L-carnitine, and gamma-butyrobetaine.

      We thank the reviewer for this comment. We have chosen this organization because it is ordered from substrates (choline, L-carnitine, and betaine) to the microbe-associated products (TMA then TMAO). We will improve the writing associated with this figure to clearly explain this organization.

      (a) Although the changes in the latter metabolites are more modest, they may still have physiological relevance. Could the authors comment on their significance?

      We appreciate this reviewer comment and agree. We have expanded the results and discussion to address this.

      (15) The authors note similarities in circadian gene expression between Taar5 KO mice and Clostridium sporogenes WT vs. ΔcutC mice, but the gene patterns are not consistent.

      (a) Can the authors clarify what conclusions can reasonably be drawn from this comparison?

      We hesitate to make definitive conclusions in the manuscript on why the gene patterns are not consistent, because it would be speculation. However, one major factor likely driving differences is the diversity of the gut microbiome in the different studies. For instance, in the studies using Taar5<sup>+/+</sup> and Taar5<sup>-/-</sup> mice there is a very diverse microbiome in these conventionally housed mice. In contrast, by design the experiment using Clostridium sporogenes WT vs. ΔcutC communities is a reductionist approach that allows us to genetically define TMA production. In these gnotobiotic mice, the simplified community has very limited diversity and this likely alters the host circadian rhythms in gene expression quite dramatically. Although it is impossible to directly compare the results between these experiments given the difference in microbiome diversity, there are clearly alterations in host gene expression when we manipulate TMA production (i.e. ΔcutC community) or TMA sensing (i.e. Taar5<sup>-/-</sup>). 

      (16) Were circadian and metabolic genes (e.g., Arntl, Cry1, Per2, Pemt, Pdk4) also analyzed in brown adipose tissue of Taar5 KO mice, and how do these results compare to the Clostridium models?

      We thank the reviewer for this comment. Unfortunately, we did not collect brown adipose tissue in our original Taar5 study. We plan on doing this in future follow-up studies of cold-induced thermogenesis that are beyond the scope of this manuscript. However, we have decided to include data from our two-timepoint Taar5 study, which looks at ZT2 (9am) and ZT14 (9pm). There are clear differences in circadian genes between these timepoints. 
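
      Zeitgeber time (ZT) counts hours from lights-on (ZT0), so the mapping given above (ZT2 = 9am, ZT14 = 9pm) implies lights-on at 7am. The small Python helper below makes that conversion explicit; the function name is hypothetical and the 07:00 lights-on default is inferred from the two time points quoted here rather than stated independently in the response.

```python
def zt_to_clock(zt_hours, lights_on_hour=7):
    """Convert Zeitgeber time (hours after lights-on) to 24-hour clock time.

    lights_on_hour defaults to 7, inferred from ZT2 = 9am and ZT14 = 9pm.
    """
    # Work in whole minutes to avoid floating-point edge cases at the hour.
    total_minutes = round((lights_on_hour + zt_hours) * 60) % (24 * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


for zt in (2, 14):
    print(f"ZT{zt} -> {zt_to_clock(zt)}")
```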

      (17) To allow a more direct comparison, please ensure the same cytokines (e.g., IL-1β, IL-2, TNF-α, IFN-γ, IL-6, IL-33) are reported for both the Taar5 KO and microbial models.

      We thank the reviewer for this comment and now include data from the same cytokines for each study.

      (18) What was the defined microbial community used to colonize germ-free mice with C. sporogenes strains? Did this community exhibit oscillatory behavior?

      To define TMA levels using a genetically tractable model of a defined microbial community, we leveraged access to the community originally described by our collaborator Dr. Federico Rey (University of Wisconsin – Madison) (PMID: 25784704). We chose this community because it provides some functional metabolic diversity and is well known to allow for sufficient versus deficient TMA production. We are thankful for the reviewer's comments about the oscillatory behavior of this defined community and, to be responsive, have performed sequencing to detect the species over time. These data are now included in the revised manuscript and show that there are clear differences in the oscillatory behavior of the defined community members. These data provide additional support that bacterial TMA production not only alters host circadian rhythms, but also the rhythmic behavior of gut bacteria themselves, which has never been described before.

      (19) Can the authors explain the rationale for measuring additional metabolites such as tryptophan, indole acetic acid, phenylacetic acid, and phenylacetylglycine? How are these linked to CutC gene function or Taar5 signaling?

      We appreciate that this could be confusing, but have included other gut microbial metabolites to be as comprehensive as possible. This is important because we have found in other gnotobiotic studies where we genetically altered metabolite production that altering one gut microbe-derived metabolite can cause unexpected alterations in other distinct classes of microbe-derived metabolites (PMID: 37352836). This is likely due to the fact that complex microbe-microbe and microbe-host interactions work together to define systemic levels of circulating metabolites, influencing both the production and turnover of distinct and unrelated metabolites.

      (20) The authors make several strong claims suggesting that loss of Taar5 or disruption of its ligand directly alters the circadian gene network. However, the current data are correlative. The authors should clarify that these findings demonstrate associations rather than direct causal effects, unless additional mechanistic evidence is provided. Approaches such as studies conducted in constant darkness, measurements of wheel-running behavior, or analyses that control for potential confounding factors, e.g., inflammation or metabolic disruption, would help establish whether the observed changes in clock gene expression are primary or secondary effects. The authors are encouraged to either soften these causal claims or acknowledge this limitation explicitly in the discussion.

      We thank the reviewer for this comment. We agree and have softened our language about direct effects of TMA via TAAR5 because we agree the data presented here are correlative only. 

      Minor suggestions:

      (1) Avoid repetitive phrases such as "it is important to note..." for improved flow. Rephrasing these instances will enhance readability.

      We thank the reviewer for this suggestion and have deleted such repetitive phrases.  

      (2) For Figure 2, remove interpretations above the graphs and use simple, descriptive panel labels, similar to those in Supplemental Figure 2.

      We have removed these interpretations as suggested, but have retained descriptive panel labels to help the reader understand what type of data are being presented.

      Reviewer #3 (Recommendations for the authors):

      Minor:

      In Figure 1D, UCP1 does not appear to be significantly changed.

      We thank the reviewer for this comment and agree that UCP1 gene expression is not significantly altered. However, given the key role that UCP1 plays in white adipose tissue beiging, which is suppressed by the TMAO pathway, we think it is critical to show that this effect appears unaffected by perturbed TMA-TAAR5 signaling.

      It would be helpful, in the discussion, to summarize any consistent changes across Taar5 KO, CutC deletion, and FMO3 deletion.

      We have added this to the discussion, but as discussed above we hesitate to make strong interpretations about consistency between the models because the microbiome diversity is so different between the studies, and we did not measure all endpoints in both models.

      For the Cosinor analysis, it may be helpful to remove the p-values that are >0.05 from the figures.

      We have now removed any non-significant p-values that are associated with our figures. 
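      For readers unfamiliar with cosinor analysis, the following is a minimal sketch of a single-component cosinor fit, assuming a fixed 24 h period and ordinary least squares on the model y = mesor + beta*cos(wt) + gamma*sin(wt). The function names (`solve_linear`, `cosinor_fit`) and the example values are hypothetical and are not taken from the manuscript; the zero-amplitude test that yields the p-values discussed above is not included.

```python
import math


def solve_linear(a, b):
    """Solve a small linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def cosinor_fit(times_h, values, period_h=24.0):
    """Return (mesor, amplitude, peak time in hours) of the fitted rhythm."""
    w = 2.0 * math.pi / period_h
    rows = [(1.0, math.cos(w * t), math.sin(w * t)) for t in times_h]
    # Normal equations (X^T X) b = X^T y for the three coefficients.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve_linear(xtx, xty)
    amplitude = math.hypot(beta, gamma)
    # The fitted curve equals amplitude * cos(w * (t - peak_h)) + mesor.
    peak_h = (math.atan2(gamma, beta) / w) % period_h
    return mesor, amplitude, peak_h


# Hypothetical noise-free series sampled every 4 h, peaking at hour 6.
times = [0, 4, 8, 12, 16, 20]
series = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
mesor, amplitude, peak_h = cosinor_fit(times, series)
```

      In this sketch the rhythm is summarized by three numbers (mesor, amplitude, and peak time), which is why a single p-value per rhythm is usually reported alongside them.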

      For Figure 2, Supplement 1E, what are the two bars for each genotype?

      We appreciate the reviewer pointing this out and will further explain this test in the figure with labels and in the legend.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Editor's comments:

      I would encourage you to submit a revised version that addresses the following two points:

      [a] The point from Reviewer #1 about a possible major confounding factor. The following article might be germane here: Baas and Fennell, 2019: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3339568

      I don’t believe that the point raised by reviewer 1 is a confounder, see my response below.

      The highlighted article was on my reading list, but I did not cite it because I was confused by its methods.

      [b] The point from Reviewer #4 about the abstract. It is important that the abstract says something about how reviewers reacted to the original versions of articles in which they were cited (ie, the odds ratio = 0.84, etc result), before going on to discuss how they reacted to revised articles (ie, the odds ratio = 1.61, etc result). I would suggest doing this along the following lines - but please feel free to reword the passage "but this effect was not strong/conclusive":

      When reviewers were cited in the original version of the article under review, they were less likely to approve the article compared with reviewers who were not cited, but this effect was not strong/conclusive (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03). However, when reviewers were cited in the revised version of the article, they were more likely to approve compared with reviewers who were not cited (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23).

      I have changed the abstract to include the odds ratios for version 1 and have used the same wording as from the main text.
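      As a hedged reading aid for the two intervals quoted above, the sketch below checks two things using only the reported numbers: whether each interval excludes an odds ratio of 1, and whether the point estimate sits at the geometric mean of the bounds, as expected for an interval that is symmetric on the log-odds scale. The helper names are hypothetical, and the symmetry check is an assumption about how the intervals were constructed, not a description of the author's method.

```python
import math


def log_scale_midpoint(lower, upper):
    """Geometric mean of the bounds, i.e. the midpoint on the log scale."""
    return math.sqrt(lower * upper)


def excludes_one(lower, upper):
    """True if the interval does not contain an odds ratio of 1."""
    return not (lower <= 1.0 <= upper)


# Reported values: (odds ratio, lower bound, upper bound) per article version.
reported = {
    "version 1": (0.84, 0.69, 1.03),
    "version 2+": (1.61, 1.16, 2.23),
}

summary = {
    label: (round(log_scale_midpoint(lo, hi), 2), excludes_one(lo, hi))
    for label, (_, lo, hi) in reported.items()
}
```

      Under these assumptions, the version 1 interval contains 1 (consistent with the "not strong/conclusive" wording), whereas the interval for revised versions does not.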

      Reviewer #1 (Public review):

      Summary:

      The work used open peer reviews and followed them through a succession of reviews and author revisions. It assessed whether a reviewer had requested that the author include additional citations and references to the reviewer's work. It then assessed whether the author had followed these suggestions and what the probability of acceptance was based on the author's decision. Reviewers who were cited were more likely to recommend the article for publication when compared with reviewers that were not cited. Reviewers who requested and received a citation were much more likely to accept than reviewers that requested and did not receive a citation.

      Strengths and weaknesses:

      The work's strengths are the in-depth and thorough statistical analysis it contains and the very large dataset it uses. The methods are robust and reported in detail.

      I am still concerned that there is a major confounding factor: if you ignore the reviewers' requests for citations, are you more likely to have ignored all their other suggestions too? This has now been mentioned briefly and slightly circuitously in the limitations section. I would still like this (I think) major limitation to be given more consideration and discussion, although I am happy that it cannot be addressed directly in the analysis.

      This is likely to happen, but I do not think it’s a confounder. A confounder needs to be associated with both the outcome and the exposure of interest. If we consider forthright authors who are more likely to rebuff all suggestions, then they would receive just as many citation and self-citation requests as authors who were more compliant. The behaviour of forthright authors would likely only reduce the association seen in most authors which would be reflected in the odds ratios.

      Reviewer #2 (Public review):

      Summary:

      This article examines reviewer coercion in the form of requesting citations to the reviewer's own work as a possible trade for acceptance and shows that, under certain conditions, this happens.

      Strengths:

      The methods are well done and the results support the conclusions that some reviewers "request" self-citations and may be making acceptance decisions based on whether an author fulfills that request.

      Weakness:

      I thank the author for addressing my comments about the original version.

      Reviewer #3 (Public review):

      Summary:

      In this article, Barnett examines a pressing question regarding citing behavior of authors during the peer review process. In particular, the author studies the interaction between reviewers and authors, focusing on the odds of acceptance, and how this may be affected by whether or not the authors cited the reviewers' prior work, whether the reviewer requested such citations be added, and whether the authors complied/how that affected the reviewer decision-making.

      Strengths:

      The author uses a clever analytical design, examining four journals that use the same open peer review system, in which the identities of the authors and reviewers are both available and linkable to structured data. Categorical information about the approval is also available as structured data. This design allows a large scale investigation of this question.

      Weaknesses:

      My original concerns have been largely addressed. Much more detail is provided about the number of documents under consideration for each analysis, which clarifies a great deal.

      Much of the observed reviewer behavior disappears or has much lower effect sizes depending on whether "Accept with Reservations" is considered an Accept or a Reject. This is acknowledged in the results text. Language has been toned down in the revised version.

      The conditional analysis on the 441 reviews (lines 224-228) does support the revised interpretation as presented.

      No additional concerns are noted.

      Reviewer #4 (Public review):

      Summary:

      This work investigates whether a citation to a referee made by a paper is associated with a more positive evaluation by that referee for that paper. It provides evidence supporting this hypothesis. The work also investigates the role of self-citations by referees where the referee would ask authors to cite the referee's paper.

      Strengths:

      This is an important problem: referees for scientific papers must provide their impartial opinions rooted in core scientific principles. Any undue influence due to the role of citations breaks this requirement. This work studies the possible presence and extent of this.

      The methods are solid and well done. The work uses a matched pair design which controls for article-level confounding and further investigates robustness to other potential confounds.

      Weaknesses:

      The authors have addressed most concerns in the initial review. The only remaining concern is the asymmetric reporting and highlighting of version 1 (null result) versus version 2 (rejecting null). For example the abstract says "We find that reviewers who were cited in the article under review were more likely to recommend approval, but only after the first version (odds ratio = 1.61; adjusted 99.4% CI: 1.16 to 2.23)" instead of a symmetric sentence "We find ... in version 1 and ... in version 2".

      The latest version now includes the results for both versions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public review):

      (1) Why would BPS not reduce RLS in WT cells? The authors could test whether OE of FIT2 reduces RLS in WT cells.  

      Our data indicate that the iron regulon gets turned on naturally in old cells, presumably due to reduced iron sensing, limiting their lifespan. Although we haven’t tested it experimentally, BPS would presumably also turn on the iron regulon in wild-type cells and would therefore have an effect redundant with the activation of the iron regulon that occurs naturally during normal aging. It may be interesting in the future to see whether higher levels of BPS can shorten the lifespan of wild-type cells. Similarly, we would predict that overexpression of FIT2 may reduce the lifespan, as its deletion has been shown to extend RLS.

      (2) The authors should add a brief explanation for why the GDP1 promoter was chosen for Ssd1 OE.

      We used the same promoter that was used to overexpress Ssd1 in all previous studies. This is now stated in the text along with the relevant citations. 

      (3) On page 12, growth to saturation was described as glucose starvation. This is more accurately described as nutrient deprivation. Referring to it as glucose starvation is akin to CR, which growing to saturation is not. Ssd1 OE formed condensates upon saturation but not in CR. Why do the authors think Ssd1 OE did not form condensates upon CR?

      Too mild a stress?

      This is a fair comment, and we have now changed glucose starvation to nutrient deprivation, as it is more accurate. The effects of nutrient starvation are profound: the cell cycle stops, autophagy is induced, cells undergo the diauxic shift, and metabolism changes. None of these changes occur during calorie restriction (0.05% glucose), so it is not too surprising that Ssd1 does not form condensates during CR. We speculate that the stress is just too mild.

      (4) The authors conclude that the main mechanism for RLS extension in CR and Ssd1 OE is the inhibition of the iron regulon in aging cells. The data certainly supports this. However, this may be an overstatement as other mutations block CR, such as mutations that impair respiration. The authors do note that induction of the iron regulon in aging cells could be a response to impaired mitochondrial function. Thus, it seems that the main goal of CR and Ssd1 OE may be to restore mitochondrial function in aging cells, one way being inactivation of the iron regulon. A discussion of how other mutations impact CR would be of benefit.

      While some labs have shown that respiration impacts CR, this is not the case in other studies. For example, an impactful paper by Kaeberlein et al., PLOS Genetics 2005 showed that CR does extend lifespan in respiratory deficient strains using many different strain backgrounds.

      (5) The cell cycle regulation of Ssd1 OE condensates is very interesting. There does not appear to be literature linking Ssd1 with proteasome-dependent protein turnover. Many proteins involved in cell cycle regulation and genome stability are regulated through ubiquitination. It is not necessary to do anything here about it, but it would be interesting to address how Ssd1 condensates may be regulated with such precision.

      We see no evidence of changes in Ssd1 protein intensity during the cell cycle. We therefore speculate that the difference arises at the post-translational level rather than from Ssd1 degradation. A cell cycle-regulated phosphatase and kinase are known to regulate Ssd1 phosphorylation and condensation state, and the timing of their function matches when the Ssd1 condensates appear and dissolve in the cell cycle. We have now discussed this and allude to it in the model.

      (6) While reading the draft, I kept asking myself what the relevance to human biology was. I was very impressed with the extensive literature review at the end of the discussion, going over how well conserved this strategy is in yeast with humans. I suggest referring to this earlier, perhaps even in the abstract. This would nail down how relevant this model is for understanding human longevity regulation.

      Thank you, we have now mentioned in the abstract the relevance to human work. 

      In conclusion, I enjoyed reading this manuscript, describing how Ssd1 OE and CR lead to RLS increases, using different mechanisms. However, since the 2 strategies appear to be using redundant mechanisms, I was surprised that synergism was not observed.

      We thank the reviewer for their kind comment. We propose that Ssd1 overexpression impacts the levels of the iron regulon transcripts, which would be downstream of the point in the pathway that is affected by CR, i.e., nuclear localization of Aft1. The lack of synergy fits with this model, as Ssd1 overexpression cannot impact the iron regulon transcripts if they are not induced due to CR. We have now improved the model to make the impact of these different anti-aging interventions on activation of the iron regulon more clear.

      Reviewer #3 (Public review):

      My main concern is that the central reasoning of the paper, that Ssd1 overexpression and CR prevent the activation of the iron regulon, appears to be contradicted by previous findings, and the authors may actually be misrepresenting these studies, unless I am mistaken. In the manuscript, the authors state on two occasions:

      "Intriguingly, transcripts that had altered abundance in CR vs control media and in SSD1 vs ssd1∆ yeast included the FIT1, FIT2, FIT3, and ARN1 genes of the iron regulon (8)"

      "Ssd1 and CR both reduce the levels of mRNAs of genes within the iron regulon: FIT1, FIT2, FIT3 and ARN1 (8)"

      However, reference (8) by Kaeberlein et al. actually says the opposite:

      "Using RNA derived from three independent experiments, a total of 97 genes were observed to undergo a change in expression >1.5-fold in SSD1-V cells relative to ssd1d cells (supplemental Table 1 at http://www.genetics.org/supplemental/). Of these 97 genes, only 6 underwent similar transcriptional changes in calorically restricted cells (Table 2). This is only slightly greater than the number of genes expected to overlap between the SSD1-V and CR datasets by chance and is in contrast to the highly significant overlap in transcriptional changes observed between CR and HAP4 overexpression (Lin et al. 2002) or between CR and high external osmolarity (Kaeberlein et al. 2002). Intriguingly, of the 6 genes that show similar transcriptional changes in calorically restricted cells and SSD1-V cells, 4 are involved in iron-siderochrome transport: FIT1, FIT2, FIT3, and ARN1 (supplemental Table 1 at http://www.genetics.org/supplemental/)."

      Although the phrasing might be ambiguous at first reading, this interpretation is confirmed upon reviewing Matt Kaeberlein's PhD thesis: https://dspace.mit.edu/handle/1721.1/8318 (page 264 and so on).

      Moreover, consistent with this, activation of the iron regulon during calorie restriction (or the diauxic shift) has also been observed in two other articles:

      https://doi.org/10.1016/S1016-8478(23)13999-9

      https://doi.org/10.1074/jbc.M307447200

      Taken together, these contradictory data might blur the proposed model and make it unclear how to reconcile the results.

      We thank the reviewer for pointing this out. Upon further consideration, we have removed all mention of this paper from our manuscript because it is not relevant to our situation: the mRNA abundance studies during CR, or with and without Ssd1, were not performed under conditions in which the iron regulon is activated, such as aging, so there would have been no opportunity to detect reduced transcript levels due to CR or the presence of Ssd1. Also, none of these studies were performed with Ssd1 overexpression, which is the situation we are examining. Our data clearly show that Ssd1 overexpression reduced, and CR prevented, production of proteins from the iron regulon during aging.

      We do not feel that the iron regulon being activated by nutrient depletion at the diauxic shift is a fair comparison to the situation in cells happily dividing during CR. The levels of nutrient deprivation used in those studies have profound effects, including arresting cell growth, activating autophagy, and altering metabolism. The level of CR that we use (0.05% glucose) does not activate any of these changes, nor does it activate the iron regulon in young or old cells (Fig. 4).

      Reviewer #1 (Recommendations for the authors):

      (1) The role of Ssd1 condensate formation in mRNA sequestration and lifespan expansion remains unclear. Thus, the study involves two parts (Ssd1 condensate formation and lifespan expansion via limiting Fe2+ accumulation), which are poorly linked. The study will therefore benefit from further data linking the two aspects.

      Future experiments are planned to determine what mRNAs reside in the age-induced Ssd1 overexpression condensates, to determine if they include the iron regulon transcripts. This will require us to optimize isolation of old cells and isolation of the Ssd1 condensates from them, and is beyond the scope of the present study.

      (2) The beneficial effects of Ssd1 overexpression and calorie restriction (CR) on lifespan are epistatic, yet the claim that both experimental conditions act via the same pathway should be further documented. It is recommended to combine Ssd1 overexpression with a well-defined condition that expands lifespan through a mechanism not involving changes in Fe2+ levels. A further increase in lifespan upon combining such conditions would at least indirectly support the authors' claim.

      We have more than epistatic evidence to indicate that Ssd1 overexpression and CR are in the same pathway. Ssd1 overexpression and CR result in failure to properly induce the iron regulon during aging and subsequent reduced levels of iron, resulting in lifespan extension, supporting that they act via the same pathway. We do appreciate the point though and epistasis analyses are on our list for future studies.

      (3) It is highly recommended to analyze ssd1 knockout cells: Is the shortened lifespan caused by intracellular Fe2+ accumulation, as predicted by the model? Does the knockout lead to an overactivation of the iron regulon? Such analysis will also document the physiological relevance of authentic Ssd1 levels in controlling yeast lifespan. The authors could test this possibility by determining intracellular Fe2+ levels (as done in Figure 5) and testing whether the mutant cells are partially rescued by the presence of an iron chelator (as done in Figure 5C).

      We don’t think the normal role of Ssd1 is to sequester the iron regulon mRNAs to prevent its activation, given that wild-type yeast with endogenous Ssd1 activate the iron regulon during aging. Rather, the failure to activate the iron regulon during aging is unique to Ssd1 overexpression and does not occur at endogenous Ssd1 levels. As such, it may not be the case that the short lifespan of ssd1 yeast is due to iron accumulation (if that happens); yeast lacking SSD1 also have cell wall biogenesis problems, and defects in cell wall biogenesis shorten the replicative lifespan (Molon et al., Biogerontology 2018, PMID 29189912).

      (4) Figure 4: The authors could not analyze the impact of Ssd1 overexpression on the localization of GFP-Aft1 due to synthetic sickness. This was not observed under calorie restriction (CR) conditions and is therefore unexpected. Why should Ssd1 overexpression and CR have such diverse impacts on cellular physiology when combined with GFP-Aft1? Isn't that observation arguing against CR and increased Ssd1 levels acting through the same pathway? A further clarification of this point is necessary.

      Without further experimentation, we can only speculate that cellular changes that are unique to overexpression of Ssd1 and not shared with CR cause a negative interaction with GFP-Aft1. Of note, Aft1 has functions in addition to its role in activating the iron regulon (aft1∆ strains have a growth defect independent of its role in iron regulon activation [27]), and we have shown previously that overexpression of Ssd1 reduces global protein translation. Future experiments would be necessary to delineate the basis for this synthetic sickness.

      (5) Lowering Fe2+ levels upon Ssd1 overexpression is predicted to reduce oxidative stress. It is suggested to determine ROS levels upon Ssd1 overexpression to bolster that point.

      This is a great suggestion. The lowering of Fe2+ in the Ssd1 mutants is something that happens at the end of the lifespan and therefore we would need to do experiments to detect reduced ROS using a live dye on our microfluidics platform. We are not aware of any live fluorescent reporters of ROS.  

      Reviewer #2 (Recommendations for the authors):

      (1) Page 6, 7th line of Replicative lifespan analyses, there is a double bracket.

      This has been corrected. Thank you.

      (2) Page 18, line 6 of "failure to activate..." section, "revered" should be replaced with "reversed".

      This has been corrected. Thank you.

      (3) Page 23, fix writing on line 2 of "Effects of CR..." section.

      This has been corrected. Thank you.

      (4) Page 24, Author contributions section, replace "performed devised" with "designed".

      This has been corrected. Thank you.

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 3C: The panel legend is somewhat confusing due to the color scheme and the scattering of labels across panels. A more consistent labeling strategy would help readability.

      We agree, and the labelling has now been improved. Thank you. 

      (2) Figure 3D vs Figure 3B: it appears that Fit2 activation occurs substantially earlier than Aft1 translocation, which reduces the predictive value of Fit2 compared to Aft1. This is puzzling given that Fit2 is expected to be a direct target of Aft1. Could this discrepancy be related to the thresholding used for Fit2-mCherry display? The color scale in Figure 3D is also somewhat misleading, as most of the segments appear greenish. A continuous color gradient, perhaps restricted to the [10-120] interval, might give a clearer picture of iron regulon activation.

      For the Aft1-mCherry experiment, we are only able to accurately annotate nuclear localization when Aft1 has been fully (or mostly) translocated into the nucleus from the cytoplasm, so these data are likely to be on the conservative side. However, activation of the iron regulon likely occurs as Aft1 is translocated into the nucleus, so a minimal initial amount of Aft1 (which we don’t have enough resolution in this system to detect) could be enough for FIT2 and ARN1 induction. By contrast, the Fit2 and Arn1 signal measures an increase over a background of nothing, so it is very easy to detect even at low-level induction. To allow readers to see all our data without over-thresholding, we prefer to present the induction of Fit2 and Arn1 at all intensity levels, including the very low-level induction (green).

      (3) "In control strains, expression of Fit2 and Arn1 varied across the population, but generally increased with age": for the right panel, normalization might be more appropriate. What is the fold change in fluorescence during lifespan? Reporting ΔmCherry intensity alone does not provide a quantitative measure of induction.

      We have changed the figure to show quantitation as fold change, as suggested.
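      The normalization the reviewer asks for can be illustrated with a minimal sketch, assuming fold change is defined as each intensity divided by a baseline measurement (here, the first time point of the same cell). The function name, the choice of baseline, and the example intensities are hypothetical and are not taken from the manuscript.

```python
def fold_change(intensities, baseline=None):
    """Express each intensity relative to a baseline (default: first value)."""
    if baseline is None:
        baseline = intensities[0]
    if baseline <= 0:
        # A zero or negative baseline makes the ratio undefined or misleading.
        raise ValueError("baseline intensity must be positive")
    return [value / baseline for value in intensities]


# Hypothetical background-subtracted mCherry intensities across a lifespan.
trace = [50.0, 100.0, 200.0]
relative = fold_change(trace)
```

      Unlike a raw intensity difference, the ratio is comparable across cells with different starting signal, but it becomes unstable when the baseline is close to zero.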

      (4) Figure 6 (model): The model figure is conceptually useful but not easy to follow in its current form; a revised schematic with a clearer depiction of the pathway activations at different replicative ages would be helpful.

      We have changed the figure to make the model more clear, as suggested.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Crohn's disease is a prevalent inflammatory bowel disease that often results in patient relapse post anti-TNF blockades. This study employs a multifaceted approach utilizing single-cell RNA sequencing, flow cytometry, and histological analyses to elucidate the cellular alterations in pediatric Crohn's disease patients pre and post-anti-TNF treatment and comparing them with non-inflamed pediatric controls. Utilizing an innovative clustering approach, the research distinguishes distinct cellular states that signify the disease's progression and response to treatment. Notably, the study suggests that the anti-TNF treatment pushes pediatric patients towards a cellular state resembling adult patients with persistent relapses. This study's depth offers a nuanced understanding of cell states in CD progression that might forecast the disease trajectory and therapy response.

      Robust Data Integration: The authors adeptly integrate diverse data types: scRNA-seq, histological images, flow cytometry, and clinical metadata, providing a holistic view of the disease mechanism and response to treatment.

      Novel Clustering Approach: The introduction and utilization of ARBOL, a tiered clustering approach, enhances the granularity and reliability of cell type identification from scRNA-seq data.

      Clinical Relevance: By associating scRNA-seq findings with clinical metadata, the study offers potentially significant insights into the trajectory of disease severity and anti-TNF response, which might help with personalized treatment regimens.

      Treatment Dynamics: The transition of the pediatric cellular ecosystem towards an adult, more treatment-refractory state upon anti-TNF treatment is a significant finding. It would be beneficial to probe deeper into the temporal dynamics and the mechanisms underlying this transition.

      Comparative Analysis with Adult CD: The positioning of on-treatment biopsies between treatment-naïve pediCD and on-treatment adult CD is intriguing. A more in-depth exploration comparing pediatric and adult cellular ecosystems could provide valuable insights into disease evolution.

      Areas of improvement:

      (1) The legends accompanying the figures are quite concise. It would be beneficial to provide a more detailed description within the legends, incorporating specifics about the experiments conducted and a clearer representation of the data points. 

      We agree that it is beneficial to have descriptive figure legends that balance elements of experimental design, methodology, and statistical analyses employed in order to have a clear understanding throughout the manuscript. We have gone through and clarified areas throughout.  

      (2) Statistical significance is missing from Fig. 1c WBC count plot, Fig. 2 b-e panels. Please provide it even if it's not significant. Also, the legend should have the details of stat test used.

      We have now added details of statistical significance in the Figure 1 legends. Please note that the Mann-Whitney U-test was used for clinical categorical data.

      (3) In the study, the NOA group is characterized by patients who, after thorough clinical evaluations, were deemed to exhibit milder symptoms, negating the need for anti-TNF prescriptions. This mild nature could potentially align the NOA group closer to FGID, a condition intrinsically defined by its low to non-inflammatory characteristics. Such an alignment sparks curiosity: is there a marked correlation between these two groups? A preliminary observation suggesting such a relationship can be spotted in Figure 6, particularly panels A and B. Given the prevalence of FGID among the pediatric population, it might be prudent for the authors to delve deeper into this potential overlap, as insights gained from mild-CD cases could provide valuable information for managing FGID.

      Thank you for this insightful point. On histopathology and endoscopy, the NOA group exhibited microscopic and macroscopic inflammation, which led to the CD diagnosis in these patients, albeit mild on both micro and macro accounts. By contrast, the FGID group by definition will not have inflammation on microscopic and macroscopic evaluation. There is great interest in the field of adult and pediatric gastroenterology in understanding why patients develop symptoms without evidence of inflammation. However, in 2023 the diagnostic tools of endoscopy with biopsy and histopathology are not sensitive enough to detect transcript-level inflammation, positioning single-cell technology to reveal further information in both disease processes.

      Based on the reviewer’s suggestions, we have calculated a heatmap of overlapping NOA and FGID cell states along the Figure 6a joint-PC1, showing where NOA CD patients and FGID patients overlap in terms of cell states. This is displayed in Supplemental Figure 15d. This revealed a set of T, Myeloid, and Epithelial cell states that were most important in describing variance along the FGID-CD axis, allowing us to home in on similarities at the boundary between FGID and CD. By comparing the joint cell states with CD atlas curated cluster names, we identified CCR7-expressing T cell states and GSTA2-expressing epithelial states associated with this overlap.

      (4) Furthermore, Figure 7 employs multi-dimensional immunofluorescence to compare CD, encompassing all its subtypes, with FGID. If the data permits, subdividing CD into PR, FR, and NOA for this comparison could offer a more nuanced understanding of the disease spectrum. Such a granular perspective is invaluable for clinical assessments. The key question then remains: do the sample categorizations for the immunofluorescence study accommodate this proposed stratification?

      Thank you for the thoughtful discussion. We agree that stratifying Crohn’s disease by PR, FR, and NOA would provide valuable clinical insight. Unfortunately our multiplex IF cohort was designed to maximize overall CD versus FGID comparisons and does not contain enough samples in patient subgroups to power such an analysis. We have highlighted this limitation in the text.  

      (5) The study's most captivating revelation is the proximity of anti-TNF-treated pediatric CD (pediCD) biopsies to adult treatment-refractory CD. Such an observation naturally raises the question: How does this alignment compare to a standard adult colon, and what proportion of this similarity is genuinely disease-specific versus reflective of an adult state? To what degree does the similarity highlight disease-specific traits?

      Delving deeper, it will be of interest to see whether anti-TNF treatment is nudging the transcriptional state of the cells towards a more mature adult stage or veering them into a treatment-resistant trajectory. If anti-TNF therapy is indeed steering cells toward a more adult-like state, it might signify a natural maturation process; however, if it's directing them toward a treatment-refractory state, the long-term therapeutic strategies for pediatric patients might need reconsideration.

      Thank you to the reviewer for another insightful point. We agree that age-matched samples are critical to evaluate disease cell states, and hence we have age-matched controls in our pediatric cohort. Our follow-up timeline spans only 3 years, so patients remain in the pediatric age range at the time of follow-up endoscopy and biopsy and would not reflect an adult GI state. We believe that the cellular behavior from the treatment-naïve biopsy to the on-treatment biopsy reflects disease state rather than movement towards an adult-like state. We would also like to point out that pediatric-onset IBD (Crohn’s and ulcerative colitis) has traditionally been harder to treat and presents with a more extensive disease state (PMID: 22643596), and the ability to detect the need for therapy escalation or change would be an invaluable tool for clinicians.

      We share the reviewer’s interest in disentangling a natural maturation process from disease- and treatment-specific changes. Because the patients who were not given treatment did not move towards the adult-like phenotype, the shift could point to a push towards a treatment-resistant trajectory. To further support these findings, we generated a new disease-pseudotime figure, Supplemental Figure 17, using cross-validation methods and the TradeSeq package. This figure was designed to track how each pediatric sample shifts from the treatment-naïve state through anti-TNF therapy and to test the robustness of these shifts across samples. The new visualizations show patterns that do not recapitulate natural aging processes but rather shifts across all cell types associated with anti-TNF treatment.

      Reviewer #2 (Public Review):

      Summary:

      Through this study, the authors combine a number of innovative technologies including scRNAseq to provide insight into Crohn's disease. Importantly samples from pediatric patients are included. The authors develop a principled and unbiased tiered clustering approach, termed ARBOL. Through high-resolution scRNAseq analysis the authors identify differences in cell subsets and states during pediCD relative to FGID. The authors provide histology data demonstrating T cell localisation within the epithelium. Importantly, the authors find anti-TNF treatment pushes the pediatric cellular ecosystem toward an adult state.

      Strengths:

      This study is well presented. The introduction clearly explains the important knowledge gaps in the field, the importance of this research, the samples that are used, and study design.

      The results clearly explain the data, without overstating any findings. The data is well presented. The discussion expands on key findings and any limitations to the study are clearly explained.

      I think the biological findings from, and bioinformatic approach used in this study, will be of interest to many and significantly add to the field.

      Weaknesses:

      (1) The ARBOL approach for iterative tiered clustering on a specific disease condition was demonstrated to work very well on the datasets generated in this study where there were no obvious batch effects across patients. What if strong batch effects are present across donors where PCA fails to mitigate such effects? Are there any batch correction tools implemented in ARBOL for such cases?

      We thank the reviewer for their insightful point. The full extent to which ARBOL can address batch effects requires further study. To this end, we integrated Harmony into the ARBOL architecture and used it in the paper to integrate a previous study with the data presented (Figure 8). We have added instructions to ARBOL’s GitHub README on how to use Harmony with the automated clustering method. With ARBOL, as with traditional clustering methods, batch effects can cause artifactual clustering at any tier of clustering. Due to iteration, batch effects can present themselves in a single round of clustering, followed by further rounds of clustering that appear highly similar within each batch subset. Harmony addresses this issue, removing these batch-related clustering rounds. The later arrangement of fine-grained clusters using the bottom-up approach can use the batch-corrected latent space to calculate relationships between cell states, removing the effects from both sides of the algorithm. As stated, the extent to which ARBOL can be used to systematically address these batch effects requires further research, but the algorithmic architecture of ARBOL is well suited to address these effects.

      (2) The authors mentioned that the clustering tree from the recursive sub-clustering contained too much noise, and they therefore used another approach to build a hierarchical clustering tree for the bottom-level clusters based on unified gene space. But in general, how consistent are these two trees?

      Thank you for this thoughtful question. The two tree methodologies are not consistent due to their algorithmic differences, but both are important for several reasons: 

      (1) The clustering tree is top-down, meaning low-resolution lineage-related clusters are calculated first. Doublets and quality differences can cause very small clusters of different lineages (endothelial vs fibroblast) to fall under the incorrect lineage at first in the sub-clustering tree, but these are recaptured during further sub-clustering rounds, and then disentangled by the cluster-centroid tree.

      (2) The hierarchical tree is a rose tree, meaning each branching point can contain several daughter branches, while taxonomies based on distances between species (or cell types in this case) are binary trees with only 2 branches per branching point, because distances between each cluster are unique. Because this taxonomic, or bottom-up, approach is different from the top-down approach, it is useful to then look at how these bottom-level clusters are similar. To that end, we performed pairwise differential expression between all end clusters and clustered based on those genes.

      (3) Calculation of a binary tree represents a quantitative basis for comparing the transcriptomic distance between clusters as opposed to relying on distances calculated within a heuristic manifold such as UMAP or algorithmic similarity space such as cluster definitions based on KNN graphs.

      In practice, this dual view rescues small clusters that may have been mis-grouped by technical artifacts and gives a quantitative, distance-based hierarchy that can be compared across metadata covariates.
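
The bottom-up step described above can be sketched as agglomerative clustering over end-cluster centroids. This is a minimal pure-Python illustration, not ARBOL's implementation: the cluster names, the two-dimensional centroid values, the Euclidean distance, and the average-linkage rule are all assumptions chosen for the example. It shows only why a distance-based taxonomy comes out binary: each step merges exactly the two closest groups.

```python
from itertools import combinations
from math import dist


def average_linkage_distance(members_a, members_b, centroids):
    """Mean pairwise Euclidean distance between two groups of end clusters."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(dist(centroids[a], centroids[b]) for a, b in pairs) / len(pairs)


def build_binary_tree(centroids):
    """Repeatedly merge the two closest groups; every internal node has 2 children."""
    # Each node is (tree, members): tree is a cluster name or a nested 2-tuple.
    nodes = [(name, (name,)) for name in sorted(centroids)]
    while len(nodes) > 1:
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda ij: average_linkage_distance(
                nodes[ij[0]][1], nodes[ij[1]][1], centroids
            ),
        )
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


# Hypothetical end-cluster centroids in a 2-gene expression space.
centroids = {
    "T_1": (0.0, 0.0),
    "T_2": (0.0, 1.0),
    "Epi_1": (10.0, 0.0),
    "Epi_2": (10.0, 1.5),
}
tree = build_binary_tree(centroids)
print(tree)
```

In this toy input the two T clusters merge first, then the two epithelial clusters, and the root joins the two lineages, regardless of where a top-down pass might have placed a small cluster initially.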

    1. Reviewer #1 (Public review):

      Summary:

      This paper reports model simulations and a human behavioral experiment studying predictive learning in a multidimensional environment. The authors claim that semantic biases help people resolve ambiguity about predictive relationships due to spurious correlations.

      Strengths:

      (1) The general question addressed by the paper is important.

      (2) The paper is clearly written.

      (3) Experiments and analyses are rigorously executed.

      Weaknesses:

      (1) Showing that people can be misled by spurious correlations, and that they can overcome this to some extent by using semantic structure, is not especially surprising to me. Related literature already exists on illusory correlation, illusory causation, superstitious behavior, and inductive biases in causal structure learning. None of this work features in the paper, which is rather narrowly focused on a particular class of predictive representations, which, in fact, may not be particularly relevant for this experiment. I also feel that the paper is rather long and complex for what is ultimately a simple point based on a single experiment.

      (2) Putting myself in the shoes of an experimental subject, I struggled to understand the nature of semantic congruency. I don't understand why the idea that the builder and terminal robots should have similar features is considered a natural semantic inductive bias. Humans build things all the time that look different from them, and we build machines that construct artifacts that look different from the machines. I think the fact that the manipulation worked attests to the ability of human subjects to pick up on patterns rather than supporting the idea that this reflects an inductive bias they brought to the experiment.

      (3) As the authors note, because the experiment uses only a single transition, it's not clear that it can really test the distinctive aspects of the SR/SF framework, which come into play over longer horizons. So I'm not really sure to what extent this paper is fundamentally about SFs, as it's currently advertised.

      (4) One issue with the inductive bias as defined in Equation 15 is that I don't think it will converge to the correct SR matrix. Thus, the bias is not just affecting the learning dynamics, but also the asymptotic value (if there even is one; that's not clear either). As an empirical model, this isn't necessarily wrong, but it does mess with the interpretation of the estimator. We're now talking about a different object from the SR.

      (5) Some aspects of the empirical and model-based results only provide weak support for the proposed model. The following null effects don't agree with the predictions of the model:

      (a) No effect of condition on reward.

      (b) No effect of condition on composition spurious predictiveness.

      (c) No effect of condition on the fitted bias parameter. The authors present some additional exploratory analyses that they use to support their claims, but this should be considered weaker support than the results of preregistered analyses.

      (6) I appreciate that the authors were transparent about which predictions weren't confirmed. I don't think they're necessarily deal-breakers for the paper's claims. However, these caveats don't show up anywhere in the Discussion.

      (7) I also worry that the study might have been underpowered to detect some of these effects. The preregistration doesn't describe any pilot data that could be used to estimate effect sizes, and it doesn't present any power analysis to support the chosen sample sizes, which I think are on the small side for this kind of study.

    1. Smith suggests that experimental data can help us better understand the causal mechanisms behind typological generalizations, something observational typological studies cannot do. We generally agree that some research setups are more adequate for investigating certain types of questions, and a division of labor, or triangulation, makes sense from this perspective. The difficulty emerges, again, with cases of disagreeing results between experimental and typological studies. Smith provides two very insightful examples of such cases. We will react to the first example, as it concerns a topic that we also explored in previous work, namely the relation between sociolinguistic factors and linguistic complexity (cf. Becker et al. 2023; Guzmán Naranjo et al. 2025). In both cases, we failed to find clear, convincing evidence for sociolinguistic correlates of linguistic complexity. In contrast, Smith (2024) reports on an artificial language learning experiment that supports the presence of mechanisms proposed in the typological literature to account for an association between sociolinguistic factors and linguistic complexity. In such a situation, the important question arises: how can we understand the discrepancy between the results? Smith mentions two hypotheses: (i) the factors identified in the experiments are outweighed by other factors in the wild, and (ii) natural language data cannot show the correlation with sufficient confidence. We agree, and we can think of a number of other potential explanations that can lead to the situation of finding an effect of, e.g., socio-linguistic factors on linguistic complexity in experimental studies but not in typological ones. 
We think that all these issues should be explored and subsequently discarded in order to understand diverging results:

experimental studies:
- the experimental design may not be suitable
- the experimental study may not reflect natural language learning
- the data analysis of the experimental study may have issues

typological studies:
- the study may not operationalize the actual socio-linguistic hypotheses well
- the data collection and annotation may contain too many mistakes
- the language sample may be too small to detect the (potentially weak) effects
- the language sample may be wrong in just the right way, hiding the effects
- the data analysis of the typological study may have issues

These issues all highlight the possibility that either the experimental or typological studies could lead to fundamentally incorrect results. This goes back to our main point: we can only increase our confidence about our findings with more transparency about the work process, with robustness tests and with replication. If at some point we reach high confidence about results from both experimental and typological studies, and these still diverge, we can then start to think about how and why they diverge. Currently, we do not believe that we can have high certainty about our typological results regarding sociolinguistic effects on linguistic complexity to begin with. Therefore, we should be cautious when trying to interpret differences between the typological and experimental results.

      B&GN appreciate Smith’s contribution and agree on the importance of combining typology with cognitive experiments. However, whereas Smith proposes two explanations for mismatches between typological and experimental results, B&GN argue that there are many more possible explanations (they list methodological problems in both approaches). B&GN think we cannot yet blindly trust the typological results, because they remain uncertain.

    1. If one’s goal is primarily to document constraints on cross-linguistic variation then this is obviously deeply troubling. However, if the central interest is the cognitive and interactional mechanisms responsible for those constraints – what it is about the way languages are learned, used and transmitted that leads to convergent cultural evolution on recurring constellations of linguistic features (see e.g. Haspelmath 2019, 2021) – then this uncertainty may be less problematic than it first appears, since we should in any case be running controlled experiments to test hypotheses about those mechanisms. B&GN (Becker and Guzmán Naranjo 2025) refer to experimental approaches briefly in a footnote as “triangulation”, “the combination of different empirical approaches to study the same phenomenon in order to test how robust results are across methods and to, ideally, find converging evidence”. I think the value of experimental work lies not in providing some additional data from another source, but a fundamentally different kind of data which allows us to test cognitive and interactional mechanisms hypothesised to be responsible for potential universals. Being observational, no matter how rigorously conducted, analyses of typological data cannot speak to those causal mechanisms. However, the observational data from typology is a rich source of potential hypotheses about mechanisms shaping linguistic systems, which can subsequently be tested in controlled experiments that can go beyond correlation and speak to causality.

      According to Smith, analyses of typological data can be a source of potential hypotheses about the mechanisms shaping linguistic systems, but they cannot speak to those causal mechanisms. Here lies the value of experimental work: it tests the cognitive and interactional mechanisms that may be the cause of universals. For this reason, unlike B&GN, Smith thinks experimental data shouldn’t be used only to test the robustness of results about the same phenomenon.

    1. Author response:

      The following is the authors’ response to the current reviews.

      I thank the authors for their clarifications. The manuscript is much improved now, in my opinion. The new power spectral density plots and revised Figure 1 are much appreciated. However, there is one remaining point that I am unclear about. In the rebuttal, the authors state the following: "To directly address the question of whether the auditory signal was distracting, we conducted a follow-up MEG experiment. In this study, we observed a significant reduction in visual accuracy during the second block when the distractor was present (see Fig. 7B and Suppl. Fig. 1B), providing clear evidence of a distractor cost under conditions where performance was not saturated." 

      I am very confused by this statement, because both Fig. 7B and Suppl. Fig. 1B show that the visual- (i.e., visual target presented alone) has a lower accuracy and longer reaction time than visual+ (i.e., visual target presented with distractor). In fact, Suppl. Fig. 1B legend states the following: "accuracy: auditory- - auditory+: M = 7.2 %; SD = 7.5; p = .001; t(25) = 4.9; visual- - visual+: M = -7.6%; SD = 10.80; p < .01; t(25) = -3.59; Reaction time: auditory- - auditory +: M = -20.64 ms; SD = 57.6; n.s.: p = .08; t(25) = -1.83; visual- - visual+: M = 60.1 ms ; SD = 58.52; p < .001; t(25) = 5.23)." 

      These statements appear to directly contradict each other. I appreciate that the difficulty of auditory and visual trials in block 2 of MEG experiments are matched, but this does not address the question of whether the distractor was actually distracting (and thus needed to be inhibited by occipital alpha). Please clarify.

      We apologize for mixing up the visual and auditory distractor cost in our rebuttal. The reviewer is right in that our two statements contradict each other.

      To clarify: In the EEG experiment, we see a significant distractor cost for auditory distractors in accuracy (which can be seen in Suppl. Fig. 1A). We also see a faster reaction time with auditory distractors, which may speak to intersensory facilitation. As we used the same distractors for both experiments, it can be assumed that they were distracting in both experiments.

      In our follow-up MEG-experiment, as the reviewer stated, performance in block 2 was higher than in block 1, even though there were distractors present. In this experiment, distractor cost and learning effects are difficult to disentangle. It is possible that participants improved over time for the visual discrimination task in Block 1, as performance at the beginning was quite low. To illustrate this, we divided the trials of each condition into bins of 10 and plotted the mean accuracy in these bins over time (see Author response image 1). Here it can be seen that in Block 2, there is a more or less stable performance over time with a variation < 10 %. In Block 1, both for visual as well as auditory trials, an improvement over time can be seen. This is especially strong for visual trials, which span a difference of > 20%. Note that the mean performance for the 80-90 trial bin was higher than any mean performance observed in Block 2. 
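
The binning described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `binned_accuracy`, the decision to drop a trailing partial bin, and the 30 trial outcomes are assumptions made for the example.

```python
from statistics import mean


def binned_accuracy(correct, bin_size=10):
    """Mean accuracy per consecutive bin of trials; a trailing partial bin is dropped."""
    n_full_bins = len(correct) // bin_size
    return [
        mean(correct[k * bin_size:(k + 1) * bin_size])
        for k in range(n_full_bins)
    ]


# Hypothetical outcomes (1 = correct, 0 = error) for 30 trials of one condition.
trials = [
    0, 1, 0, 0, 1, 0, 1, 0, 1, 0,  # early trials
    1, 1, 0, 1, 1, 0, 1, 1, 1, 0,  # middle trials
    1, 1, 1, 1, 0, 1, 1, 1, 1, 1,  # late trials
]
accuracy = binned_accuracy(trials)
print(accuracy)
```

A rising sequence of bin means, as in this made-up input, is the pattern that would indicate learning within a block, whereas roughly flat bin means would indicate stable performance.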

      Additionally, the same paradigm has been applied in previous investigations, which also found distractor costs for the here-used auditory stimuli in blocked and non-blocked designs. See:

      Mazaheri, A., van Schouwenburg, M. R., Dimitrijevic, A., Denys, D., Cools, R., & Jensen, O. (2014). Region-specific modulations in oscillatory alpha activity serve to facilitate processing in the visual and auditory modalities. NeuroImage, 87, 356–362. https://doi.org/10.1016/j.neuroimage.2013.10.052

      Van Diepen, R & Mazaheri, A 2017, 'Cross-sensory modulation of alpha oscillatory activity: suppression, idling and default resource allocation', European Journal of Neuroscience, vol. 45, no. 11, pp. 1431-1438. https://doi.org/10.1111/ejn.13570

      Author response image 1.

      Accuracy development over time in the MEG experiment. During block 1, a performance increase over time can be observed for visual as well as for auditory stimuli. During Block 2, performance is stable over time. Data are presented as mean ± SEM. N = 27 (one participant was excluded from this analysis, as their trial count in at least one condition was below 90 trials).


      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      In this study, Brickwedde et al. leveraged a cross-modal task where visual cues indicated whether upcoming targets required visual or auditory discrimination. Visual and auditory targets were paired with auditory and visual distractors, respectively. The authors found that during the cue-to-target interval, posterior alpha activity increased along with auditory and visual frequency-tagged activity when subjects were anticipating auditory targets. The authors conclude that their results disprove the alpha inhibition hypothesis, and instead imply that alpha "regulates downstream information transfer." However, as I detail below, I do not think the presented data irrefutably disprove the alpha inhibition hypothesis. Moreover, the evidence for the alternative hypothesis of alpha as an orchestrator for downstream signal transmission is weak. Their data serve to refute only the most extreme and physiologically implausible version of the alpha inhibition hypothesis, which assumes that alpha completely disengages the entire brain area, inhibiting all neuronal activity.

      We thank the reviewer for taking the time to provide additional feedback and suggestions and we improved our manuscript accordingly.

      (1) Authors assign specific meanings to specific frequencies (8-12 Hz alpha, 4 Hz intermodulation frequency, 36 Hz visual tagging activity, 40 Hz auditory tagging activity), but the results show that spectral power increases in all of these frequencies towards the end of the cue-to-target interval. This result is consistent with a broadband increase, which could simply be due to additional attention required when anticipating auditory target (since behavioral performance was lower with auditory targets, we can say auditory discrimination was more difficult). To rule this out, authors will need to show a power spectral density curve with specific increases around each frequency band of interest. In addition, it would be more convincing if there was a bump in the alpha band, and distinct bumps for 4 vs 36 vs 40 Hz band.

      This is an interesting point with several aspects, which we will address separately.

      Broadband Increase vs. Frequency-Specific Effects:

      The suggestion that the observed spectral power increases may reflect a broadband effect rather than frequency-specific tagging is important. However, Supplementary Figure 11 shows no difference between expecting an auditory or visual target at 44 Hz. This demonstrates that (1) there is no uniform increase across all frequencies, and (2) the separation between our stimulation frequencies was sufficient to allow differentiation using our method.

      Task Difficulty and Performance Differences:

      The reviewer suggests that the observed effects may be due to differences in task difficulty, citing lower performance when anticipating auditory targets in the EEG study. This issue was explicitly addressed in our follow-up MEG study, where stimulus difficulty was calibrated. In the second block—used for analysis—accuracy between auditory and visual targets was matched (see Fig. 7B). The replication of our findings under these controlled conditions directly rules out task difficulty as the sole explanation. This point is clearly presented in the manuscript.

      Power Spectrum Analysis:

      The reviewer’s suggestion that our analysis lacks evidence of frequency-specific effects is addressed directly in the manuscript. While we initially used the Hilbert method to track the time course of power fluctuations, we also included spectral analyses to confirm distinct peaks at the stimulation frequencies. Specifically, when averaging over the alpha cluster, we observed a significant difference at 10 Hz between auditory and visual target expectation, with no significant differences at 36 or 40 Hz in that cluster. Conversely, in the sensor cluster showing significant 36 Hz activity, alpha power did not differ, but both 36 Hz and 40 Hz tagging frequencies showed significant effects. These findings clearly demonstrate frequency-specific modulation and are already presented in the manuscript.
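
The distinction between a broadband increase and frequency-specific peaks can be illustrated on a synthetic signal. The sketch below is not the authors' pipeline: the sampling rate, duration, component amplitudes, and the helper `power_at` are assumptions made for the example, and 44 Hz is used only as an untagged comparison frequency, in the spirit of the control mentioned for Supplementary Figure 11.

```python
import cmath
import math


def power_at(signal, freq_hz, sample_rate_hz):
    """Squared amplitude of the single DFT component at freq_hz."""
    n = len(signal)
    coeff = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate_hz)
        for k, x in enumerate(signal)
    )
    return (2 * abs(coeff) / n) ** 2


sample_rate_hz = 500  # assumed sampling rate
duration_s = 2        # assumed window; gives 0.5 Hz frequency resolution
times = [k / sample_rate_hz for k in range(sample_rate_hz * duration_s)]

# Hypothetical sinusoid amplitudes: alpha (10 Hz) and two tagging frequencies.
components = {10: 1.0, 36: 0.5, 40: 0.5}
signal = [
    sum(amp * math.sin(2 * math.pi * freq * t) for freq, amp in components.items())
    for t in times
]

# Power at the frequencies of interest and at an untagged control frequency.
spectrum = {freq: power_at(signal, freq, sample_rate_hz) for freq in (10, 36, 40, 44)}
for freq, power in spectrum.items():
    print(f"{freq} Hz: {power:.3f}")
```

With a window long enough to separate 36, 40, and 44 Hz, power appears only at the frequencies that are actually present, which is the signature that distinguishes tagged responses from a uniform broadband shift.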

      (2) For visual target discrimination, behavioral performance with and without the distractor is not statistically different. Moreover, the reaction time is faster with distractor. Is there any evidence that the added auditory signal was actually distracting?

      We appreciate the reviewer’s observation regarding the lack of a statistically significant difference in behavioral performance for visual target discrimination with and without the auditory distractor. While this was indeed the case in our EEG experiment, we believe the absence of an accuracy effect may be attributable to a ceiling effect, as overall visual performance approached 100%. This high baseline likely masked any subtle influence of the distractor.

      To directly address the question of whether the auditory signal was distracting, we conducted a follow-up MEG experiment. In this study, we observed a significant reduction in visual accuracy during the second block when the distractor was present (see Fig. 7B and Suppl. Fig. 1B), providing clear evidence of a distractor cost under conditions where performance was not saturated.

      Regarding the faster reaction times observed in the presence of the auditory distractor, this phenomenon is consistent with prior findings on intersensory facilitation. Auditory stimuli, which are processed more rapidly than visual stimuli, can enhance response speed to visual targets—even when the auditory input is non-informative or nominally distracting (Nickerson, 1973; Diederich & Colonius, 2008; Salagovic & Leonard, 2021). Thus, while the auditory signal may facilitate motor responses, it can simultaneously impair perceptual accuracy, depending on task demands and baseline performance levels.

      Taken together, our data suggest that the auditory signal does exert a distracting influence, particularly under conditions where visual performance is not at ceiling. The dual effect—facilitated reaction time but reduced accuracy—highlights the complexity of multisensory interactions and underscores the importance of considering both behavioral and neurophysiological measures.

      (3) It is possible that alpha does suppress task-irrelevant stimuli, but only when it is distracting. In other words, perhaps alpha only suppresses distractors that are presented simultaneously with the target. Since the authors did not test this, they cannot irrefutably reject the alpha inhibition hypothesis.

      The reviewer’s claim that we did not test whether alpha suppresses distractors presented simultaneously with the target is incorrect. As stated in the manuscript and supported by our data (see point 2), auditory distractors were indeed presented concurrently with visual targets, and they were demonstrably distracting. Therefore, the scenario the reviewer suggests was not only tested—it forms a core part of our design.

      Furthermore, it was never our intention to irrefutably reject the alpha inhibition hypothesis. Rather, our aim was to revise and expand it. If our phrasing implied otherwise, we have now clarified this in the manuscript. Specifically, we propose that alpha oscillations:

      (a) Exhibit cyclic inhibitory and excitatory dynamics;

      (b) Regulate processing by modulating transfer pathways, which can result in either inhibition or facilitation depending on the network context.

      In our study, we did not observe suppression of distractor transfer, likely due to the engagement of a supramodal system that enhances both auditory and visual excitability. This interpretation is supported by prior findings (e.g., Jacoby et al., 2012), which show increased visual SSEPs under auditory task load, and by Zhigalov et al. (2020), who found no trial-by-trial correlation between alpha power and visual tagging in early visual areas, despite a general association with attention.

      Recent evidence (Clausner et al., 2024; Yang et al., 2024) further supports the notion that alpha oscillations serve multiple functional roles depending on the network involved. These roles include intra- and inter-cortical signal transmission, distractor inhibition, and enhancement of downstream processing (Scheeringa et al., 2012; Bastos et al., 2015; Zumer et al., 2014). We believe the most plausible account is that alpha oscillations support both functions, depending on context.

      To reflect this more clearly, we have updated Figure 1 to present a broader signal-transfer framework for alpha oscillations, beyond the specific scenario tested in this study.

      We have now revised Figure 1 and several sentences in the introduction and discussion, to clarify this argument.

      L35-37: Previous research gave rise to the prominent alpha inhibition hypothesis, which suggests that oscillatory activity in the alpha range (~10 Hz) plays a mechanistic role in selective attention through functional inhibition of irrelevant cortical areas (see Fig. 1; Foxe et al., 1998; Jensen & Mazaheri, 2010; Klimesch et al., 2007).

      L60-65: In contrast, we propose that functional and inhibitory effects of alpha modulation, such as distractor inhibition, are exhibited through blocking or facilitating signal transmission to higher order areas (Peylo et al., 2021; Yang et al., 2023; Zhigalov & Jensen, 2020; Zumer et al., 2014), gating feedforward or feedback communication between sensory areas (see Fig. 1; Bauer et al., 2020; Haegens et al., 2015; Uemura et al., 2021).

      L482-485: This suggests that responsiveness of the visual stream was not inhibited when attention was directed to auditory processing and was not inhibited by occipital alpha activity, which directly contradicts the proposed mechanism behind the alpha inhibition hypothesis.

      L517-519: Top-down cued changes in alpha power have now been widely viewed to play a functional role in directing attention: the processing of irrelevant information is attenuated by increasing alpha power in areas involved with processing this information (Foxe, Simpson, & Ahlfors, 1998; Hanslmayr et al., 2007; Jensen & Mazaheri, 2010).

      L566-569: As such, it is conceivable that alpha oscillations can in some cases inhibit local transmission, while in other cases, depending on network location, connectivity and demand, alpha oscillation can facilitate signal transmission. This mechanism allows to increase transmission of relevant information and to block transmission of distractors.

      (4) In the abstract and Figure 1, the authors claim an alternative function for alpha oscillations; that alpha "orchestrates signal transmission to later stages of the processing stream." In support, the authors cite their result showing that increased alpha activity originating from early visual cortex is related to enhanced visual processing in higher visual areas and association areas. This does not constitute a strong support for the alternative hypothesis. The correlation between posterior alpha power and frequency-tagged activity was not specific in any way; Fig. 10 shows that the correlation appeared on both 1) anticipating-auditory and anticipating-visual trials, 2) the visual tagged frequency and the auditory tagged activity, and 3) was not specific to the visual processing stream. Thus, the data is more parsimonious with a correlation than a causal relationship between posterior alpha and visual processing.

      Again, the reviewer raises important points, which we address below.

      The correlation between posterior alpha power and frequency-tagged activity was not specific, as it is present both when auditory and visual targets are expected:

      If there is a connection between posterior alpha activity and higher-order visual information transfer, then it can be expected that this relationship remains across conditions and that a higher alpha activity is accompanied by higher frequency-tagged activity, both over trials and over conditions. However, it is possible that when alpha activity is lower, such as when expecting a visual target, the signal-to-noise ratio is affected, which may lead to higher difficulty to find a correlation effect in the data when using non-invasive measurements.

      The connection between alpha activity and frequency-tagged activity appears for both auditory and visual stimuli, and the correlation is not specific to the visual processing stream:

      While we do see differences between conditions (e.g. in the EEG-analysis, mostly 36 Hz correlated with alpha activity and only in one condition 40 Hz showed a correlation as well), it is true that in our MEG analysis, we found correlations both between alpha activity and 36 Hz as well as alpha activity and 40 Hz.  

      We acknowledge that when analysing frequency-tagged activity on a trial-by-trial basis, where removal of non-time-locked activity through averaging (which we did when we tested for condition differences in Fig. 4 and 9) is not possible, there is uncertainty in the data. Baseline correction can alleviate this issue, but it cannot rule out non-specific effects. We therefore repeated the analysis using power calculated with a fast Fourier transform instead of the Hilbert power, which gives a higher and stricter frequency resolution; because we averaged over a time period, the time domain was not relevant for this analysis. In this more conservative analysis, only 36 Hz tagged activity when expecting an auditory target correlated with early visual alpha activity.
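
      To illustrate the point about frequency resolution, the following is a minimal sketch and not the analysis code used for the manuscript. It estimates power at a single tagged frequency from one Fourier coefficient; the sampling rate, window length and synthetic trial are assumptions chosen for the example.

```python
import cmath
import math


def power_at_frequency(signal, fs, freq):
    """Power of `signal` at `freq` (Hz), from a single DFT coefficient."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * math.pi * freq * k / fs)
                for k, x in enumerate(signal))
    return (abs(coeff) / n) ** 2


# Hypothetical trial: 1 s sampled at 1000 Hz, containing only a 36 Hz response.
fs = 1000
trial = [math.sin(2 * math.pi * 36 * k / fs) for k in range(fs)]

p36 = power_at_frequency(trial, fs, 36)  # power at the tagged frequency
p40 = power_at_frequency(trial, fs, 40)  # power at the other tagging frequency
```

      With a window of T seconds, Fourier coefficients are spaced 1/T Hz apart, so in this toy example the 36 Hz and 40 Hz estimates do not leak into each other.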

      Additionally, we added correlation analyses between alpha activity and frequency-tagged activity within early visual areas, using the sensor cluster which showed significant condition differences in alpha activity. Here, no correlations between frequency-tagged activity and alpha activity could be found (apart from a small correlation with 40 Hz which could not be confirmed by a median split; see SUPPL Fig. 14 C). The absence of a significant correlation between early visual alpha and frequency-tagged activity has previously been described by others (Zhigalov & Jensen, 2020), and a Bayes factor below 1 also indicated that the alternative hypothesis is unlikely.
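
      As a rough sketch of what a trial-wise correlation and a median-split check involve (the per-trial values below are invented, and this is not the pipeline used in the study):

```python
from statistics import mean, median


def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def median_split_difference(alpha, tagged):
    """Mean tagged power in high-alpha trials minus that in low-alpha trials."""
    cut = median(alpha)
    high = [t for a, t in zip(alpha, tagged) if a > cut]
    low = [t for a, t in zip(alpha, tagged) if a <= cut]
    return mean(high) - mean(low)


# Hypothetical per-trial values in arbitrary units.
alpha_power = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tagged_power = [0.9, 2.1, 2.9, 4.2, 5.1, 5.8]

r = pearson_r(alpha_power, tagged_power)
diff = median_split_difference(alpha_power, tagged_power)
```

      A correlation that is "confirmed" by the median split would show a difference between the two halves in the same direction as the sign of r.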

      Nonetheless, a correlation with the auditory signal is possible and could be explained in different ways. For example, it could be that very early auditory feedback in early visual cortex (see for example Brang et al., 2022) is transmitted alongside visual information to higher-order areas. Several studies have shown that alpha activity and visual as well as auditory processing are closely linked (Bauer et al., 2020; Popov et al., 2023). Inference on whether or how this link could play out in the case of this manuscript is beyond the scope of this study.

      To summarize, we believe the fact that 36 Hz activity within early visual areas does not correlate with alpha activity on a trial-by-trial basis, but that 36 Hz activity in other areas does, provides strong evidence that alpha activity affects down-stream signal processing.

      We mention this analysis now in our discussion:

      L533-536: Our data provides evidence in favour of this view, as we can show that early sensory alpha activity does not covary over trials with SSEP magnitude in early visual areas, but covaries instead over trials with SSEP magnitude in higher order sensory areas (see also SUPPL. Fig. 14).

      Reviewer #1 (Recommendations for the authors):

      The evidence for the alternative hypothesis, that alpha in early sensory areas orchestrates downstream signal transmission, is not strong enough to be described up front in the abstract and Figure 1. I would leave it in the Discussion section, but advise against mentioning it in the abstract and Figure 1.

      We appreciate the reviewer’s concern regarding the inclusion of the alternative hypothesis—that alpha activity in early sensory areas orchestrates downstream signal transmission—in the abstract and Figure 1. While we agree that this interpretation is still developing, recent studies (Keitel et al., 2025; Clausner et al., 2024; Yang et al., 2024) provide growing support for this framework.

      In response, we have revised the introduction, discussion, and Figure 1 to clarify that our intention is not to outright dismiss the alpha inhibition hypothesis, but to refine and expand it in light of new data. This revision does not invalidate the prior literature on alpha timing and inhibition; rather, it proposes an updated mechanism that may better account for observed effects.

      We have, however, retained Figure 1, as it visually contextualizes the broader theoretical landscape, while at the same time adding further analyses to strengthen the empirical support for this emerging view.

      References:

      Bastos, A. M., Litvak, V., Moran, R., Bosman, C. A., Fries, P., & Friston, K. J. (2015). A DCM study of spectral asymmetries in feedforward and feedback connections between visual areas V1 and V4 in the monkey. NeuroImage, 108, 460–475. https://doi.org/10.1016/j.neuroimage.2014.12.081

      Bauer, A. R., Debener, S., & Nobre, A. C. (2020). Synchronisation of Neural Oscillations and Cross-modal Influences. Trends in cognitive sciences, 24(6), 481–495. https://doi.org/10.1016/j.tics.2020.03.003

      Brang, D., Plass, J., Sherman, A., Stacey, W. C., Wasade, V. S., Grabowecky, M., Ahn, E., Towle, V. L., Tao, J. X., Wu, S., Issa, N. P., & Suzuki, S. (2022). Visual cortex responds to sound onset and offset during passive listening. Journal of neurophysiology, 127(6), 1547–1563. https://doi.org/10.1152/jn.00164.2021

      Clausner, T., Marques, J., Scheeringa, R., & Bonnefond, M. (2024). Feature specific neuronal oscillations in cortical layers. bioRxiv, 2024.07.31.605816. https://doi.org/10.1101/2024.07.31.605816

      Diederich, A., & Colonius, H. (2008). When a high-intensity "distractor" is better then a low-intensity one: modeling the effect of an auditory or tactile nontarget stimulus on visual saccadic reaction time. Brain research, 1242, 219–230. https://doi.org/10.1016/j.brainres.2008.05.081

      Haegens, S., Nácher, V., Luna, R., Romo, R., & Jensen, O. (2011). α-Oscillations in the monkey sensorimotor network influence discrimination performance by rhythmical inhibition of neuronal spiking. Proceedings of the National Academy of Sciences of the United States of America, 108(48), 19377–19382. https://doi.org/10.1073/pnas.1117190108

      Jacoby, O., Hall, S. E., & Mattingley, J. B. (2012). A crossmodal crossover: opposite effects of visual and auditory perceptual load on steady-state evoked potentials to irrelevant visual stimuli. NeuroImage, 61(4), 1050–1058. https://doi.org/10.1016/j.neuroimage.2012.03.040

      Keitel, A., Keitel, C., Alavash, M., Bakardjian, K., Benwell, C. S. Y., Bouton, S., Busch, N. A., Criscuolo, A., Doelling, K. B., Dugue, L., Grabot, L., Gross, J., Hanslmayr, S., Klatt, L.-I., Kluger, D. S., Learmonth, G., London, R. E., Lubinus, C., Martin, A. E., … Kotz, S. A. (2025). Brain rhythms in cognition – controversies and future directions. ArXiv. https://doi.org/10.48550/arXiv.2507.15639

      Nickerson, R. S. (1973). Intersensory facilitation of reaction time: energy summation or preparation enhancement? Psychological review, 80(6), 489–509. https://doi.org/10.1037/h0035437

      Popov, T., Gips, B., Weisz, N., & Jensen, O. (2023). Brain areas associated with visual spatial attention display topographic organization during auditory spatial attention. Cerebral cortex (New York, N.Y. : 1991), 33(7), 3478–3489. https://doi.org/10.1093/cercor/bhac285

      Salagovic, C. A., & Leonard, C. J. (2021). A nonspatial sound modulates processing of visual distractors in a flanker task. Attention, perception & psychophysics, 83(2), 800–809. https://doi.org/10.3758/s13414-020-02161-5

      Scheeringa, R., Petersson, K. M., Kleinschmidt, A., Jensen, O., & Bastiaansen, M. C. (2012). EEG α power modulation of fMRI resting-state connectivity. Brain connectivity, 2(5), 254–264. https://doi.org/10.1089/brain.2012.0088

      Spaak, E., Bonnefond, M., Maier, A., Leopold, D. A., & Jensen, O. (2012). Layer-specific entrainment of γ-band neural activity by the α rhythm in monkey visual cortex. Current biology : CB, 22(24), 2313–2318. https://doi.org/10.1016/j.cub.2012.10.020

      Yang, X., Fiebelkorn, I. C., Jensen, O., Knight, R. T., & Kastner, S. (2024). Differential neural mechanisms underlie cortical gating of visual spatial attention mediated by alpha-band oscillations. Proceedings of the National Academy of Sciences of the United States of America, 121(45), e2313304121. https://doi.org/10.1073/pnas.2313304121

      Zhigalov, A., & Jensen, O. (2020). Alpha oscillations do not implement gain control in early visual cortex but rather gating in parieto-occipital regions. Human brain mapping, 41(18), 5176–5186. https://doi.org/10.1002/hbm.25183

      Zumer, J. M., Scheeringa, R., Schoffelen, J. M., Norris, D. G., & Jensen, O. (2014). Occipital alpha activity during stimulus processing gates the information flow to object-selective cortex. PLoS biology, 12(10), e1001965. https://doi.org/10.1371/journal.pbio.1001965

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      We would like to thank all the reviewers for their valuable comments and criticisms. We have thoroughly revised the manuscript and the resource to address all the points raised by the reviewers. Below, we provide a point-by-point response for the sake of clarity.

      Reviewer #1

      Evidence, reproducibility and clarity

      Summary: This manuscript, "MAVISp: A Modular Structure-Based Framework for Protein Variant Effects," presents a significant new resource for the scientific community, particularly in the interpretation and characterization of genomic variants. The authors have developed a comprehensive and modular computational framework that integrates various structural and biophysical analyses, alongside existing pathogenicity predictors, to provide crucial mechanistic insights into how variants affect protein structure and function. Importantly, MAVISp is open-source and designed to be extensible, facilitating reuse and adaptation by the broader community.

      Major comments: - While the manuscript is formally well-structured (with clear Introduction, Results, Conclusions, and Methods sections), I found it challenging to follow in some parts. In particular, the Introduction is relatively short and lacks a deeper discussion of the state-of-the-art in protein variant effect prediction. Several methods are cited but not sufficiently described, as if prior knowledge were assumed. OPTIONAL: Extend the Introduction to better contextualize existing approaches (e.g., AlphaMissense, EVE, ESM-based predictors) and clarify what MAVISp adds compared to each.

      We have expanded the introduction on the state-of-the-art of protein variant effects predictors, explaining how MAVISp departs from them.

      - The workflow is summarized in Figure 1(b), which is visually informative. However, the narrative description of the pipeline is somewhat fragmented. It would be helpful to describe in more detail the available modules in MAVISp, and which of them are used in the examples provided. Since different use cases highlight different aspects of the pipeline, it would be useful to emphasize what is done step-by-step in each.

      We have added a concise, narrative description of the data flow for MAVISp, as well as improved the description of modules in the main text. We will integrate the results section with a more comprehensive description of the available modules, and then clarify in the case studies which modules were applied to achieve specific results.

      OPTIONAL: Consider adding a table or a supplementary figure mapping each use case to the corresponding pipeline steps and modules used.

      We have added a supplementary table (Table S2) to guide the reader on the modules and workflows applied for each case study.

      We also added Table S1 to map the toolkit used by MAVISp to collect the data that are imported and aggregated in the webserver for further guidance.

      - The text contains numerous acronyms, some of which are not defined upon first use or are only mentioned in passing. This affects readability. OPTIONAL: Define acronyms upon first appearance, and consider moving less critical technical details (e.g., database names or data formats) to the Methods or Supplementary Information. This would greatly enhance readability.

      We revised the usage of acronyms, following the reviewer’s suggestion to define them at first appearance.

      • The code and trained models are publicly available, which is excellent. The modular design and use of widely adopted frameworks (PyTorch and PyTorch Geometric) are also strong points. However, the Methods section could benefit from additional detail regarding feature extraction and preprocessing steps, especially the structural features derived from AlphaFold2 models. OPTIONAL: Include a schematic or a table summarizing all feature types, their dimensionality, and how they are computed.

      We thank the reviewer for noticing and praising the availability of the tools of MAVISp. Our MAVISp framework utilizes methods and scores that incorporate machine learning features (such as EVE or RaSP), but does not employ machine learning itself. Specifically, we do not use PyTorch and do not utilize features in a machine learning sense. We do extract some information from the AlphaFold2 models that we use (such as the pLDDT score and their secondary structure content, as calculated by DSSP), and those are available in the MAVISp aggregated csv files for each protein entry and detailed in the Documentation section of the MAVISp website.

      • The section on transcription factors is relatively underdeveloped compared to other use cases and lacks sufficient depth or demonstration of its practical utility. OPTIONAL: Consider either expanding this section with additional validation or removing/postponing it to a future manuscript, as it currently seems preliminary.

      We have removed this section and included a mention in the conclusions as part of the future directions.

      Minor comments: - Most relevant recent works are cited, including EVE, ESM-1v, and AlphaFold-based predictors. However, recent methods like AlphaMissense (Cheng et al., 2023) could be discussed more thoroughly in the comparison.

      We have revised the introduction to accommodate the proper space for this comparison.

      • Figures are generally clear, though some (e.g., performance barplots) are quite dense. Consider enlarging font sizes and annotating key results directly on the plots.

      We have revised Figure 2 so that it presents only one case study, which improves its readability. We have also changed Figure 3, while retaining the other figures as they were, since they seemed less problematic.

      • Minor typographic errors are present. A careful proofreading is highly recommended. Below are some of the issues I identified: Page 3, line 46: "MAVISp perform" -> "MAVISp performs" Page 3, line 56: "automatically as embedded" -> "automatically embedded" Page 3, line 57: "along with to enhance" -> unclear; please revise Page 4, line 96: "web app interfaces with the database and present" -> "presents" Page 6, line 210: "to investigate wheatear" -> "whether" Page 6, lines 215-216: "We have in queue for processing with MAVISp proteins from datasets relevant to the benchmark of the PTM module." -> unclear sentence; please clarify Page 15, line 446: "Both the approaches" -> "Both approaches" Page 20, line 704: "advantage of multi-core system" -> "multi-core systems"

      We have proofread the entire article, including the points above.

      Significance

      General assessment: the strongest aspects of the study are the modularity, open-source implementation, and the integration of structural information through graph neural networks. MAVISp appears to be one of the few publicly available frameworks that can easily incorporate AlphaFold2-based features in a flexible way, lowering the barrier for developing custom predictors. Its reproducibility and transparency make it a valuable resource. However, while the technical foundation is solid and the effort substantial, the scientific narrative and presentation could be significantly improved. The manuscript is dense and hard to follow in places, with a heavy use of acronyms and insufficient explanation of key design choices. Improving the descriptive clarity, especially in the early sections, would greatly enhance the impact of this work.

      Advance

      to the best of my knowledge, this is one of the first modular platforms for protein variant effect prediction that integrates structural data from AlphaFold2 with bioinformatic annotations and even clinical data in an extensible fashion. While similar efforts exist (e.g., ESMfold, AlphaMissense), MAVISp distinguishes itself through openness and design for reusability. The novelty is primarily technical and practical rather than conceptual.

      Audience

      this study will be of strong interest to researchers in computational biology, structural bioinformatics, and genomics, particularly those developing variant effect predictors or analyzing the impact of mutations in clinical or functional genomics contexts. The audience is primarily specialized, but the open-source nature of the tool may diffuse its use among more applied or translational users, including those working in precision medicine or protein engineering.

      Reviewer expertise: my expertise is in computational structural biology, molecular modeling, and (rather weak) machine learning applications in bioinformatics. I am familiar with graph-based representations of proteins, AlphaFold2, and variant effects based on Molecular Dynamics simulations. I do not have any direct expertise in clinical variant annotation pipelines.

      Reviewer #2

      Evidence, reproducibility and clarity

      Summary: The authors present a pipeline and platform, MAVISp, for aggregating, displaying and analysis of variant effects with a focus on reclassification of variants of uncertain clinical significance and uncovering the molecular mechanisms underlying the mutations.

      Major comments: - On testing the platform, I was unable to look-up a specific variant in ADCK1 (rs200211943, R115Q). I found that despite stating that the mapped refseq ID was NP_001136017 in the HGVSp column, it was actually mapped to the canonical UniProt sequence (Q86TW2-1). NP_001136017 actually maps to Q86TW2-3, which is missing residues 74-148 compared to the -1 isoform. The Uniprot canonical sequence has no exact RefSeq mapping, so the HGVSp column is incorrect in this instance. This mapping issue may also affect other proteins and result in incorrect HGVSp identifiers for variants.

      We would like to thank the reviewer for pointing out these inconsistencies. We have revised all the entries and corrected them. If needed, the history of the cases that have been corrected can be found in the closed issues of the GitHub repository that we use for communication between biocurators and data managers (https://github.com/ELELAB/mavisp_data_collection). We have also revised the protocol we follow in this regard and the MAVISp toolkit to include better support for isoform matching in our pipelines for future entries, as well as for the revision/monitoring of existing ones, as detailed in the Method Section. In particular, we introduced a tool, uniprot2refseq, which aids the biocurator in identifying the correct match in terms of sequence length and sequence identity between RefSeq and UniProt. More details are included in the Method Section of the paper. The two relevant scripts for this step are available at: https://github.com/ELELAB/mavisp_accessory_tools/
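
      The kind of check described here (matching on sequence length and sequence identity) can be sketched as follows. This is an illustration only and not the code of uniprot2refseq; the function names, toy sequences and identifiers are made up.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of identical positions; 0.0 if lengths differ (no alignment is attempted)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def find_exact_refseq(uniprot_seq, refseq_candidates):
    """Return the candidate IDs whose sequence has the same length and 100% identity."""
    return [rid for rid, seq in refseq_candidates.items()
            if sequence_identity(uniprot_seq, seq) == 1.0]


# Toy sequences and made-up identifiers, for illustration only.
canonical = "MKTAYIAKQRQISFVKSHFSRQ"
candidates = {
    "NP_TOY_1": canonical,                        # identical isoform
    "NP_TOY_2": canonical[:5] + canonical[12:],   # isoform missing an internal segment
}

matches = find_exact_refseq(canonical, candidates)
```

      An isoform lacking an internal stretch of residues, as in the ADCK1 case raised by the reviewer, fails the length check and is therefore not reported as a match.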

      - The paper lacks a section on how to properly interpret the results of the MAVISp platform (the case-studies are helpful, but don't lay down any global rules for interpreting the results). For example: How should a variant with conflicts between the variant impact predictors be interpreted? Are specific indicators considered more 'reliable' than others?

      We have added a section in Results to clarify how to interpret results from MAVISp in the most common use cases.

      • In the Methods section, GEMME is stated as being rank-normalised with 0.5 as a threshold for damaging variants. On checking the data downloaded from the site, GEMME was not rank-normalised but rather min-max normalised. Furthermore, Supplementary text S4 conflicts with the methods section over how GEMME scores are classified, S4 states that a raw-value threshold of -3 is used.

      We thank the reviewer for spotting this inconsistency. This part of the main text was left over from a previous, preliminary version of the preprint, and we have revised it. Supplementary Text S4 includes the correct reference for the value, in light of the benchmarking reported therein.
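
      For readers unfamiliar with the distinction the reviewer draws, the sketch below contrasts the two normalisation schemes on invented scores (these are not GEMME outputs, and the helper names are hypothetical). It shows that the same 0.5 cut-off selects different sets of variants under each scheme.

```python
def min_max_normalise(scores):
    """Rescale scores linearly to [0, 1]; assumes max(scores) > min(scores)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank_normalise(scores):
    """Map each score to (number of scores <= it) / n, i.e. its rank fraction."""
    n = len(scores)
    return [sum(other <= s for other in scores) / n for s in scores]


# Invented raw scores with one strong outlier.
raw = [-6.0, -1.0, -0.5, 0.0]

mm = min_max_normalise(raw)   # dominated by the outlier at -6.0
rk = rank_normalise(raw)      # evenly spaced regardless of the outlier

above_mm = sum(v > 0.5 for v in mm)
above_rk = sum(v > 0.5 for v in rk)
```

      Because the two schemes place the same variant at different normalised values, a threshold defined for one cannot be reused for the other.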

      • Note. This is a major comment as one of the claims is that the associated web-tool is user-friendly. While functional, the web app is very awkward to use for analysis on any more than a few variants at once. The fixed window size of the protein table necessitates excessive scrolling to reach your protein-of-interest. This will also get worse as more proteins are added. Suggestion: add a search/filter bar. The same applies to the dataset window.

      We have changed the structure of the webserver in such a way that now the whole website opens as its own separate window, instead of being confined within the size permitted by the website at DTU. This solves the fixed window size issue. Hopefully, this will improve the user experience.

      We have refactored the web app by adding filtering functionality, both for the main protein table (that can now be filtered by UniProt AC, gene name or RefSeq ID) and the mutations table. Doing this required a general overhaul of the table infrastructure (we changed the underlying engine that renders the tables).

      • You are unable to copy anything out of the tables.
      • Hyperlinks in the tables only seem to work if you open them in a new tab or window.

      The table overhaul fixed both of these issues.

      • All entries in the reference column point to the MAVISp preprint even when data from other sources is displayed (e.g. MAVE studies).

      We clarified the meaning of the reference column in the Documentation on the MAVISp website, as we realized it had confused the reviewer. The reference column is meant to cite the papers where the computationally-generated MAVISp data are used, not external sources. Since we also have the experimental data module in the most recent release, we have also refactored the MAVISp website by adding a “Datasets and metadata” page, which details metadata for key modules. These include references to data from external sources that we include in MAVISp on a case-by-case basis (for example the results of a MAVE experiment). Additionally, we have verified that the papers using MAVISp data are updated in https://elelab.gitbook.io/mavisp/overview/publications-that-used-mavisp-data and in the csv files of the proteins of interest.

      The publications using MAVISp data that are currently referenced are listed below:

      SMPD1: ASM variants in the spotlight: A structure-based atlas for unraveling pathogenic mechanisms in lysosomal acid sphingomyelinase. Biochim Biophys Acta Mol Basis Dis. PMID 38782304. https://doi.org/10.1016/j.bbadis.2024.167260

      TRAP1: Point mutations of the mitochondrial chaperone TRAP1 affect its functions and pro-neoplastic activity. Cell Death & Disease. PMID 40074754. https://doi.org/10.1038/s41419-025-07467-6

      BRCA2: Saturation genome editing-based clinical classification of BRCA2 variants. Nature. PMID 39779848. https://doi.org/10.1038/s41586-024-08349-1

      TP53, GRIN2A, CBFB, CALR, EGFR: TRAP1 S-nitrosylation as a model of population-shift mechanism to study the effects of nitric oxide on redox-sensitive oncoproteins. Cell Death & Disease. PMID 37085483. https://doi.org/10.1038/s41419-023-05780-6

      KIF5A, CFAP410, PILRA, CYP2R1: Computational analysis of five neurodegenerative diseases reveals shared and specific genetic loci. Computational and Structural Biotechnology Journal. PMID 38022694. https://doi.org/10.1016/j.csbj.2023.10.031

      KRAS: Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep. Brief Bioinform. PMID 39708841. https://doi.org/10.1093/bib/bbae664

      OPTN: Decoding phospho-regulation and flanking regions in autophagy-associated short linear motifs. Communications Biology. PMID 40835742. https://doi.org/10.1038/s42003-025-08399-9

      DLG4, GRB2, SMPD1: Deciphering long-range effects of mutations: an integrated approach using elastic network models and protein structure networks. JMB. PMID 40738203. https://doi.org/10.1016/j.jmb.2025.169359

      Entering multiple mutants in the "mutations to be displayed" window is time-consuming for more than a handful of mutants. Suggestion: Add a box where multiple mutants can be pasted in at once from an external document.

      During the table overhaul, we have revised the user interface to add a text box that allows free copy-pasting of mutation lists. While we understand having a single input box would have been ideal, the former selection interface (which is also still available) doesn’t allow copy-paste. This is a known limitation in Streamlit.
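
      A pasted list of this kind has to be split and validated before it can be used to filter the mutations table. The sketch below shows one way to do so; it is not the MAVISp web app code and involves no Streamlit calls, and the function name, separators and validation rule are assumptions made for illustration.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"^[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]$")


def parse_mutation_list(text):
    """Split free text on commas, semicolons or whitespace; keep tokens such as 'R115Q'."""
    valid, rejected = [], []
    for token in re.split(r"[,;\s]+", text.strip()):
        if not token:
            continue
        candidate = token.upper()
        if MUTATION_RE.match(candidate):
            valid.append(candidate)
        else:
            rejected.append(token)
    return valid, rejected


valid, rejected = parse_mutation_list("E102K, H86D\nT29I; foo  l44p")
```

      Returning the rejected tokens separately lets the interface report which entries were not recognised instead of silently dropping them.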

      Minor comments

      • Grammar. I appreciate that this manuscript may have been compiled by a non-native English speaker, but I would be remiss not to point out that there are numerous grammar errors throughout, usually sentence order issues or non-pluralisation. The meaning of the authors is mostly clear, but I recommend very thoroughly proof-reading the final version.

      We have proofread the final version of the manuscript.

      • There are numerous proteins that I know have high-quality MAVE datasets that are absent in the database e.g. BRCA1, HRAS and PPARG.

      Yes, we are aware of this. It is far from trivial to properly import the datasets from multiplex assays, and they often need to be treated on a case-by-case basis. We are in the process of carefully compiling all the MAVE data locally before releasing them in the public version of the database, which is why they are missing. We are prioritizing the datasets that can be correlated with our predictions of changes in structural stability, and we will then cover the remaining datasets in batches. Having said this, we have checked the datasets for BRCA1, HRAS, and PPARG. We have imported the ones for PPARG and BRCA1 from ProteinGym, referring to the studies published in 10.1038/ng.3700 and 10.1038/s41586-018-0461-z, respectively. For HRAS, after checking both the available data and the literature in detail, we did identify a suitable dataset (10.7554/eLife.27810), but we struggled to determine a sensible cut-off for discriminating between pathogenic and non-pathogenic variants, and so did not include it in the MAVISp dataset for now. We will contact the authors to clarify which thresholds to apply before importing the data.

      • Checking one of the existing MAVE datasets (KRAS), I found that the variants were annotated as damaging, neutral or given a positive score (these appear to stand-in for gain-of-function variants). For better correspondence with the other columns, those with positive scores could be labelled as 'ambiguous' or 'uncertain'.

      In the KRAS case study presented in MAVISp, we utilized the protein abundance dataset reported in (http://dx.doi.org/10.1038/s41586-023-06954-0) and made available in the ProteinGym repository (specifically referenced at https://github.com/OATML-Markslab/ProteinGym/blob/main/reference_files/DMS_substitutions.csv#L153). We adopted the precalculated thresholds as provided by the ProteinGym authors. In this regard, we are not sure whether the reviewer is referring to this dataset or to another one on KRAS.

      • Numerous thresholds are defined for stabilizing / destabilizing / neutral variants in both the STABILITY and the LOCAL_INTERACTION modules. How were these thresholds determined? I note that (PMC9795540) uses a ΔΔG threshold of 1/-1 for defining stabilizing and destabilizing variants, which is relatively standard (though they also say that 2-3 would likely be better for pinpointing pathogenic variants).

      We improved the description of our classification strategies for both modules in the Documentation page of our website. Also, we explained more clearly the possible sources of ‘uncertain’ annotations for the two modules in both the web app (Documentation page) and main text. Briefly, in the STABILITY module, we consider FoldX and either Rosetta or RaSP to achieve a final classification. We first classify one and the other independently, according to the following strategy:

      If DDG ≥ 3, the mutation is Destabilizing. If DDG ≤ −3, the mutation is Stabilizing. If −2 […] We then compare the classifications obtained by the two methods: if they agree, that is the final classification; if they disagree, the final classification is Uncertain. The thresholds were selected based on previous studies, in which variants with changes in stability below 3 kcal/mol did not show a markedly different abundance at the cellular level [10.1371/journal.pgen.1006739, 10.7554/eLife.49138].
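
      The two-step logic (per-method classification followed by a consensus) can be sketched as below. This is not the MAVISp implementation. The ±3 kcal/mol thresholds and the consensus rule are taken from the description above; the neutral band is an assumption, since the corresponding rule is incomplete in the text, and is therefore exposed as a parameter.

```python
def classify_stability(ddg, neutral_cutoff=2.0):
    """Classify the ΔΔG (kcal/mol) predicted by one method.

    The ±3 thresholds come from the text. `neutral_cutoff` is an ASSUMPTION:
    values strictly inside (-neutral_cutoff, neutral_cutoff) are called Neutral,
    and anything between the neutral band and ±3 is called Uncertain.
    """
    if ddg >= 3:
        return "Destabilizing"
    if ddg <= -3:
        return "Stabilizing"
    if -neutral_cutoff < ddg < neutral_cutoff:
        return "Neutral"
    return "Uncertain"


def consensus(label_a, label_b):
    """Final class: the shared label if both methods agree, otherwise Uncertain."""
    return label_a if label_a == label_b else "Uncertain"


# Hypothetical ΔΔG values from two methods for two mutations.
agree = consensus(classify_stability(3.4), classify_stability(3.1))
disagree = consensus(classify_stability(3.4), classify_stability(1.2))
```

      Requiring agreement between two independent predictors trades coverage for specificity: more mutations end up as Uncertain, but fewer are labelled on the strength of a single method.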

      The LOCAL_INTERACTION module works similarly to the STABILITY module, in that Rosetta and FoldX are considered independently, and an implicit classification is performed for each, according to the following rules (values in kcal/mol):

      If DDG > 1, the mutation is Destabilizing. If DDG […] Each mutation is therefore classified by both methods. If the methods agree (i.e., if they classify the mutation in the same way), their consensus is the final classification for the mutation; if they do not agree, the final classification is Uncertain.

      If a mutation does not have an associated free energy value, the relative solvent accessible area is used to classify it: if SAS > 20%, the mutation is classified as Uncertain, otherwise it is not classified.
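
      The fallback for mutations without a free energy value can be sketched as follows. This is an illustration, not the MAVISp code: the function names are hypothetical, and because only the "DDG > 1" rule is legible above, the per-method classifier is passed in by the caller and the example classifier collapses all other cases into a placeholder label.

```python
def local_interaction_label(ddg, relative_sas_percent, classify):
    """Classify a mutation, falling back on solvent accessibility when ddg is missing.

    `ddg` is None when no free energy value is available. In that case the
    mutation is Uncertain if its relative solvent accessible area exceeds 20%,
    and left unclassified (None) otherwise.
    """
    if ddg is not None:
        return classify(ddg)
    return "Uncertain" if relative_sas_percent > 20 else None


def destabilizing_only(ddg):
    # Only the "DDG > 1 -> Destabilizing" rule is taken from the text; the
    # remaining classes are collapsed into a placeholder label (ASSUMPTION).
    return "Destabilizing" if ddg > 1 else "Other (rule not shown)"


exposed = local_interaction_label(None, 35.0, destabilizing_only)
buried = local_interaction_label(None, 10.0, destabilizing_only)
scored = local_interaction_label(1.5, 10.0, destabilizing_only)
```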

      Thresholds here were selected according to best practices followed by the tool authors and, more generally, in the literature, as the reviewer also noted.

      • "Overall, with the examples in this section, we illustrate different applications of the MAVISp results, spanning from benchmarking purposes, using the experimental data to link predicted functional effects with structural mechanisms or using experimental data to validate the predictions from the MAVISp modules."

      The last of these points is not an application of MAVISp, but rather a way in which external data can help validate MAVISp results. Furthermore, none of the examples given demonstrate an application in benchmarking (what is being benchmarked?).

      We have revised the statements to avoid confusing the reader.

      • Transcription factors section. This section describes an intended future expansion to MAVISp, not a current feature, and presents no results. As such, it should be moved to the conclusions/future directions section.

      We have removed this section and included a mention in the conclusions as part of the future directions.

      • Figures. The dot-plots generated by the web app, and in Figures 4, 5 and 6 have 2 legends. After looking at a few, it is clear that the lower legend refers to the colour of the variant on the X-axis - most likely referencing the ClinVar effect category. This is not, however, made clear either on the figures or in the app.

      The reviewer’s interpretation of the second legend is correct: it does refer to the ClinVar classification. Nonetheless, we understand that the positioning of the legend makes it unclear what it refers to. We have revised the captions of the figures in the main text. On the web app, we have changed the location of the figure legend for the ClinVar effect category and added a label to make it clear what the classification refers to.

      • "We identified ten variants reported in ClinVar as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q, Fig.5a)" E25Q is benign in ClinVar and has had that status since first submitted.

      We have corrected this in the text and the statements related to it.

      Significance

      Platforms that aggregate predictors of variant effect are not a new concept, for example dbNSFP is a database of SNV predictions from variant effect predictors and conservation predictors over the whole human proteome. Predictors such as CADD and PolyPhen-2 will often provide a summary of other predictions (their features) when using their platforms. MAVISp's unique angle on the problem is in the inclusion of diverse predictors from each of its different modules, giving a much wider perspective on variants and potentially allowing the user to identify the mechanistic cause of pathogenicity. The visualisation aspect of the web app is also a useful addition, although the user interface is somewhat awkward. Potentially the most valuable aspect of this study is the associated gitbook resource containing reports from biocurators for proteins that link relevant literature and analyse ClinVar variants. Unfortunately, these are only currently available for a small minority of the total proteins in the database with such reports. For improvement, I think that the paper should focus more on the precise utility of the web app / gitbook reports and how to interpret the results rather than going into detail about the underlying pipeline.

      We appreciate the interest in the GitBook resource, which we also see as very valuable and one of the strengths of our work. We have now implemented a new strategy based on a Python script introduced in the mavisp toolkit to generate a template Markdown file of the report that can be further customized and imported into GitBook directly (https://github.com/ELELAB/mavisp_accessory_tools/). This should allow us to streamline the production of more reports. We are currently assigning proteins in batches to biocurators for reporting through the mavisp_data_collection GitHub repository to expand their coverage. We have also revised the text and added a section on the interpretation of results from MAVISp, with a focus on the utility of the web app and reports.

      In terms of audience, the fast look-up and visualisation aspects of the web-platform are likely to be of interest to clinicians in the interpretation of variants of unknown clinical significance. The ability to download the fully processed dataset on a per-protein database would be of more interest to researchers focusing on specific proteins or those taking a broader view over multiple proteins (although a facility to download the whole database would be more useful for this final group).

      While our website only displays the dataset per protein, the whole dataset, including all the MAVISp entries, is available at our OSF repository (https://osf.io/ufpzm/), which is cited in the paper and linked on the MAVISp website. We have further modified the MAVISp database to add a link to the repository in the modes page, so that it is more visible.

      My expertise. - I am a protein bioinformatician with a background in variant effect prediction and large-scale data analysis.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Evidence, reproducibility and clarity:

      Summary:

      The authors present MAVISp, a tool for viewing protein variants heavily based on protein structure information. The authors have done a very impressive amount of curation on various protein targets, and should be commended for their efforts. The tool includes a diverse array of experimental, clinical, and computational data sources that provides value to potential users interested in a given target.

      Major comments:

      Unfortunately I was not able to get the website to work correctly. When selecting a protein target in simple mode, I was greeted with a completely blank page in the app window. In ensemble mode, there was no transition away from the list of targets at all. I'm using Firefox 140.0.2 (64-bit) on Ubuntu 22.04. I would like to explore the data myself and provide feedback on the user experience and utility.

      We tried to reproduce the issue mentioned by the reviewer, using the exact same Ubuntu and Firefox versions, but were unable to do so. The website worked fine for us in that environment. The issue experienced by the reviewer may have been due either to a temporary problem with the web server or to a problem with the specific browser environment they were working in, which we cannot reproduce. It would be useful to know the date on which this happened, to verify whether downtime on the DTU IT services side made the web server inaccessible.

      I have some serious concerns about the sustainability of the project and think that additional clarifications in the text could help. Currently is there a way to easily update a dataset to add, remove, or update a component (for example, if a new predictor is published, an error is found in a predictor dataset, or a predictor is updated)? If it requires a new round of manual curation for each protein to do this, I am worried that this will not scale and will leave the project with many out of date entries. The diversity of software tools (e.g., three different pipeline frameworks) also seems quite challenging to maintain.

      We appreciate the reviewer’s concerns about long-term sustainability. It is a fair point that we consider within our steering group, which oversees and plans the activities and meets monthly. Adding entries to MAVISp is moving more and more towards automation as we grow. We aim to minimize the manual work where applicable. Still, expert intervention is really needed in some of the steps, and we do not want to give it up. We intend to keep working on MAVISp to make the process of adding and updating entries as automated as possible, and to streamline the process when manual intervention is necessary. From the point of view of the biocurators, they have three core workflows to use for the default modules, which also automatically cover the source of annotations. We are currently working to streamline the procedures behind LOCAL_INTERACTION, which is the most challenging one. On the data managers' and maintainers' side, we have workflows and protocols that help us in terms of automation, quality control, etc., and we keep working to improve them. Among these, we have workflows for updating old entries. As an example, the update of erroneously attributed RefSeq data (pointed out by reviewer 2) took us only one week overall (from assigning revisions to importing into the database) because we have a reduced version of the Snakemake workflow that can act on only the affected modules. In addition, we have streamlined the generation of the templates for the GitBook reports (see also the answer to reviewer 2).

      The update of old entries is planned and carried out regularly. We also deposit the old datasets on OSF for transparency, in case someone needs to navigate and explore the changes. We have activities planned between May and August every year to update the old entries in relation to changes of protocols in the modules and updates in the core databases that we interact with (COSMIC, ClinVar, etc.). In case of major changes, the update activities continue in the fall. Other revisions can happen outside these time windows if an entry is needed for a specific research project and requires updating.

      Furthermore, the community of people contributing to MAVISp as biocurators or developers is growing, and we have scientists contributing from other groups in relation to their research interests. We envision that, for this resource to scale up, our team cannot be the only one producing data and depositing it to the database. To facilitate this, we launched a pilot online training event (see the Event page on the website), and we will repeat it once per year. We also organize regular meetings with all the active curators and developers to plan the activities in a sustainable manner and address the challenges we encounter.

      As stated in the manuscript, with the current team, automation, and resources that we have gathered around this initiative, we can provide updates to the public database every third month, and we have been regularly satisfied with them. Additionally, we are capable of processing 20 to 40 proteins every month, depending also on the need for revision or expansion of analyses on existing proteins. We also depend on these data for our own research projects and are fully committed to it.

      Additionally, we are planning future activities in these directions to improve scale up and sustainability:

      • Streamlining manual steps so that they are as convenient and fast as possible for our curators, e.g. by providing custom pages on the MAVISp website
      • Streamlining and automating the generation of useful output, for instance the reports, by using a combination of simple automation and large language models
      • Implementing ways to share our software and scripts with third parties, for instance by providing ready-made (or nearly ready-made) containers or virtual machines
      • For a future version 2, if the database grows in a direction that is not compatible with Streamlit, the web data science framework we are currently using, we will rewrite the website using a framework that allows better flexibility and performance, for instance Django with a proper database backend.

      On the same theme, according to the GitHub repository, the program relies on Python 3.9, which reaches end of life in October 2025. It has been tested against Ubuntu 18.04, which left standard support in May 2023. The authors should update the software to more modern versions of Python to promote the long-term health and maintainability of the project.

      We thank the reviewer for this comment - we are aware of the upcoming EOL of Python 3.9. We tested MAVISp, both software package and web server, using Python 3.10 (which is the minimum supported version going forward) and Python 3.13 (which is the latest stable release at the time of writing) and updated the instructions in the README file on the MAVISp GitHub repository accordingly.

      We plan on keeping track of Python and library versions during our testing and updating them when necessary. In the future, we also plan to deploy Continuous Integration with automated testing for our repository, making this process easier and more standardized.

      I appreciate that the authors have made their code and data available. These artifacts should also be versioned and archived in a service like Zenodo, so that researchers who rely on or want to refer to specific versions can do so in their own future publications.

      Since 2024, we have been reporting all previous versions of the dataset on OSF, the repository linked to the MAVISp website, at https://osf.io/ufpzm/files/osfstorage (folder: previous_releases). We prefer to keep everything under OSF, as we also use it to deposit, for example, the MD trajectory data.

      Additionally, on the GitHub page that we use as a space for interaction among biocurators, developers, and data managers within the MAVISp community, we report all the changes in the NEWS space: https://github.com/ELELAB/mavisp_data_collection

      Finally, the individual tools are all available in our GitHub repository, where version control is in place (see Table S1, where we have now mapped all the resources used in the framework).

      In the introduction of the paper, the authors conflate the clinical challenges of variant classification with evidence generation and it's quite muddled together. They should strongly consider splitting the first paragraph into two paragraphs - one about challenges in variant classification/clinical genetics/precision oncology and another about variant effect prediction and experimental methods. The authors should also note that there are many predictors other than AlphaMissense, and may want to cite the ClinGen recommendations (PMID: 36413997) in the intro instead.

      We revised the introduction in light of these suggestions. We have split the paragraph as recommended and added a longer second paragraph about VEPs and using structural data in the context of VEPs. We have also added the citation that the reviewer kindly recommended.

      Also in the introduction on lines 21-22 the authors assert that "a mechanistic understanding of variant effects is essential knowledge" for a variety of clinical outcomes. While this is nice, it is clearly not the case as we can classify variants according to the ACMG/AMP guidelines without any notion of specific mechanism (for example, by combining population frequency data, in silico predictor data, and functional assay data). The authors should revise the statement so that it's clear that mechanistic understanding is a worthy aspiration rather than a prerequisite.

      We revised the statement in light of this comment from the reviewer.

      In the structural analysis section (page 5, lines 154-155 and elsewhere), the authors define cutoffs with convenient round numbers. Is there a citation for these values or were these arbitrarily chosen by the authors? I would have liked to see some justification that these assignments are reasonable. Also there seems to be an error in the text where values between -2 and -3 kcal/mol are not assigned to a bin (I assume they should also be uncertain). There are other similar seemingly-arbitrary cutoffs later in the section that should also be explained.

      We have revised the text making the two intervals explicit, for better clarity.

      On page 9, lines 294-298 the authors talk about using the PTEN data from ProteinGym, rather than the actual cutoffs from the paper. They get to the latter later on, but I'm not sure why this isn't first? The ProteinGym cutoffs are somewhat arbitrarily based on the median rather than expert evaluation of the dataset, and I'm not sure why it's even worth mentioning them when proper classifications are available. Regarding PTEN, it would be quite interesting to see a comparison of the VAMP-seq PTEN data and the Mighell phosphatase assay, which is cited on page 9 line 288 but is not actually a VAMP-seq dataset. I think this section could be interesting but it requires some additional attention.

      We have included the data from Mighell’s phosphatase assay, as provided by MaveDB, in the MAVISp database within the experimental_data module for PTEN. We have also revised the case study to include them and to better explain the decision to support both the ProteinGym and MaveDB classifications in MAVISp (when available). See the revised Figure 3, Table 1 and the corresponding text.

      The authors mention "pathogenicity predictors" and otherwise use pathogenicity incorrectly throughout the manuscript. Pathogenicity is a classification for a variant after it has been curated according to a framework like the ACMG/AMP guidelines (Richards 2015 and amendments). A single tool cannot predict or assign pathogenicity - the AlphaMissense paper was wrong to use this nomenclature and these authors should not compound this mistake. These predictors should be referred to as "variant effect predictors" or similar, and they are able to produce evidence towards pathogenicity or benignity but not make pathogenicity calls themselves. For example, in Figure 4e, the terms "pathogenic" and "benign" should only be used here if these are the classifications the authors have derived from ClinVar or a similar source of clinically classified variants.

      The reviewer is correct. We have revised the terminology used in the manuscript and now refer to VEPs (Variant Effect Predictors).

      Minor comments:

      The target selection table on the website needs some kind of text filtering option. It's very tedious to have to find a protein by scrolling through the table rather than typing in the symbol. This will only get worse as more datasets are added.

      We have revised the website, adding a filtering option. In detail, we have refactored the web app by adding filtering functionality, both for the main protein table (that can now be filtered by UniProt AC, gene name, or RefSeq ID) and the mutations table. Doing this required a general overhaul of the table infrastructure (we changed the underlying engine that renders the tables).

      The data sources listed on the data usage section of the website are not concordant with what is in the paper. For example, MaveDB is not listed.

      We have revised and updated the data sources on the website, adding a metadata section with relevant information, including MaveDB references where applicable.

      Figure 2 is somewhat confusing, as it partially interleaves results from two different proteins. This would be nicer as two separate figures, one on each protein, or just of a single protein.

      As suggested by the reviewer, we have now revised the figure and corresponding legends and text, focusing only on one of the two proteins.

      Figure 3 panel b is distractingly large and I wonder if the authors could do a little bit more with this visualization.

      We have revised Figure 3 to resolve these issues and to integrate new data from the comparison with the phosphatase assay.

      Capitalization is inconsistent throughout the manuscript. For example, page 9 line 288 refers to VampSEQ instead of VAMP-seq (although this is correct elsewhere). MaveDB is referred to as MAVEdb or MAVEDB in various places. AlphaMissense is referred to as Alphamissense in the Figure 5 legend. The authors should make a careful pass through the manuscript to address this kind of issues.

      We have carefully proofread the paper for these inconsistencies.

      MaveDB has a more recent paper (PMID: 39838450) that should be cited instead of/in addition to Esposito et al.

      We have added the reference that the reviewer recommended.

      On page 11, lines 338-339 the authors mention some interesting proteins including BCL2, which has base editor data available (PMID: 35288574). Are there plans to incorporate this type of functional assay data into MAVISp?

      The assay mentioned in the paper refers to an experimental setup designed to investigate mutations that may confer resistance to the drug venetoclax. We have taken the first steps to implement a MAVISp module aimed at evaluating the impact of mutations on drug binding using alchemical free energy perturbations (ensemble mode), but it is far from complete. We expect to import these data once the module is finalized, since they can be used to benchmark it, and BCL2 is one of the proteins that we are using to develop and test the new module.

      Reviewer #3 (Significance (Required)):

      Significance:

      General assessment:

      This is a nice resource and the authors have clearly put a lot of effort in. They should be celebrated for their achievements in curating the diverse datasets, and the GitBooks are a nice approach. However, I wasn't able to get the website to work and I have raised several issues with the paper itself that I think should be addressed.

      Advance:

      New ways to explore and integrate complex data like protein structures and variant effects are always interesting and welcome. I appreciate the effort towards manual curation of datasets. This work is very similar in theme to existing tools like Genomics 2 Proteins portal (PMID: 38260256) and ProtVar (PMID: 38769064). Unfortunately as I wasn't able to use the site I can't comment further on MAVISp's position in the landscape.

      We have expanded the conclusions section to add a comparison and cite previously published work, and linked to a review we published last year that frames MAVISp in the context of computational frameworks for the prediction of variant effects. In brief, the Genomics 2 Proteins portal (G2P) includes data from several sources, including some overlapping with MAVISp such as Phosphosite or MaveDB, as well as features calculated on the protein structure. ProtVar also aggregates mutations from different sources and includes both variant effect predictors and predictions of changes in stability upon mutation, as well as predictions of complex structures. These approaches only partially overlap with MAVISp. G2P is primarily focused on structural and other annotations of the effect of a mutation; it doesn’t include features about changes of stability, binding, or long-range effects, and doesn’t attempt to classify the impact of a mutation according to its measurements. It also doesn’t include information on protein dynamics. Similarly, ProtVar does not include information on binding free energies, long-range effects, or protein dynamics.

      Audience:

      MAVISp could appeal to a diverse group of researchers who are interested in the biology or biochemistry of proteins that are included, or are interested in protein variants in general either from a computational/machine learning perspective or from a genetics/genomics perspective.

      My expertise:

      I am an expert in high-throughput functional genomics experiments and am an experienced computational biologist with software engineering experience.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #3

      Evidence, reproducibility and clarity

      Summary:

      The authors present MAVISp, a tool for viewing protein variants heavily based on protein structure information. The authors have done a very impressive amount of curation on various protein targets, and should be commended for their efforts. The tool includes a diverse array of experimental, clinical, and computational data sources that provides value to potential users interested in a given target.

      Major comments:

      Unfortunately I was not able to get the website to work properly. When selecting a protein target in simple mode, I was greeted with a completely blank page in the app window, and in ensemble mode, there was no transition away from the list of targets at all. I'm using Firefox 140.0.2 (64-bit) on Ubuntu 22.04. I would have liked to be able to explore the data myself and provide feedback on the user experience and utility.

      I have some serious concerns about the sustainability of the project and think that additional clarifications in the text could help. Currently is there a way to easily update a dataset to add, remove, or update a component (for example, if a new predictor is published, an error is found in a predictor dataset, or a predictor is updated)? If it requires a new round of manual curation for each protein to do this, I am worried that this will not scale and will leave the project with many out of date entries. The diversity of software tools (e.g., three different pipeline frameworks) also seems quite challenging to maintain.

      On the same theme, according to the GitHub repository, the program relies on Python 3.9, which reaches end of life in October 2025. It has been tested against Ubuntu 18.04, which left standard support in May 2023. The authors should update the software to more modern versions of Python to promote the long-term health and maintainability of the project.

      I appreciate that the authors have made their code and data available. These artifacts should also be versioned and archived in a service like Zenodo, so that researchers who rely on or want to refer to specific versions can do so in their own future publications.

      In the introduction of the paper, the authors conflate the clinical challenges of variant classification with evidence generation and it's quite muddled together. They should strongly consider splitting the first paragraph into two paragraphs - one about challenges in variant classification/clinical genetics/precision oncology and another about variant effect prediction and experimental methods. The authors should also note that there are many predictors other than AlphaMissense, and may want to cite the ClinGen recommendations (PMID: 36413997) in the intro instead.

      Also in the introduction on lines 21-22 the authors assert that "a mechanistic understanding of variant effects is essential knowledge" for a variety of clinical outcomes. While this is nice, it is clearly not the case as we are able to classify variants according to the ACMG/AMP guidelines without any notion of specific mechanism (for example, by combining population frequency data, in silico predictor data, and functional assay data). The authors should revise the statement so that it's clear that mechanistic understanding is a worthy aspiration rather than a prerequisite.

      In the structural analysis section (page 5, lines 154-155 and elsewhere), the authors define cutoffs with convenient round numbers. Is there a citation for these values or were these arbitrarily chosen by the authors? I would have liked to see some justification that these assignments are reasonable. Also there seems to be an error in the text where values between -2 and -3 kcal/mol are not assigned to a bin (I assume they should also be uncertain). There are other similar seemingly-arbitrary cutoffs later in the section that should also be explained.

      On page 9, lines 294-298 the authors talk about using the PTEN data from ProteinGym, rather than the actual cutoffs from the paper. They get to the latter later on, but I'm not sure why this isn't first? The ProteinGym cutoffs are somewhat arbitrarily based on the median rather than expert evaluation of the dataset and I'm not sure why it's even worth mentioning them when proper classifications are available. Regarding PTEN, it would be quite interesting to see a comparison of the VAMP-seq PTEN data and the Mighell phosphatase assay, which is cited on page 9 line 288 but is not actually a VAMP-seq dataset. I think this section could be interesting but it requires some additional attention.

      The authors mention "pathogenicity predictors" and otherwise use pathogenicity incorrectly throughout the manuscript. Pathogenicity is a classification for a variant after it has been curated according to a framework like the ACMG/AMP guidelines (Richards 2015 and amendments). A single tool cannot predict or assign pathogenicity - the AlphaMissense paper was wrong to use this nomenclature and these authors should not compound this mistake. These predictors should be referred to as "variant effect predictors" or similar, and they are able to produce evidence towards pathogenicity or benignity but not make pathogenicity calls themselves. For example, in Figure 4e, the terms "pathogenic" and "benign" should only be used here if these are the classifications the authors have derived from ClinVar or a similar source of clinically classified variants.

      Minor comments:

      The target selection table on the website needs some kind of text filtering option. It's very tedious to have to find a protein by scrolling through the table rather than typing in the symbol. This will only get worse as more datasets are added.

      The data sources listed on the data usage section of the website are not concordant with what is in the paper. For example, MaveDB is not listed.

      I found Figure 2 to be a bit confusing in that it partially interleaves results from two different proteins. I think this would be nicer as two separate figures, one on each protein, or just of a single protein.

      Figure 3 panel b is distractingly large and I wonder if the authors could do a little bit more with this visualization.

      Capitalization is inconsistent throughout the manuscript. For example, page 9 line 288 refers to VampSEQ instead of VAMP-seq (although this is correct elsewhere). MaveDB is referred to as MAVEdb or MAVEDB in various places. AlphaMissense is referred to as Alphamissense in the Figure 5 legend. The authors should make a careful pass through the manuscript to address this kind of issues.

      MaveDB has a more recent paper (PMID: 39838450) that should be cited instead of/in addition to Esposito et al.

      On page 11, lines 338-339 the authors mention some interesting proteins including BCL2, which has base editor data available (PMID: 35288574). Are there plans to incorporate this type of functional assay data into MAVISp?

      Significance

      General assessment:

      This is a nice resource and the authors have clearly put a lot of effort in. They should be celebrated for their achievements in curating the diverse datasets, and the GitBooks are a nice approach. However, I wasn't able to get the website to work and I have raised several issues with the paper itself that I think should be addressed.

      Advance:

      New ways to explore and integrate complex data like protein structures and variant effects are always interesting and welcome. I appreciate the effort towards manual curation of datasets. This work is very similar in theme to existing tools like Genomics 2 Proteins portal (PMID: 38260256) and ProtVar (PMID: 38769064). Unfortunately as I wasn't able to use the site I can't comment further on MAVISp's position in the landscape.

      Audience:

      MAVISp could appeal to a diverse group of researchers who are interested in the biology or biochemistry of proteins that are included, or are interested in protein variants in general either from a computational/machine learning perspective or from a genetics/genomics perspective.

      My expertise:

      I am an expert in high-throughput functional genomics experiments and am an experienced computational biologist with software engineering experience.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #2

      Evidence, reproducibility and clarity

      Summary:

      The authors present a pipeline and platform, MAVISp, for aggregating, displaying and analysing variant effects with a focus on reclassification of variants of uncertain clinical significance and uncovering the molecular mechanisms underlying the mutations.

      Major comments:

      • On testing the platform, I was unable to look up a specific variant in ADCK1 (rs200211943, R115Q). I found that despite stating that the mapped RefSeq ID was NP_001136017 in the HGVSp column, it was actually mapped to the canonical UniProt sequence (Q86TW2-1). NP_001136017 actually maps to Q86TW2-3, which is missing residues 74-148 compared to the -1 isoform. The UniProt canonical sequence has no exact RefSeq mapping, so the HGVSp column is incorrect in this instance. This mapping issue may also affect other proteins and result in incorrect HGVSp identifiers for variants.
      • The paper lacks a section on how to properly interpret the results of the MAVISp platform (the case-studies are useful, but don't lay down any global rules for interpreting the results). For example: How should a variant with conflicts between the variant impact predictors be interpreted? Are certain indicators considered more 'reliable' than others?
      • In the Methods section, GEMME is stated as being rank-normalised with 0.5 as a threshold for damaging variants. On checking the data downloaded from the site, GEMME was not rank-normalised but rather min-max normalised. Furthermore, Supplementary text S4 conflicts with the Methods section over how GEMME scores are classified: S4 states that a raw-value threshold of -3 is used.
      • Note. This is a major comment as one of the claims is that the associated web-tool is user-friendly. While functional, the web app is very awkward to use for analysis on any more than a few variants at once.
        • The fixed window size of the protein table necessitates excessive scrolling to reach your protein-of-interest. This will also get worse as more proteins are added. Suggestion: add a search/filter bar.
        • The same applies to the dataset window.
        • You are unable to copy anything out of the tables.
        • Hyperlinks in the tables only seem to work if you open them in a new tab or window.
        • All entries in the reference column point to the MAVISp preprint even when data from other sources is displayed (e.g. MAVE studies).
        • Entering multiple mutants in the "mutations to be displayed" window is time-consuming for more than a handful of mutants. Suggestion: Add a box where multiple mutants can be pasted in at once from an external document.
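      The difference between rank normalisation and min-max normalisation raised above for GEMME can be illustrated with a minimal Python sketch. The scores are made-up values, not GEMME output, and the function names are hypothetical; the only point is that a fixed 0.5 cut-off selects different variants under the two schemes when the raw scores are skewed.

```python
def min_max_normalise(scores):
    """Linearly rescale scores to [0, 1]; relative spacing is preserved."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank_normalise(scores):
    """Replace each score by its rank scaled to [0, 1]; ties share the mean rank."""
    order = sorted(scores)
    n = len(scores)
    ranks = []
    for s in scores:
        first = order.index(s)      # 0-based position of the first equal value
        count = order.count(s)      # number of tied values
        mean_rank = first + (count - 1) / 2
        ranks.append(mean_rank / (n - 1))
    return ranks


# Hypothetical, skewed raw scores (illustrative only, not real GEMME values).
raw = [-8.0, -1.0, -0.8, -0.5, -0.2, 0.0]

mm = min_max_normalise(raw)
rk = rank_normalise(raw)

# Count how many variants fall below a 0.5 cut-off under each scheme.
below_mm = sum(v < 0.5 for v in mm)
below_rk = sum(v < 0.5 for v in rk)
print(below_mm, below_rk)
```

      With a single extreme value in the input, min-max scaling pushes almost all variants to one side of 0.5, whereas rank scaling splits them evenly by construction, so the stated normalisation and the applied one cannot be used interchangeably with the same threshold.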

      Minor comments

      • Grammar. I appreciate that this manuscript may have been compiled by a non-native English speaker, but I would be remiss not to point out that there are numerous grammar errors throughout, usually sentence order issues or non-pluralisation. The meaning of the authors is mostly clear, but I recommend very thoroughly proof-reading the final version.
      • There are numerous proteins that I know have high-quality MAVE datasets that are absent in the database e.g. BRCA1, HRAS and PPARG.
      • Checking one of the existing MAVE datasets (KRAS), I found that the variants were annotated as damaging, neutral or given a positive score (these appear to stand-in for gain-of-function variants). For better correspondence with the other columns, those with positive scores could be labelled as 'ambiguous' or 'uncertain'.
      • Numerous thresholds are defined for stabilizing / destabilizing / neutral variants in both the STABILITY and the LOCAL_INTERACTION modules. How were these thresholds determined? I note that (PMC9795540) uses a ΔΔG threshold of 1/-1 for defining stabilizing and destabilizing variants, which is relatively standard (though they also say that 2-3 would likely be better for pinpointing pathogenic variants).
      • "Overall, with the examples in this section, we illustrate different applications of the MAVISp results, spanning from benchmarking purposes, using the experimental data to link predicted functional effects with structural mechanisms or using experimental data to validate the predictions from the MAVISp modules."

      The last of these points is not an application of MAVISp, but rather a way in which external data can help validate MAVISp results. Furthermore, none of the examples given demonstrate an application in benchmarking (what is being benchmarked?).
      • Transcription factors section. This section describes an intended future expansion to MAVISp, not a current feature, and presents no results. As such, it should probably be moved to the conclusions/future directions section.
      • Figures. The dot-plots generated by the web app, and in Figures 4, 5 and 6, have two legends. After looking at a few, it is clear that the lower legend refers to the colour of the variant on the X-axis, most likely referencing the ClinVar effect category. This is not, however, made clear either on the figures or in the app.
      • "We identified ten variants reported in ClinVar as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q, Fig.5a)"

      E25Q is benign in ClinVar and has had that status since first submitted.

      Significance

      Platforms that aggregate predictors of variant effect are not a new concept, for example dbNSFP is a database of SNV predictions from variant effect predictors and conservation predictors over the whole human proteome. Predictors such as CADD and PolyPhen-2 will often provide a summary of other predictions (their features) when using their platforms. MAVISp's unique angle on the problem is in the inclusion of diverse predictors from each of its different modules, giving a much wider perspective on variants and potentially allowing the user to identify the mechanistic cause of pathogenicity. The visualisation aspect of the web app is also a useful addition, although the user interface is somewhat awkward. Potentially the most valuable aspect of this study is the associated gitbook resource containing reports from biocurators for proteins that link relevant literature and analyse ClinVar variants. Unfortunately, such reports are currently available for only a small minority of the proteins in the database.

      For improvement, I think that the paper should focus more on the precise utility of the web app / gitbook reports and how to interpret the results rather than going into detail about the underlying pipeline.

      In terms of audience, the fast look-up and visualisation aspects of the web-platform are likely to be of interest to clinicians in the interpretation of variants of unknown clinical significance. The ability to download the fully processed dataset on a per-protein basis would be of more interest to researchers focusing on specific proteins or those taking a broader view over multiple proteins (although a facility to download the whole database would be more useful for this final group).

      My expertise.

      • I am a protein bioinformatician with a background in variant effect prediction and large-scale data analysis.
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Syed et al. investigate the circuit underpinnings for leg grooming in the fruit fly. They identify two populations of local interneurons in the right front leg neuromere of the ventral nerve cord, i.e. 62 13A neurons and 64 13B neurons. Hierarchical clustering analysis identifies 10 morphological classes for both populations. Connectome analysis reveals their circuit interactions: these GABAergic interneurons provide synaptic inhibition either between the two subpopulations, i.e., 13B onto 13A, or among each other, i.e., 13As onto other 13As, and/or onto leg motoneurons, i.e., 13As and 13Bs onto leg motoneurons. Interestingly, 13A interneurons fall into two categories, with one providing inhibition onto a broad group of motoneurons, being called "generalists", while others project to a few motoneurons only, being called "specialists". Optogenetic activation and silencing of both subsets strongly affect leg grooming. In addition, activating or silencing subpopulations, i.e., 3 to 6 elements of the 13A and 13B groups, has marked effects on leg grooming, including on frequency and joint positions, and can even interrupt leg grooming. The authors present a computational model with the four circuit motifs found, i.e., feed-forward inhibition, disinhibition, reciprocal inhibition, and redundant inhibition. This model can reproduce relevant aspects of the grooming behavior.

      Strengths:

      The authors succeeded in providing evidence for neural circuits interacting by means of synaptic inhibition to play an important role in the generation of a fast rhythmic insect motor behavior, i.e., grooming. Two populations of local interneurons in the fruit fly VNC comprise four inhibitory circuit motifs of neural action and interaction: feed-forward inhibition, disinhibition, reciprocal inhibition, and redundant inhibition. Connectome analysis identifies the similarities and differences between individual members of the two interneuron populations. Modulating the activity of small subsets of these interneuron populations markedly affects the generation of the motor behavior, thereby exemplifying their important role in generating grooming.

      We thank the reviewer for their thoughtful and constructive evaluation of our work. 

      Weaknesses:

      Effects of modulating activity in the interneuron populations by means of optogenetics were conducted in the so-called closed-loop condition. This does not allow for differentiation between direct and secondary effects of the experimental modification in neural activity, as feedforward and feedback effects cannot be disentangled. To do so, open loop experiments, e.g., in deafferented conditions, would be important. Given that many members of the two populations of interneurons do not show one, but two or more circuit motifs, it remains to be disentangled which role the individual circuit motif plays in the generation of the motor behavior in intact animals.

      Our optogenetic experiments show a role for 13A/B neurons in grooming leg movements – in an intact sensorimotor system - but we cannot yet differentiate between central and reafferent contributions. Activation of 13As or 13Bs disinhibits motor neurons and that is sufficient to induce walking/grooming. Therefore, we can show a role for the disinhibition motif.

      Proprioceptive feedback from leg movements could certainly affect the function of these reciprocal inhibition circuits. Given the synapses we observe between leg proprioceptors and 13A neurons, we think this is likely.

      Our previous work (Ravbar et al 2021) showed that grooming rhythms in dusted flies persist when sensory feedback is reduced, indicating that central control is possible. In those experiments, we used dust to stimulate grooming and optogenetic manipulation to broadly silence sensory feedback. We cannot do the same here because we do not yet have reagents to separately activate sparse subsets of inhibitory neurons while silencing specific proprioceptive neurons. More importantly, globally silencing proprioceptors would produce pleiotropic effects and severely impair baseline coordination, making it difficult to distinguish whether observed changes reflect disrupted rhythm generation or secondary consequences of impaired sensory input. Therefore, the reviewer is correct – we do not know whether the effects we observe are feedforward (central), feedback sensory, or both. We have included this in the revised results and discussion section to describe these possibilities and the limits of our current findings.

      Additionally, we have used a computational model to test the role of each motif separately and we show that in the results.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Syed et al. presents a detailed investigation of inhibitory interneurons, specifically from the 13A and 13B hemilineages, which contribute to the generation of rhythmic leg movements underlying grooming behavior in Drosophila. After performing a detailed connectomic analysis, which offers novel insights into the organization of premotor inhibitory circuits, the authors build on this anatomical framework by performing optogenetic perturbation experiments to functionally test predictions derived from the connectome. Finally, they integrate these findings into a computational model that links anatomical connectivity with behavior, offering a systems-level view of how inhibitory circuits may contribute to grooming pattern generation.

      Strengths:

      (1) Performing an extensive and detailed connectomic analysis, which offers novel insights into the organization of premotor inhibitory circuits.

      (2) Making sense of the largely uncharacterized 13A/13B nerve cord circuitry by combining connectomics and optogenetics is very impressive and will lay the foundation for future experiments in this field.

      (3) Testing the predictions from experiments using a simplified and elegant model.

      We thank the reviewer for their thoughtful and encouraging evaluation of our work. 

      Weaknesses:

      (1) In Figure 4, while the authors report statistically significant shifts in both proximal inter-leg distance and movement frequency across conditions, the distributions largely overlap, and only in Panel K (13B silencing) is there a noticeable deviation from the expected 7-8 Hz grooming frequency. Could the authors clarify whether these changes truly reflect disruption of the grooming rhythm? 

      We reanalyzed the dataset with Linear Mixed Models. We find significant differences in mean frequencies upon silencing these neurons but not upon activation. The experimental groups are also significantly more variable. We revised these panels with updated analysis. We think these data do support our interpretation that the grooming rhythms are disrupted. 

      More importantly, all this data would make the most sense if it were performed in undusted flies (with controls) as is done in the next figure.

      In our assay conditions, undusted flies groom infrequently. We used undusted flies for some optogenetic activation experiments, where the neuron activation triggers behavior initiation. We chose to analyze the effect of silencing inhibitory neurons in dusted flies because dust reliably activates mechanosensory neurons and elicits robust grooming behavior, enabling us to assess how manipulation of 13A/B neurons alters grooming rhythmicity and leg coordination.

      (2) In Figure 4-Figure Supplement 1, the inclusion of walking assays in dusted flies is problematic, as these flies are already strongly biased toward grooming behavior and rarely walk. To assess how 13A neuron activation influences walking, such experiments should be conducted in undusted flies under baseline locomotor conditions.

      We agree that there are better ways to assay potential contributions of 13A/13B neurons to walking. We intended to focus on how normal activity in these inhibitory neurons affects coordination during grooming, and we included walking because we observed it in our optogenetic experiments and because it also involves rhythmic leg movements. The walking data is reported in a supplementary figure because we think this merits further study with assays designed to quantify walking specifically. We will make these goals clearer in the revised manuscript and we are happy to share our reagents with other research groups more equipped to analyze walking differences.

      (3) For broader lines targeting six or more 13A neurons, the authors provide specific predictions about expected behavioral effects-e.g., that activation should bias the limb toward flexion and silencing should bias toward extension based on connectivity to motor neurons. Yet, when using the more restricted line labeling only two 13A neurons (Figure 4 - Figure Supplement 2), no such prediction is made. The authors report disrupted grooming but do not specify whether the disruption is expected to bias the movement toward flexion or extension, nor do they discuss the muscle target. This is a missed opportunity to apply the same level of mechanistic reasoning that was used for broader manipulations.

      Because we cannot unambiguously identify one of the neurons from our sparsest 13A splitGAL4 lines in FANC, we cannot say with certainty which motor neurons they target. That limits the accuracy of any functional predictions.  

      (4) Regarding Figure 5: The 70ms on/off stimulation with a slow opsin seems problematic. CsChrimson off kinetics are slow and unlikely to cause actual activity changes in the desired neurons with the temporal precision the authors are suggesting they get. Regardless, it is amazing that the authors get the behavior! It would still be important for the authors to mention the optogenetics caveat, and potentially supplement the data with stimulation at different frequencies, or using faster opsins like ChrimsonR.

      We were also intrigued by the behavioral consequences of activating these inhibitory neurons with CsChrimson. We appreciate the reviewer’s point that CsChrimson’s slow off-kinetics limit precise temporal control. To address this, we repeated our frequency analysis using a range of pulse durations (10/10, 50/50, 70/70, 110/110, and 120/120 ms on/off) and compared the mean frequency of proximal joint extension/flexion cycles across conditions. We found no significant difference in frequency (LMMs, p > 0.05), suggesting that the observed grooming rhythm is not dictated by pulse period but instead reflects an intrinsic property of the premotor circuit once activated. We now include these results in ‘Figure 5—figure supplement 1’ and clarify in the text that we interpret pulsed activation as triggering, rather than precisely pacing, the endogenous grooming rhythm. We continue to note in the manuscript that CsChrimson’s slow off-kinetics may limit temporal precision. We will try ChrimsonR in future experiments.
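      The frequency that each pulse protocol would impose, if the movement simply followed the light, can be worked out directly. The following is a minimal sketch, assuming that one stimulus cycle is one on-period plus one equal off-period; the function name is illustrative.

```python
# On/off durations in milliseconds, as listed in the response above.
pulse_durations_ms = [10, 50, 70, 110, 120]


def imposed_frequency_hz(on_ms, off_ms):
    """Frequency of a square-wave stimulus with the given on and off durations."""
    period_s = (on_ms + off_ms) / 1000.0
    return 1.0 / period_s


for d in pulse_durations_ms:
    print(f"{d}/{d} ms on/off -> {imposed_frequency_hz(d, d):.2f} Hz")
```

      Under this assumption the protocols span roughly 4 to 50 Hz, and only the 70/70 ms protocol (about 7.1 Hz) falls near the 7-8 Hz grooming frequency mentioned by the reviewer. A movement frequency that stays the same across these protocols is therefore hard to explain by simple pacing from the light pulses.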

      Overall, I think the strengths outweigh the weaknesses, and I consider this a timely and comprehensive addition to the field.

      Reviewer #3 (Public review):

      Summary:

      The authors set out to determine how GABAergic inhibitory premotor circuits contribute to the rhythmic alternation of leg flexion and extension during Drosophila grooming. To do this, they first mapped the ~120 13A and 13B hemilineage inhibitory neurons in the prothoracic segment of the VNC and clustered them by morphology and synaptic partners. They then tested the contribution of these cells to flexion and extension using optogenetic activation and inhibition and kinematic analyses of limb joints. Finally, they produced a computational model representing an abstract version of the circuit to determine how the connectivity identified in EM might relate to functional output. The study, in its current form, makes an important but overclaimed contribution to the literature due to a mismatch between the claims in the paper and the data presented.

      Strengths:

      The authors have identified an interesting question and use a strong set of complementary tools to address it:

      (1) They analysed serial‐section TEM data to obtain reconstructions of every 13A and 13B neuron in the prothoracic segment. They manually proofread over 60 13A neurons and 64 13B neurons, then used automated synapse detection to build detailed connectivity maps and cluster neurons into functional motifs.

      (2) They used optogenetic tools with a range of genetic driver lines in freely behaving flies to test the contribution of subsets of 13A and 13B neurons.

      (3) They used a connectome-constrained computational model to determine how the mapped connectivity relates to the rhythmic output of the behavior.

      Weaknesses:

      The manuscript aims to reveal an instructive, rhythm-generating role for premotor inhibition in coordinating the multi-joint leg synergies underlying grooming. It makes a valuable contribution, but currently, the main claims in the paper are not well-supported by the presented evidence.

      Major points

      (1) Starting with the title of this manuscript, "Inhibitory circuits generate rhythms for leg movements during Drosophila grooming", the authors raise the expectation that they will show that the 13A and 13B hemilineages produce rhythmic output that underlies grooming. This manuscript does not show that. For instance, testing how these neurons drive the rhythmic leg movements that underlie grooming requires the authors to test whether they produce the rhythmic output underlying behavior in the absence of rhythmic input. Because the optogenetic pulses used for stimulation were rhythmic, the authors cannot make this point, and the modelling uses a "black box" excitatory network, the output of which might be rhythmic (this is not shown). Therefore, the evidence (behavioral entrainment; perturbation effects; computational model) is all indirect, meaning that the paper's claim that "inhibitory circuits generate rhythms" rests on inferred sufficiency. A direct recording (e.g., calcium imaging or patch-clamp) from 13A/13B during grooming - outside the scope of the study - would be needed to show intrinsic rhythmogenesis. The conclusions drawn from the data should therefore be tempered. Moreover, the "black box" needs to be opened. What output does it produce? How exactly is it connected to the 13A-13B circuit?

      We modified the title to better reflect our strongest conclusions: “Inhibitory circuits control leg movements during Drosophila grooming”

      Our optogenetic activation was delivered in a patterned (70 ms on/off) fashion that entrains rhythmic movements, but this does not rule out the possibility that the rhythm is imposed externally. In the manuscript, we state that we used pulsed light to mimic a flexion-extension cycle and note that this approach tests whether inhibition is sufficient to drive rhythmic leg movements when temporally patterned. While this does not prove that 13A/13B neurons are intrinsic rhythm generators, it does demonstrate that activating subsets of inhibitory neurons is sufficient to elicit alternating leg movements resembling natural grooming and walking.

      Our goal with the model was to demonstrate that it is possible to produce rhythmic outputs with this 13A/B circuit, based on the connectome. The “black box” is a small recurrent neural network (RNN) consisting of 40 neurons in its hidden layer. The inputs are the “dust” levels from the environment (the green pixels in Figure 6I), the “proprioceptive” inputs (“efference copy” from motor neurons), and the amount of dust accumulated on both legs. The outputs (all positive) connect to the 13A neurons, the 13B neurons, and to the motor neurons. We refer to it as the “black box” because we make no claims about the actual excitatory inputs to these circuits. Its function is to provide input, needed to run the network, that reflects the distribution of “dust” in the environment as well as the information about the position of the legs.  
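      The structure of the "black box" described above can be sketched in a few lines of plain Python. This is not the authors' implementation: only the hidden-layer size of 40 and the requirement that outputs are positive come from the response. The input and output sizes, the random weights, the tanh hidden units, and the softplus output function are all assumptions made for illustration.

```python
import math
import random

HIDDEN = 40  # hidden-layer size stated in the response


class BlackBoxRNN:
    """Toy Elman-style RNN; every size except HIDDEN is an illustrative assumption."""

    def __init__(self, n_in, n_out, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

        self.w_in = mat(HIDDEN, n_in)      # input -> hidden
        self.w_rec = mat(HIDDEN, HIDDEN)   # hidden -> hidden (recurrence)
        self.w_out = mat(n_out, HIDDEN)    # hidden -> output
        self.h = [0.0] * HIDDEN

    def step(self, x):
        """Advance one time step and return strictly positive outputs."""
        new_h = []
        for i in range(HIDDEN):
            drive = sum(w * xi for w, xi in zip(self.w_in[i], x))
            drive += sum(w * hj for w, hj in zip(self.w_rec[i], self.h))
            new_h.append(math.tanh(drive))
        self.h = new_h
        # Softplus keeps outputs positive (assumed; the source only says "all positive").
        return [math.log1p(math.exp(sum(w * hj for w, hj in zip(row, self.h))))
                for row in self.w_out]


# Hypothetical input layout: 4 environment dust values, 2 proprioceptive values,
# and the dust level on each of the 2 legs.
x = [0.5, 0.2, 0.0, 0.1] + [0.3, -0.3] + [0.8, 0.6]
# Hypothetical output layout: 6 values to 13A, 4 to 13B, 4 to motor neurons.
rnn = BlackBoxRNN(n_in=len(x), n_out=6 + 4 + 4)
out = rnn.step(x)
to_13a, to_13b, to_motor = out[:6], out[6:10], out[10:]
print(len(out), min(out) > 0.0)
```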

      The output of the “black box” component of the model might be rhythmic. In fact, in most instances of the model implementation this is indeed the case. However, as mentioned in the current version of the manuscript: “But the 13A circuitry can still produce rhythmic behavior even without those external inputs (or when set to a constant value), although the legs become less coordinated.” Indeed, when we refine the model (with the evolutionary training) without the “black box” (using a constant input of 0.1) the behavior is still rhythmic and sustained. Therefore, the rhythmic activity and behavior can emerge from the premotor circuitry itself without a rhythmic input.
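      That reciprocal inhibition can turn a constant drive into alternating activity is a general property of half-centre circuits, and it can be illustrated with a textbook two-unit Matsuoka oscillator. This is a swapped-in standard model, not the connectome-constrained model of the manuscript. The constant drive of 0.1 matches the value quoted above, but the time constants, the inhibition and adaptation strengths, and the presence of an adaptation term at all are assumptions.

```python
def simulate_half_centre(drive=0.1, t_end=5.0, dt=0.001):
    """Euler-integrate a two-unit Matsuoka oscillator under a constant drive."""
    tau, t_adapt = 0.05, 0.1      # membrane and adaptation time constants (s), assumed
    a, b = 2.0, 2.5               # mutual inhibition and adaptation strengths, assumed
    x = [0.01, 0.0]               # slightly asymmetric start to break the symmetry
    v = [0.0, 0.0]                # adaptation states
    diff_trace = []
    for _ in range(int(t_end / dt)):
        y = [max(0.0, x[0]), max(0.0, x[1])]          # rectified firing rates
        dx = [(-x[i] - a * y[1 - i] - b * v[i] + drive) / tau for i in range(2)]
        dv = [(-v[i] + y[i]) / t_adapt for i in range(2)]
        x = [x[i] + dt * dx[i] for i in range(2)]
        v = [v[i] + dt * dv[i] for i in range(2)]
        diff_trace.append(x[0] - x[1])
    return diff_trace


def count_sign_changes(trace):
    """Number of zero crossings, i.e. how often the two units swap dominance."""
    return sum(1 for p, q in zip(trace, trace[1:]) if p * q < 0)


trace = simulate_half_centre()
print("dominance switches in 5 s:", count_sign_changes(trace))
```

      In this toy circuit the input never changes, yet the two mutually inhibiting units keep taking turns, which is the qualitative behaviour the response describes for the model run with a constant input.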

      The context in which the 13A and 13B hemilineages sit also needs to be explained. What do we know about the other inputs to the motorneurons studied? What excitatory circuits are there? 

      We agree that there are many more excitatory and inhibitory, direct and indirect, connections to motor neurons that will also affect leg movements for grooming and walking. 13A neurons provide a substantial fraction of premotor input. For example, 13As account for ~17.1% of upstream synapses for one tibia extensor (femur seti) motor neuron and ~14.6% for another tibia extensor (femur feti) motor neuron. Our goal was to demonstrate what is possible from a constrained circuit of inhibitory neurons that we mapped in detail, and we hope to add additional components to better replicate the biological circuit as behavioral and biomechanical data is obtained by us and others.  

      Furthermore, the introduction ignores many decades of work in other species on the role of inhibitory cell types in motor systems. There is some mention of this in the discussion, but even previous work in Drosophila larvae is not mentioned, nor crustacean STG, nor any other cell types previously studied. This manuscript makes a valuable contribution, but it is not the first to study inhibition in motor systems, and this should be made clear to the reader.

      We thank the reviewer for this important reminder.  Previous work on the contribution of inhibitory neurons to invertebrate motor control certainly influenced our research. We have expanded coverage of the relevant history and context in our revised discussion.

      (2) The experimental evidence is not always presented convincingly, at times lacking data, quantification, explanation, appropriate rationales, or sufficient interpretation.

      We are committed to improving the clarity, rationale, and completeness of our experimental descriptions.  We have revisited the statistical tests applied throughout the manuscript and expanded the Methods.

      (3) The statistics used are unlike any I remember having seen, essentially one big t-test followed by correction for multiple comparisons. I wonder whether this approach is optimal for these nested, high‐dimensional behavioral data. For instance, the authors do not report any formal test of normality. This might be an issue given the often skewed distributions of kinematic variables that are reported. Moreover, each fly contributes many video segments, and each segment results in multiple measurements. By treating every segment as an independent observation, the non‐independence of measurements within the same animal is ignored. I think a linear mixed‐effects model (LMM) or generalized linear mixed model (GLMM) might be more appropriate.

      We thank the reviewer for raising this important point regarding the statistical treatment of our segmented behavioral data. Our initial analysis used independent t-tests with Bonferroni correction across behavioral classes and features, which allowed us to identify broad effects. However, we acknowledge that this approach does not account for the nested structure of the data. To address this, we re-analyzed key comparisons using linear mixed-effects models (LMMs) as suggested by the reviewer. This approach allowed us to more appropriately model within-fly variability and test the robustness of our conclusions. We have updated the manuscript based on the outcomes of these analyses.
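      The effect of ignoring the nested structure can be shown with a small standard-library sketch. It is not an LMM and not the authors' analysis: it uses the simpler device of reducing each fly to one mean before testing. All frequencies are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical grooming frequencies (Hz): one list of video segments per fly.
control = {"fly1": [7.4, 7.6, 7.5, 7.5],
           "fly2": [7.9, 8.1, 8.0, 8.0],
           "fly3": [7.0, 7.2, 7.1, 7.1]}
silenced = {"fly4": [6.9, 7.1, 7.0, 7.0],
            "fly5": [7.6, 7.8, 7.7, 7.7],
            "fly6": [6.4, 6.6, 6.5, 6.5]}


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Naive analysis: every segment treated as an independent observation (n = 12 per group).
pooled_t = welch_t([s for fly in control.values() for s in fly],
                   [s for fly in silenced.values() for s in fly])

# Nesting-aware alternative: one mean per fly, so n is the number of animals (3 per group).
per_fly_t = welch_t([mean(fly) for fly in control.values()],
                    [mean(fly) for fly in silenced.values()])

print(f"pooled segments t = {pooled_t:.2f}, per-fly means t = {per_fly_t:.2f}")
```

      Because segments from the same fly are highly similar, pooling them inflates the apparent sample size and the test statistic. A mixed model with a random intercept per fly addresses the same problem while still using the segment-level measurements.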

      (4) The manuscript mentions that legs are used for walking as well as grooming. While this is welcome, the authors then do not discuss the implications of this in sufficient detail. For instance, how should we interpret that pulsed stimulation of a subset of 13A neurons produces grooming and walking behaviours? How does neural control of grooming interact with that of walking?

      We do not know how the inhibitory neurons we investigated will affect walking or how circuits for control of grooming and walking might compete. We speculate that overlapping pre-motor circuits may participate because both behaviors have similar extension-flexion cycles at similar frequencies, but we do not have hard experimental data to support this. This would be an interesting area for future research. Here, we focused on the consequences of activating specific 13A/B neurons during grooming because they were identified through a behavioral screen for grooming disruptions, and we had developed high-resolution assays and familiarity with the normal movements in this behavior.

      (5) The manuscript needs to be proofread and edited as there are inconsistencies in labelling in figures, phrasing errors, missing citations of figures in the text, or citations that are not in the correct order, and referencing errors (examples: 81 and 83 are identical; 94 is missing in text).

      We have proofread the manuscript to fix figure labeling, citation order, and referencing errors.

      Reviewing Editor Comments:

      In addition to the recommendations listed below, a common suggestion, given the lack of evidence to support that 13A and 13B are rhythm-generating, is to tone down the title to something like, for example, "Inhibitory circuits control leg movements during grooming in Drosophila" (or similar).

      We changed the title to “Inhibitory circuits control leg movements during Drosophila grooming”.

      Reviewer #1 (Recommendations for the authors):

      (1) Naming of movements of leg segments:

      The authors refer to movements of leg segments across the leg, i.e., of all joints, as "flexion" and "extension". For example, in Figure 4A and at many other places. This naming is functionally misleading for two reasons: (i) the anatomical organization of an insect leg differs in principle from the organization of the mammalian leg, which the manuscript often refers to. While the organization of a mammalian limb is planar the organization of the insect limb shows a different plane as compared to the body length axis (for detailed accounts see Ritzmann et al. 2004; Büschges & Ache, 2024); (ii) the reader cannot differentiate between places in the text, where "flexion" and "extension" refer to movements of the tibia of the femur-tibia joint, e.g. in the graphical abstract, in Figure 3 and its supplements, and other places, e.g. Figure 4 and its supplements, where these two words refer to movements of leg segments of other joints, e.g. thorax-coxa, coxa-trochanter and tarsal joints. The reviewer strongly suggests naming the movements of the leg segments according to the individual joint and its muscles.

      We accept this helpful suggestion. We now include a description of the leg segments and joints in the revised Introduction and specify which leg segments we mean:

      “The adult Drosophila leg consists of serially arranged joints—bodywall/thoraco-coxal (Th-C), coxa–trochanter (C-Tr), trochanter–femur (Tr-F), femur–tibia (F-Ti), tibia–tarsus (Ti-Ta)—each powered by opposing flexor and extensor muscles that transmit force through tendons (Soler et al., 2004). The proximal joints, Th-C and C-Tr, mediate leg protraction–retraction and elevation–depression, respectively (Ritzmann et al., 2004; Büschges & Ache, 2025). The medial joint, F-Ti, acts as the principal flexion–extension hinge and is controlled by large tibia extensor motor neurons and flexor motor neurons (Soler et al., 2004; Baek and Mann 2009; Brierley et al., 2012; Azevedo et al., 2024; Lesser et al., 2024). By contrast, distal joints such as Ti-Ta and the tarsomeres contribute to fine adjustments, grasping, and substrate attachment (Azevedo et al., 2024).”

      We also clarified femur-tibia joints in the graphical abstract, modified Figure 3 legend and added joints at relevant places.

      (2)  Figures 3, 4, and 5 with supplements:

      The authors optogenetically silence and activate (sub)populations of 13A and 13B interneurons. Changes in frequency of movements and distance between legs or leg movements are interpreted as the effect of these experimental paradigms. No physiological recordings from leg motoneurons or leg muscles are shown. While I understand the notion of the authors to interpret a movement as the outcome of activity in a muscle, it needs to be remembered that it is well known that fast cyclic leg movements, including those for grooming, cannot be used to conclude on the underlying neural activity. Zakotnik et al. (2006) and others provided evidence that such fast cyclic movements can result from the interaction of the rhythmic activity of one leg muscle only, together with the resting tension of its silent antagonist. Given that no physiological recordings are presented, this needs to be mentioned in the discussion, e.g., in the section "Inhibitory Innervation Imbalance.......".

      Added studies from Heitler, 1974; Bennet-Clark, 1975; Zakotnik et al., 2006; Page et al., 2008 in discussion.

      (3) Introduction and Discussion:

      The authors refer extensively to work on the mammalian spinal cord and compare their own work with circuit elements found in the spinal cord. From the perspective of the reviewer, this notion is in conflict with acknowledging prior research work on the role of inhibitory network interactions in other invertebrates and lower vertebrates, such as the locust flight system (feedforward inhibition, disinhibition), the crustacean stomatogastric nervous system (reciprocal inhibition), the Clione swimming system (reciprocal inhibition, feedforward inhibition, disinhibition), the leech swimming system (reciprocal inhibition, disinhibition, feedforward inhibition), and the Xenopus swimming system (reciprocal inhibition). The next paragraph illustrates this criticism/suggestion for stick insect neural circuits for leg stepping.

      (4) Discussion:

      "Feedforward inhibition" and "Disinhibition": it has already been described that rhythmic activity of antagonistic insect leg motoneuron pools arises from alternating synaptic inhibition and disinhibition of the motoneurons from premotor central pattern generating networks, e.g., Büschges (1998); Büschges et al. (2004); Ruthe et al. (2024).

      We have added these references to the revised Discussion.

      (5) Circuit motifs of the simulation, i.e., mutual inhibition between interneurons and onto motoneurons and sensory feedback influences and pathways share similarities to those formerly used by studies simulating rhythmic insect leg movements, for example, Schilling & Cruse 2020, 2023 or Toth et al. 2012. For the reader, it appears relevant that the progress of the new simulation is explained in the light of similarities and differences to these former approaches with respect to the common circuit motifs used.

      We now put our work in the context of other models in the Discussion section: “Similar circuit motifs, namely reciprocal inhibitions between pre-motor neurons and the sensory feedback have been modeled before, in particular neuroWalknet, and such simple motifs do not require a separate CPG component to generate rhythmic behavior in these models (Schilling & Cruse 2020, 2023). However, our model is much simpler than the neuroWalknet - it controls a 2D agent operating on an abstract environment (the dust distribution), without physics. In real animals or complex mechanical models such as NeuroMechFly (Lobato-Rios et al), a more explicit central rhythm generation may be advantageous for the coordination across many more degrees of freedom.”

      Reviewer #2 (Recommendations for the authors):

      I might have missed this, but I couldn't find any mention of how the grooming command pathways, described by previous work from the authors' lab, recruit these predicted grooming pattern-generating neurons. This should be mentioned in the connectome analysis and also discussed later in the discussion.

      13A neurons are direct downstream targets of previously described grooming command neurons. Specifically, the antennal grooming command neuron aDN (Hampel et al., 2015) synapses onto two primary 13As (γ and α; 13As-i) that connect to proximal extensor and medial flexor motor neurons, as well as four other 13As (9a, 9c, 9i, 6e) projecting to body wall extensor motor neurons. The 13As-i also form reciprocal connections with 13As-ii, providing a potential substrate for oscillatory leg movements. aDN connects to homologous 13As on both sides, consistent with the bilateral coordination needed for antennal sweeping. 

      The head grooming/leg rubbing command neuron DNg12 (Guo et al., 2022)  synapses directly onto ~50 13As, predominantly those connected to proximal motor neurons. 

      While sometimes the structural connectivity suggests pathways for generating rhythmic movements, the extensive interconnections among command neurons and premotor circuits indicate that multiple motifs could contribute to the observed behaviors. Further work will be needed to determine how these inputs are dynamically engaged during normal grooming sequences. We have now added it to the discussion.

      I encourage the authors to be explicit about caveats wherever possible: e.g., ectopic expression in genetic tools, potential for other unexplored neurons as rhythm generators (rather than 13A/B), given that the authors never get complete silencing phenotypes, CsChrimson kinetics, neurotransmitter predictions, etc.

      We now explain these caveats as follows: Ectopic expression is noted in Figure 1—figure supplement 1, and we added the following to the Discussion: “While our experiments with multiple genetic lines labeling 13A/B neurons consistently implicate these cells in leg coordination, ectopic expression in some lines raises the possibility that other neurons may also contribute to this phenotype. In addition, other excitatory and inhibitory neural circuits, not yet identified, may also contribute to the generation of rhythmic leg movements. Future studies should identify such neurons that regulate rhythmic timing and their interactions with inhibitory circuits.”

      We also added a caveat regarding CsChrimson kinetics in the Results. Finally, our identification of these neurons as inhibitory is based on genetic access to the GABAergic population (we use GAD-spGAL4 as part of the intersection which targets them), rather than on predictions of neurotransmitter identity.

      Reviewer #3 (Recommendations for the authors):

      Detailed list of figure alterations:

      (1) Figure 1:

      (a) Figure 1B and Figure 1 - Figure Supplement 1 lack information on individual cells - how can we tell that the cells targeted are indeed 13A and 13B, and which ones they are? Since off-target expression in neighboring hemilineages isn't ruled out, the interpretation of results is not straightforward.

      The neurons labeled by R35G04-DBD and GAD1-AD are identified as 13A and 13B based on their stereotyped cell body positions and characteristic neurite projections into the neuropil, which match those of 13A and 13B neurons reconstructed in the FANC and MANC connectome. While we have not generated flip-out clones in this genotype, we do isolate 13A neurons more specifically later in the manuscript using R35G04-DBD intersected with Dbx-AD, and show single-cell morphology consistent with identified 13A neurons. The purpose of including this early figure was to motivate the study by showing that silencing this population, which includes 13A/13B neurons, strongly reduces grooming in dusted flies. 

      Regarding Figure 1—Figure Supplement 1:

      This figure shows the expression patterns of all lines used throughout the manuscript. Panels C and D illustrate lines with minimal to no ectopic expression. Panels A and B show neurons with posterior cell bodies that may correspond to 13A neurons not reconstructed in our dataset but described in Soffers et al., 2025 and Marin et al., 2025. We have provided detailed information about all VNC expression in the figure legend.

      (b) Figure 1D lacks explanation of boxplots, asterisks, genotypes/experimental design.

      Added.

      (c) Figures 1E-F and video 1 lack quantification, scale bars.

      Added quantification.

      (2) Figure 2:

      (a) Figure 2A, Figure 2 - Supplement 3: What are the details of the hierarchical clustering? What metric was used to decide on the number of clusters? 

      We have used FANC packages to perform NBLAST clustering (Azevedo et al., 2024, Nature). We now include the full protocol in Methods.  The details are as follows:

      We performed hierarchical clustering on pairwise NBLAST similarity scores computed using navis.nblast_allbyall(). The resulting similarity matrix was symmetrized by averaging it with its transpose, and converted into a distance matrix using the transformation:

      $\text{distance} = 1 - \text{similarity}$

      This ensures that a perfect NBLAST match (similarity = 1) corresponds to a distance of 0.

      Clustering was performed using Ward’s linkage method (method='ward' in scipy.cluster.hierarchy.linkage), which minimizes the total within-cluster variance and is well-suited for identifying compact, morphologically coherent clusters.

      We did not predefine the number of clusters. Instead, clusters were visualized using a dendrogram, where branch coloring is based on the default behavior of scipy.cluster.hierarchy.dendrogram(). By default, this function applies a visual color threshold at 70% of the maximum linkage distance to highlight groups of similar elements. In our dataset, this corresponded to a linkage distance of approximately 1–1.5, which visually separated morphologically distinct neuron types (Figures 2A and Figure 2—figure supplement 3A). This threshold was used only as a visual aid and not as a hard cutoff for quantitative grouping.
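      The steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the toy score matrix stands in for the output of navis.nblast_allbyall(), and the function name cluster_from_similarity is hypothetical. Ward's method formally assumes Euclidean distances, so applying it to 1 − similarity should be read as a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform


def cluster_from_similarity(similarity):
    """Ward clustering on an NBLAST-style similarity matrix (hypothetical helper)."""
    sim = np.asarray(similarity, dtype=float)
    # Symmetrize by averaging the matrix with its transpose.
    sym = (sim + sim.T) / 2.0
    # Convert to distance so that similarity 1 maps to distance 0.
    dist = 1.0 - sym
    np.fill_diagonal(dist, 0.0)
    # linkage() expects a condensed (upper-triangle) distance vector.
    condensed = squareform(dist, checks=False)
    Z = linkage(condensed, method="ward")
    # dendrogram() colors branches below 0.7 * max linkage distance by default;
    # the threshold is passed explicitly here only to make that visible.
    threshold = 0.7 * Z[:, 2].max()
    tree = dendrogram(Z, no_plot=True, color_threshold=threshold)
    return dist, Z, threshold, tree


# Toy asymmetric score matrix (assumption): two groups of three "neurons".
rng = np.random.default_rng(0)
toy = np.full((6, 6), 0.1)
toy[:3, :3] = 0.8
toy[3:, 3:] = 0.8
toy += rng.uniform(-0.05, 0.05, size=toy.shape)
np.fill_diagonal(toy, 1.0)

dist, Z, threshold, tree = cluster_from_similarity(toy)
```

      Here tree["leaves"] gives the leaf order of the dendrogram, and the threshold only affects branch coloring, matching its use as a visual aid rather than a hard cutoff.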

      The Methods section says that the classification "included left-right comparisons". What does that mean? What are the implications of the authors only having proofread a subset of neurons in T1L (see below)? 

      All adult leg motor neurons and 13A neurons (except one, 13A-ε) have neurite arbors restricted to the local, ipsilateral neuropil associated with the nearest leg.  Although 13B neurons have contralateral cell bodies, their projections are also entirely ipsilateral. The Tuthill Lab, with contributions from our group, focused proofreading efforts on the left front neuropil (T1L) in FANC. This is also where the motor neuron to muscle mapping has been most extensively done. We reconstructed/proofread the 13A and 13B neurons from the right side as well (T1R). We see similar clustering based on morphology and connectivity here as well.  

      Reconstructions lack scale bars and information on orientation (also in other figures), and the figures for the 13B analysis are not consistent with the main figure (e.g., labelling of clusters in panel B along x,y axes).

      Added.  

      (b) Figure 2B: Since the cosine similarity matrix's values should go from -1 to 1, why was a color map used ranging from 0 to 1? 

      While cosine similarity values can theoretically range from -1 to 1, in our case, all vector entries (i.e., synaptic weights) are non-negative, as they reflect the number of synapses from each 13A neuron to its downstream targets. This means all pairwise cosine similarities fall within the 0 to 1 range. 
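      A small numerical sketch of this point: when every entry of the connectivity vectors is non-negative, each dot product is non-negative, so the cosine similarity cannot fall below 0. The synapse counts below are invented for illustration and the function name is hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(weights):
    """Pairwise cosine similarity between the rows of a weight matrix."""
    w = np.asarray(weights, dtype=float)
    # Normalize each row to unit length (assumes no all-zero rows).
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    unit = w / norms
    return unit @ unit.T


# Hypothetical synapse counts: rows are 13A neurons, columns are motor neurons.
counts = np.array([
    [12, 0, 3],
    [10, 1, 4],
    [0, 20, 0],
])
sim = cosine_similarity_matrix(counts)
```

      Rows that share no targets (the first and third here) have a similarity of exactly 0, and rows with similar target profiles approach 1.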

      Why are some neurons not included in this figure, like 1g, 2b, 3c-f (also in Supplement 3)?

      The few 13A neurons that don’t connect to motor neurons are not shown in the figure.

      (c) Figures 2C and D: the overlaid neurites are difficult to distinguish from one another. If the point here is to show that each 13A neuron class innervates specific motor neurons, then this is not the clearest way of doing that. For instance, the legend indicates that extensors are labelled in red, and that MNs with the highest number of synapses are highlighted in red - does that work? I could not figure out what was going on. On a more general point: if two cells are connected, does that not automatically mean that they should overlap in their projection patterns?

      We intended these panels to illustrate that 13A neurons synapse onto overlapping regions of motor neurons, thereby creating a spatial representation of muscle targets. However, we agree that overlapping multiple neurons in a single flat projection makes the figure difficult to interpret. We have therefore removed Figures 2C and 2D.

      While neurons must overlap at least somewhere if they form a synaptic connection, the amount of their neurites that overlap can vary, and more extensive overlap suggests more possible connections. Because the synapses are computationally predicted, examining the overlap helps to confirm that these predictions are consistent.

      While connected neurons must overlap locally at their synaptic sites, they do not necessarily show extensive or spatially structured overlap of their projections. For example, descending neurons or 13B interneurons may form synapses onto motor neurons without exhibiting a topographically organized projection pattern. In contrast, 13A→MN connectivity is organized in a structured manner: specialist 13A neurons align with the myotopic map of MN dendrites, whereas generalist 13As project more broadly and target MN groups across multiple leg segments, reflecting premotor synergies. This spatial organization—combining both joint-specific and multi-joint representations—was a key finding we wished to highlight, and we have revised the Results text to make this clearer.

      (d) Figure 2 - Figure Supplement 1: Why are these results presented in a way that goes against the morphological clustering results, but without explanation? Clusters 1-3 seem to overlap in their connectivity, and are presented in a mixed order. Why is this ignored? Are there similar data for 13B?

      The morphological clusters 1–3 do exhibit overlapping connectivity, but this is consistent with both their anatomical similarity and premotor connectivity. Specifically, Cluster 1 neurons connect to SE and TrE motor neurons, Cluster 2 connects only to TrE motor neurons, and Cluster 3 targets multiple motor pools, including SE and TrE (Figure 2—Figure Supplement 1B). This overlap is also reflected in the high pairwise cosine similarity among Clusters 1–3 shown in Figure 2B. Thus, their similar connectivity profiles align with their proximity in the NBLAST dendrogram.

      Regarding 13B neurons: there is no clear correlation between morphological clusters and downstream motor targets, as shown in the cosine similarity matrix (Figure 2—figure supplement 3). Moreover, even premotor 13B neurons that fall within the same morphological cluster do not connect to the same set of motor neurons (Figure 3—figure supplement 1F). For example, 13B-2a connects to LTrM and tergo-trochanteral MNs, 13B-2b connects to TiF MNs, and 13B-2g connects to Tr-F, TiE, and tergo-T MNs. Together, these results demonstrate that 13A neurons are spatially organized in a manner that correlates with their motor neuron targets, whereas 13B neurons lack such spatially structured organization, suggesting distinct principles of connectivity for these two inhibitory premotor populations.

      (e) Figure 2 - Figure Supplement 2: A comparison is made here between T1R (proofread) and T1L (largely not proofread). A general point is made here that there are "similar numbers of neurons and cluster divisions". First, no quantitative comparison is provided, making it difficult to judge whether this point is accurate. Second, glancing at the connectivity diagram, I can identify a large number of discrepancies. How should we interpret those? Can T1L be proofread? If this is too much of a burden, results should be presented with that as a clear caveat.

      The 13A and 13B neurons in the T1L hemisegment are fully proofread (Lesser et al, 2024, current publication); the T1R has been extensively analyzed as well.  To compare the clustering and match identities of 13A and 13B neurons on the left and the right, we mirrored the 13A neurons from the left side and used NBLAST to match them with their counterparts on the right.

      While individual synaptic counts differ between sides in the FANC dataset (T1L generally showing higher counts), the number of 13A neurons, their clustering, and the overall patterns of connectivity are largely conserved between T1L and T1R.

      Importantly, each 13A cluster targets the same subset of motor neurons on both sides, preserving the overall pattern of connectivity. The largest divergence is seen in cluster 9, which shows more variable connectivity.  

      (f) Figure 2 - Figure Supplements 4 & 5: Why did the authors choose to present the particular cell type in Supplement 4?  Why are the cell types in Supplement 5 presented differently? Labels in Supplement 5 are illegible, but I imagine this is due to the format of the file presented to reviewers. Why are there no data for 13B?

      We chose to present the particular cell type in Supplement 4 because it corresponds to cell types targeted in the genetic lines used in our behavioral experiments. The 13A neuron shown is also one of the primary neurons in this lineage. This example illustrates its broader connectivity beyond the inhibitory and motor connections emphasized in the main figures.

      In Supplement 5, we initially aimed to highlight that the major downstream targets of 13A neurons are motor neurons. We have now removed this figure and instead state in the text that the major downstream targets are MNs.

      We did not present 13B neurons in the same format because their major downstream targets are not motor neurons. Instead, we emphasize their role in disinhibition and their connections to 13A neurons, as shown in a specific example in Figure 3—figure supplement 2. This 13B neuron also corresponds to a cell type targeted in the genetic line used in our behavioral experiments.

      (3) Figure 3:

      (a) Figure 3A: the collection of diagrams is not clear. I'd suggest one diagram with all connections included repeated for each subpanel, with each subpanel highlighting relevant connections and greying out irrelevant ones to the type of connection discussed. The nomenclature should be consistent between the figure and the legend (e.g., feedforward inhibition vs direct MN inhibition in A1).

      The intent of Figure 3A is to highlight individual circuit motifs by isolating them in separate panels. Including all connections in every sub panel would likely reduce clarity and make it harder to follow each motif. For completeness, we show the full set of connections together in Panel D. We updated the nomenclature as suggested. 

      (b) Figure 3B: Why was the medial joint discussed in detail? Do the thicknesses of the lines represent the number of synapses? There should be a legend, in that case. Why are the green edges all the same thickness? Are they indeed all connected with a similarly low number of synapses?

      We focused on the medial joint (femur-tibia joint) because it produces alternating flexion and extension of the tibia during both head sweeps and leg rubbing, which are the main grooming actions we analyzed. During head grooming, the tarsus is typically suspended in the air, so the cleaning action is primarily driven by tibial movements generated at the medial joint. 

      The thickness of the edges represents the number of synapses, and we have now clarified this in the legend. The green edges represent connections from 13B neurons, which were manually added to the graph, as described in the Methods section. 13B neurons are smaller than 13A neurons and form significantly fewer total downstream synapses. For example, the 13B neuron shown in Figure 3—figure supplement 2 makes a total of 155 synapses to all downstream neurons, with only 22 synapses to its most strongly connected partner, a 13A neuron. The relatively sparse connectivity of 13B neurons is shown in thinner or uniform edge weights in this graph.

      (c) Figure 3C: This is a potentially important panel, but the connections are difficult to interpret. Moreover, the text says, "This organizational motif applies to multiple joints within a leg as reciprocal connections between generalist 13A neurons suggest a role in coordinating multi-joint movements in synergy". To what extent is this a representative result? The figure also has an error in the legend (it is not labelled as 3C).

      This statement is true and is based on the connectivity of these neurons. We have now added the following to the figure legend: “Data for 13A-MN connections shown in Figure 2—figure supplement 1 I9, I6, I7, H9, H4, and H5; 13A-13A connections shown in Figure 3—figure supplement 1C.”

      Thanks, we fixed the labelling error.

      (d) Figure 3 - Figure Supplement 1: Panel A is very difficult to interpret. Could a hierarchical diagram be used, or some other representation that is easier to digest?

      Panel A provides a consolidated view of all upstream and downstream interconnections among individual 13A and 13B neurons, allowing readers to quickly assess which neurons connect to which others without having to examine all subpanels. For a hierarchical representation, we have provided individual neuron-level diagrams in Panels C–F. 

      (e) Figure 3 - Figure Supplement 2: Why was this cell type selected?

      We selected this 13B because it is involved in the disinhibition of 13A neurons and is also present in the genetic line used for our behavioral experiments. 

      (f) Figure 3 - Figure Supplement 3: The diagram is confusing, with text aligned randomly, and colors lacking some explanations. Legend has odd formatting.

      The diagram layout and text alignment are designed to reflect the logical grouping of proprioceptors, 13A neurons, and motor neurons. To improve clarity, we have added node colors, included a written explanation for edge colors, and corrected the formatting of the figure legend.

      (4) Figure 4:

      (a) Figure 4A: This has no quantification, poor labelling, and odd units (centiseconds?). The colours between the left and right panels also don't align.

      We have fixed these issues.

      (b) Figure 4D-K: The ranges on the different axes are not the same (e.g., y axis on box plots, x axis on histograms). This obscures the fact that the differences between experimental and control, which in many cases are not big, are not consistent between the various controls. Moreover, the data that are plotted are, as far as I can tell (which is also to say: this should be explained), one value per frame. With imaging at 100Hz, this means that an enormous number of values are used in each analysis. Very small differences can therefore be significant in a statistical sense. However, how different something is between conditions is important (effect size), and this is not taken into account in this manuscript. For instance, in 4D-J, the differences in the mean seem to be minimal. Should that not be taken into consideration? A case in point is panel D in Figure 4 - Figure Supplement 1: even with near identical distributions, a statistically significant difference is detected. The same applies to Figure 4 - Figure Supplements 1-3. Also, what do the boxes and whiskers in the box plots show, exactly?

      We have re-plotted all summary panels using linear mixed-effects models (LMMs) as suggested. In the updated plots, each dot represents the mean value for a single animal, and bar height represents the group mean. Whiskers indicate the 95% confidence interval around the group mean. This approach avoids inflating sample size by using per-frame values and provides a more accurate view of both variability and effect size. 
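      To illustrate why summarizing per animal avoids inflating the sample size, here is a simplified sketch. It is not the linear mixed-effects model used in the revision: it only collapses per-frame values to one mean per animal and computes a t-based 95% confidence interval across animals. The data are simulated and the function name per_animal_summary is hypothetical.

```python
import numpy as np
from scipy import stats


def per_animal_summary(values, animal_ids, confidence=0.95):
    """Collapse per-frame values to per-animal means and a CI on the group mean."""
    values = np.asarray(values, dtype=float)
    animal_ids = np.asarray(animal_ids)
    animals = np.unique(animal_ids)
    # One mean per animal: the unit of replication is the animal, not the frame.
    means = np.array([values[animal_ids == a].mean() for a in animals])
    n = len(means)
    group_mean = means.mean()
    # Standard error and t-based interval computed across animals.
    sem = means.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1) * sem
    return means, group_mean, (group_mean - half_width, group_mean + half_width)


# Simulated data (assumption): 5 animals, 1000 frames each, animal-level offsets.
rng = np.random.default_rng(1)
animal_ids = np.repeat(np.arange(5), 1000)
animal_offsets = rng.normal(0.0, 1.0, size=5)
values = animal_offsets[animal_ids] + rng.normal(0.0, 0.5, size=animal_ids.size)

means, group_mean, ci = per_animal_summary(values, animal_ids)
```

      With 5000 frames but only 5 animals, the interval width is set by the 5 per-animal means, which is the behavior the per-animal plots are meant to convey.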

      (e) Figure 4 - Figure Supplement 1: There are 6 cells labelled in the split line; only 4 are shown in A3. Is cluster 6 a convincing match between EM and MCFO?

      We indeed report four neurons targeted by the split-GAL4 line in flip out clones. Generating these clones was technically challenging. In our sample (n=23), we may not have labeled all of the neurons.  Alternatively, two neurons may share very similar morphology and connectivity, making it difficult to tell them apart. We have added this clarification to the revised figure legend.

      It is interesting to see data on walking in panel K, but why were these analyses not done on any of the other manipulations? What defect produced the reduction in velocity, exactly? How should this be interpreted?

      Our primary focus was on grooming, but we did observe changes in walking, so we report illustrative examples. We initially included a panel showing increased walking velocity upon 13A activation, but this effect did not survive FDR correction and was removed in the revised version. We instead included data for 13A silencing which did not affect the frequency of joint movements during walking. However, spatial aspects of walking were affected: the distance between front leg tips during stance was reduced, indicating that although flies continued to walk rhythmically, the positioning of the legs was altered. This suggests that these specific 13A neurons may influence coordination and limb placement during walking without disrupting basic rhythmicity. As reviewer #2 also noted, dust may itself affect walking, so we have chosen not to further pursue this aspect in the current study.

      (f) Figure 4 - Figure Supplement 2: panel A is identical to Figure 1 - Figure Supplement 1C. This figure needs particular attention, both in content and style. Why present data on silencing these neurons in C-D, but not in E-F?

      We removed the panel from Figure 1 - Figure Supplement 1C and kept it as Figure 4 - Figure Supplement 2A. Panels E-F also show silencing data, as does C’.

      (g) Figure 4 - Figure Supplement 3: In panel B, the authors should more clearly demonstrate the identity of 4b and 4a. Why present such a limited number of parameters in F and G?

      The cells shown in panel B represent the best matches we could identify between the light-level expression pattern and EM reconstructions. In panels F and G, we focused on bout duration, as leg position/inter-leg distance and frequency were already presented (in Figure 4). Together, these parameters demonstrate the role of 13B neurons in coordinating leg movements. Maximum angular velocity of proximal joints was not significantly affected and is therefore not included.

      (5) Figure 5:

      (a) Figure 5B: Lacks a quantification of the periodic nature of the behavior, which is required to compare to experimental conditions, e.g., in panel C.

      Added

      (b) Figure 5C: Requires a quantification; stimulus dynamics need to be incorporated.

      Added

      (c) Figure 5D: More information is needed. Does "Front leg" mean "leg rub", and "Head" "head sweep"? How do the dynamics in these behaviors compare to normal grooming behavior?

      Yes, "Head" refers to head sweeps and "Front leg" refers to leg rubbing. We added the comparison, which is shown in 5E-F.

      (d) Figure 5E: How should we interpret these plots? Do these look like normal grooming/walking?

      We have now included the comparison.

      (e) Figure 5F: Needs stats to compare it to 5B'.

      Done

      (6) Figure 6:

      (a) Figure 6A: I think the circuit used for the model is lacking the claw/hook extension - 13Bs connection. Any other changes? What is the rationale?

      13Bs upstream of these particular 13As do not receive significant connections from claw/hook neurons (there is only a single connection of ~5 synapses from one hook extension neuron to one 13B neuron, which we neglected for modeling purposes).

      (b) Figure 6B and C: Needs labels, legend; where is 13B?

      In the figure legend we now added: “The 13B neurons in this model do not connect to each other, receive excitatory input from the black box, and only project to the 13As (inhibitory). Their weight matrix, with only two values, is not shown.” We added the colorbar and corrected the color scheme.

      (c) Figure 6D-H: plots are very difficult to interpret. Units are also missing (is "Time" correct?).

      The axis is indeed time, measured in simulation frames. We added this to the figure and the legend, and clarified the units of all variables in these panels. We also corrected the color scheme and added the meaning of the colors to the legend text.

      (d) Figure 6I: I think the authors should consider presenting this in a different format.

      (e)  Figure 6 J and K (also Figure Supplement): lacks labels.

      We added labels for the three joints, increased the size of fonts for clarity, and added panel titles on the top.

      More specific suggestions:

      (1) It would be helpful if the titles of all figures reflected the take-away message, like in Figure 2.

      (2) "Their dendrites occupy a limited region of VNC, suggesting common pre-synaptic inputs" - all dendrites do, so I'd suggest rephrasing to be more precise.

      (3) "We propose that the broadly projecting primary neurons are generalists, likely born earlier, while specialists are mostly later-born secondary neurons" - this needs to be explained.

      We added the explanation.

      We propose that the broadly projecting primary neurons are generalists, likely born earlier, while specialists are mostly later-born secondary neurons. This is consistent with the known developmental sequence of hemilineages, where early-born primary neurons typically acquire larger arbors and integrate across broader premotor and motor targets, whereas later-born secondary neurons often have more spatially restricted projections and specialized roles[18,19,81,82,85]. Our morphological clustering supports this idea: generalist 13As have extensive axonal arbors spanning multiple leg segments, whereas specialist neurons are more narrowly tuned, connecting to a few MN targets within a segment. Thus, both their morphology and connectivity patterns align with the expectation from birth-order–dependent diversification within hemilineages.

      (4) "We did not find any correlation between the morphology of premotor 13B and motor connections" - this needs to be explained, as morphology constrains connectivity.

      We agree that morphology often constrains connectivity. However, in contrast to 13A neurons—where morphological clusters strongly predict MN connectivity—we did not observe such a correlation for 13B neurons. As we noted in our response to comment 2d, 13B neurons can form synapses onto MNs without exhibiting extensive or spatially structured overlap of their axonal projections with MN dendrites. This suggests that 13B→MN connectivity may be governed by more local, synapse-specific rules rather than by large-scale morphological positioning, in contrast to the spatially organized premotor map we observe for 13As.

      (5) "Based on their connectivity, we hypothesized that continuously activating them might reduce extension and increase flexion. Conversely, silencing them might increase extension and reduce flexion." - these clear predictions are then not directly addressed in the results that follow.

      We have now expanded this section.

      (6) "Thus, 13A neurons regulate both spatial and temporal aspects of leg coordination" "Together, 13A and 13B neurons contribute to both spatial and temporal coordination during grooming" - are these not intrinsically linked? This needs to be explained/justified.

      The spatial (leg positioning, joint angles) and temporal (frequency, rhythm) aspects are often linked, but they can be at least partially dissociated. This has been shown in other systems: for example, Argentine ants reduce walking speed on uneven terrain primarily by decreasing stride frequency while maintaining stride length (Clifton et al., 2020), and Drosophila larvae adjust crawling speed mainly by modulating cycle period rather than the amplitude of segmental contractions (Heckscher et al., 2012). Consistent with these findings, we observe that 13A neuron manipulation in dusted flies significantly alters leg positioning without changing the frequency of walking cycles. Thus, leg positioning can be perturbed while the number of extension–flexion cycles per second remains constant, supporting the view that spatial and temporal features are at least partially dissociable.

      (7) "Connectome data revealed that 13B neurons disinhibit motor pools (...) One of these 13B neurons is premotor, inhibiting both proximal and tibia extensor MN" - these are not possible at the same time.

      We show that the 13B population contains neurons with distinct connectivity motifs: some inhibit premotor 13A neurons (leading to disinhibition of motor pools), while others directly inhibit motor neurons. The split-GAL4 line we use labels three 13B neurons: two that inhibit the primary 13A neuron 13A-9d-γ (which targets proximal extensor and medial flexor MNs) and one that is premotor, directly inhibiting both proximal and tibia extensor MNs. Although these functions may appear mutually exclusive, their combined action could converge to a similar outcome: disinhibition of proximal extensor and medial flexor MNs while simultaneously inhibiting medial extensor MNs. This suggests that the labeled 13B neurons act in concert to bias the network toward a specific motor state rather than producing contradictory effects.

      (8) "we often observed that one leg became locked in flexion while the other leg remained extended, (indicating contribution from additional unmapped left right coordination circuits)." - Are these results not informative? I'd suggest the authors explain the implications of this more, rather than mentioning it within brackets like this.

      We agree with the reviewer that these results are highly informative. The observation that one leg can remain locked in flexion while the other stays extended suggests that additional left–right coordination circuits are engaged during grooming. This cross-talk is likely mediated by commissural interneurons downstream of inhibitory premotor neurons, which have not yet been systematically studied. Dissecting these circuits will require a dedicated project combining bilateral connectomic reconstruction, studying downstream targets of these commissural neurons, and functional interrogation, which is beyond the scope of the current study.

      (9) "Indeed, we observe that optogenetic activation of specific 13A and 13B neurons triggers grooming movements. We also discover that" - this phrasing suggests that this has already been shown.

      We replaced “Indeed” with “Consistent with this connectivity,”.

      (10) "But the 13A circuitry can still produce rhythmic behavior even without those sensory inputs (or when set to a constant value), although the legs become less coordinated." - what does this mean?

      We can train (fine-tune) the model without the descending inputs from the “black box” and the behavior will still be rhythmic, meaning that our modeled 13A circuit alone can produce rhythmic behavior, i.e. the rhythm is not generated externally (by the “black box”). We added Figure 7 to the MS and re-wrote this paragraph. In the revised manuscript we now state: “But the 13A circuitry can still produce rhythmic behavior even without those excitatory inputs from the “black box” (or when set to a constant value), although the legs become less coordinated (because they are “unaware” of each other’s position at any time). Indeed, when we refine the model (with the evolutionary training) without the “black box” (using instead a constant input of 0.1) the behavior is still rhythmic although somewhat less sustained (Figure 7). This confirms that the rhythmic activity and behavior can emerge from the modeled pre-motor circuitry itself, without a rhythmic input.”

      (11) "However, to explore the possibility of de novo emergent periodic behavior (without the direct periodic descending input) we instead varied the model's parameters around their empirically obtained values." - why do the authors not show how the model performs without tuning it first? What are the changes exactly that are happening as a result of the tuning? Are there specific connections that are lost? Do I interpret Figure 6B and C correctly when I think that some connections are lost (e.g., an SN-MN connection)? How does that compare to the text, which states that "their magnitudes must be at least 80% of the empirical weights"?

      Without the fine-tuning we do not get any behavior (the activation levels saturate). So, we tolerate 20% divergence from the empirically established weights and we keep the signs the same. However, in the previous version we allowed the weights to decrease below 20% of the empirical weight (as long as the sign didn’t change) but not above (the signs were maintained and synapses were not added or removed). We thank the reviewer for observing this important discrepancy. In the current version we ensured that the model’s weights are bounded in both directions (tolerance = 0.2), but we also partially relaxed the constraint on adjacency matrix re-scaling (see Methods, the section “The fine-tuning of the synaptic weights”, where we now clarify more precisely how the evolving model is fitted to the connectome constraints). We then re-ran the fine-tuning process. Figures 6B and C are now corrected with the properly constrained model, as are the other panels in the figure. We also applied a clearer color scheme (blue is now inhibitory and red is excitatory) for Figures 6B and C.
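
      The two-sided bound described above can be illustrated with a minimal sketch. This is not the code used for the model: the function name `bound_weight` and the example values are hypothetical, and the sketch only assumes the rule as stated in the response (magnitude within a tolerance of 0.2 around the empirical weight, sign preserved, no synapses added). It does not model the relaxed adjacency-matrix re-scaling.

```python
def bound_weight(candidate, empirical, tolerance=0.2):
    """Clamp a candidate weight to within +/- tolerance of the empirical weight.

    Hypothetical helper for illustration: the sign of the empirical weight is
    always kept, and a zero empirical weight stays zero (no synapse is added).
    """
    if empirical == 0:
        return 0.0
    sign = 1.0 if empirical > 0 else -1.0
    lower = (1.0 - tolerance) * abs(empirical)
    upper = (1.0 + tolerance) * abs(empirical)
    # Project the candidate onto the empirical sign, then clamp its magnitude.
    magnitude = candidate * sign
    clamped = min(max(magnitude, lower), upper)
    return sign * clamped


# Illustrative values only (not taken from the connectome).
inhibitory = bound_weight(-2.0, -1.0)   # too strong, pulled back to the upper bound
excitatory = bound_weight(0.1, 1.0)     # too weak, raised to the lower bound
unchanged = bound_weight(0.9, 1.0)      # already inside the tolerated range
```

      In an evolutionary fitting loop, a clamp of this kind would be applied to every mutated weight before the candidate model is evaluated.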

      (12) "Interestingly, removing 13As-ii-MN connections to the three MNs (second row of the 13A → MN matrices in Figures 6B and C) does not have much effect on the leg movement (data not shown). It seems sufficient for this model to contract only one of the two antagonistic muscles per joint, while keeping the other at a steady state." - this is not clear.

      We repeated this test with the newly fine-tuned model and re-wrote the result as follows: “...when we remove just the 13A-i-MN connections (which control the flexors of the right leg) we likewise get a complete paralysis of the leg. However, removing the 13A-ii-MN (which control the extensors of the right leg) has only a modest effect on the leg movement. So, we need the 13A-i neurons to inhibit the flexors (via motor neurons), but not extensors, in order to obtain rhythmic movements.”

      (13) The Discussion needs to reference the specific Results in all relevant sections.

      We have revised the discussion to explicitly reference the specific results.

      (14) "Flexors and extensors should alternate" - there are circumstances in which flexors and extensors should co-contract. For instance, co-contraction modulates joint stiffness for postural stability and helps generate forces required for fast movements.

      Thanks for pointing this out. We added “However, flexor–extensor co-contraction can also be functionally relevant, such as for modulating joint stiffness during postural stabilization or for generating large forces required for fast movements (Zakotnik et al., 2006; Günzel et al., 2022; Ogawa and Yamawaki 2025). Some generalist 13A neurons could facilitate co-contraction across different leg segments, but none target antagonistic motor neurons controlling the same joint. Therefore, co-contraction within a single joint would require the simultaneous activation of multiple 13A neurons.”

      (15) "While legs alternate between extension and flexion, they remain elevated during grooming. To maintain this posture, some MNs must be continuously activated while their antagonists are inactivated." - this is not necessarily correct. Small limbs, like those of Drosophila, can assume gravity-independent rest angles (10.1523/JNEUROSCI.5510-08.2009).

      We have added this point to the Discussion.

      (16) The discussion "Spatial Mapping of premotor neurons in the nerve cord" seems to me to be making obvious points, and does not need to be included.

      We have now revised this section to highlight the significance of 13A spatial organization, emphasizing premotor topographic mapping, multi-joint movement modules, and parallels to myotopic, proprioceptive, and vertebrate spinal maps.

      (17) Key point, albeit a small one: "Normal activity of these inhibitory neurons is critical for grooming" - the use of the word critical is problematic, and perhaps typical of the tone of the manuscript. These animals still groom when many of these neurons are manipulated, so what does "critical" really mean?

      In this instance, we have changed “critical” to “important”. We observed that silencing or activating a large number (>8) of 13A neurons, or a few 13A and 13B neurons together, completely abolishes grooming in dusted flies, as the flies become paralyzed or their limbs become locked in extreme poses. Therefore, we think the statement that these neurons are critical for grooming is justified. These neurons may contribute to additional behaviors, and there may be partially redundant circuits that can also support grooming. We have revised the manuscript with the intention of clarifying both what we have observed and its limits.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors endeavor to capture the dynamics of emotion-related brain networks. They employ slice-based fMRI combined with ICA on fMRI time series recorded while participants viewed a short movie clip. This approach allowed them to track the time course of four non-noise independent components at an effective 2s temporal resolution at the BOLD level. Notably, the authors report a temporal sequence from input to meaning, followed by response, and finally default mode networks, with significant overlap between stages. The use of ICA offers a data-driven method to identify large-scale networks involved in dynamic emotion processing. Overall, this paradigm and analytical strategy mark an important step forward in shifting affective neuroscience toward investigating temporal dynamics rather than relying solely on static network assessments.

      Strengths:

      (1) One of the main advantages highlighted is the improved temporal resolution offered by slice-based fMRI. However, the manuscript does not clearly explain how this method achieves a higher effective resolution, especially since the results still show a 2s temporal resolution, comparable to conventional methods. Clarification on this point would help readers understand the true benefit of the approach.

      (2) While combining ICA with task fMRI is an innovative approach to study the spatiotemporal dynamics of emotion processing, task fMRI typically relies on modeling the hemodynamic response (e.g., using FIR or IR models) to mitigate noise and collinearity across adjacent trials. The current analysis uses unmodeled BOLD time series, which might risk suffering from these issues.

      (3) The study's claims about emotion dynamics are derived from fMRI data, which are inherently affected by the hemodynamic delay. This delay means that the observed time courses may differ substantially from those obtained through electrophysiology or MEG studies. A discussion of how these fMRI-derived dynamics relate to, or complement, those electrophysiology and MEG findings is critical for the field to understand emotion dynamics.

      (4) Although using ICA to differentiate emotion elements is a convenient approach to tell a story, it may also be misleading. For instance, the observed delayed onset and peak latency of the 'response network' might imply that emotional responses occur much later than other stages, which contradicts many established emotion theories. Given the involvement of large-scale brain regions in this network, the underlying reasons for this delay could be very complex.

      Concerns and suggestions:

      However, I have several concerns regarding the specific presentation of temporal dynamics in the current manuscript and offer the following suggestions.

      (1) One selling point of this work regarding the advantages of testing temporal dynamics is the application of slice-based fMRI, which, in theory, should improve the temporal resolution of the fMRI time course. Improving fMRI temporal resolution is critical for a research project on this topic. The authors present a detailed schematic figure (Figure 2) to help readers understand it. However, I have difficulty understanding the benefits of this method in terms of temporal resolution.

      (a) In Figure 2A, if we examine a specific voxel in slice 2, the slice acquisitions occur at 0.7s, 2.7s, and 4.7s, which implies a temporal resolution of 2s rather than 0.7s. I am unclear on how the temporal resolution could be 0.7s for this specific voxel. I would prefer that the authors clarify this point further, as it would benefit readers who are not familiar with this technology.

      We very much appreciate these concerns as they highlight shortcomings in our explanation of the method. Please note that the main explanation of the method (and comparison with expected HRF and FIR based methods) is done in Janssen et al. (2018, NeuroImage; see further explanations in Janssen et al., 2020). However, to make the current paper more self-contained, we provided further explanation of the Slice-Based method in Figure 2. With respect to the specific concern of the reviewer, in the hypothetical example used in Figure 2, the temporal resolution of the voxel on slice 2 is 0.7s because it combines the acquisitions from stimulus presentations across all trials. Specifically, given the study parameters as outlined in Figures 2A and B, slice 2 samples the state of the brain exactly 0s after stimulus presentation on trial 1 (red color), 0.7s after stimulus presentation on trial 3 (green color), and 1.3s after stimulus presentation on trial 2 (yellow color). Thus, after combining data acquisitions across these three stimulus presentations, slice 2 has sampled the state of the brain at timepoints that are multiples of 0.7s starting from stimulus onset. This is why we say that the theoretical maximum temporal resolution is equal to the TR divided by the number of slices (in the example 2/3 ≈ 0.7s, in the actual experiment 3/39 ≈ 0.08s). In the current study we used temporal binning across timepoints to reduce the temporal resolution (to 2 seconds) and improve the tSNR.

      We have updated the legend of Figure 3 to more clearly explain this issue.
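
      The arithmetic in this explanation can be checked with a minimal sketch. It is an illustration under stated assumptions, not the analysis code of the study: the function names are hypothetical, slices are assumed to be acquired sequentially and evenly within each TR, and the stimulus onsets are made-up values chosen so that the offsets 0s, 0.7s and 1.3s of the Figure 2 example are reproduced. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def theoretical_resolution(tr_s, n_slices):
    """Theoretical maximum temporal resolution: TR divided by number of slices."""
    return Fraction(tr_s) / n_slices


def sampled_offsets(slice_index, tr_s, n_slices, onsets_s, n_volumes):
    """Post-stimulus lags (within one TR) at which a single slice is acquired.

    Assumes sequential, evenly spaced slice acquisition (an assumption of this
    sketch); slice_index is zero-based.
    """
    tr = Fraction(tr_s)
    dt = tr / n_slices
    acquisition_times = [v * tr + slice_index * dt for v in range(n_volumes)]
    lags = set()
    for onset in onsets_s:
        for t in acquisition_times:
            lag = t - onset
            if 0 <= lag < tr:
                lags.add(lag)
    return sorted(lags)


# Hypothetical example: TR = 2 s, 3 slices, second slice (index 1),
# three stimulus onsets chosen for illustration.
onsets = [Fraction(2, 3), Fraction(2), Fraction(10, 3)]
offsets = sampled_offsets(1, 2, 3, onsets, n_volumes=4)
example_resolution = theoretical_resolution(2, 3)      # 2/3 s, about 0.7 s
experiment_resolution = theoretical_resolution(3, 39)  # 1/13 s, about 0.08 s
```

      Here `offsets` contains the lags 0, 2/3 and 4/3 s, i.e. multiples of TR divided by the number of slices, which is the pooling across trials that the response describes.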

      (b) Even with the claim of an increased temporal resolution (0.7s), the actual data (Figure 3) still appears to have a 2s resolution. I wonder what specific benefit slice-based fMRI brings in terms of testing temporal dynamics, aside from correcting the temporal distortions that conventional fMRI exhibits.

      This is a good point. In the current experiment, the TR was 3s, but we extracted the fMRI signal at 2s temporal resolution, which means an increment of 33%. In this study we did not directly compare the impact of different temporal resolutions on the efficacy of detection of network dynamics. Indeed, we agree with the reviewer that there remain many unanswered questions about the issue of temporal resolution of the extracted fMRI signal and the impact on the ability to detect fMRI network dynamics. We think that questions such as those posed by the reviewer should be addressed in future studies that are directly focused on this issue. We have updated our discussion section (pages 21-22) to more clearly reflect this point of view.

      (2) In task-fMRI, the hemodynamic response is usually estimated using a specific model (e.g., FIR, IR model; see Lindquist et al., 2009). These models are effective at reducing noise and collinearity across adjacent trials. The current method appears to be conducted on unmodeled BOLD time series.

      (a) I am wondering how the authors avoid the issues that are typically addressed by these HRF modeling approaches. For example, if we examine the baseline period (say, -4 to 0s relative to stimulus onset), the activation of most networks does not remain around zero, which could be due to delayed influences from the previous trial. This suggests that the current time course may not be completely accurate.

      We thank the reviewer for highlighting this issue. Let us start by reiterating what we stated above: there are many issues related to BOLD signal extraction and fMRI network discovery in task-based fMRI that remain poorly understood and should be addressed in future work. Such work should explore, for example, the impact of using a FIR vs Slice-based method on the discovery of networks in task-fMRI. These studies should also investigate the impact of different types of baselines and baseline durations on the extraction of the BOLD signal and network discovery. For the present purposes, our goal was not to introduce a new technique of fMRI signal extraction, but to show that the slice-based technique, in combination with ICA, can be used to study the brain’s network dynamics in an emotional task. In other words, while we clearly appreciate the reviewer’s concerns and have several other studies underway that directly address these concerns, we believe that such concerns are better addressed in independent research. See our discussion on pages 21-22 that addresses this issue.

      (b) A related question: if the authors take the spatial map of a certain network and apply a modeling approach to estimate a time series within that network, would the results be similar to the current ICA time series?

      Interesting point. Typically in a modeling approach the expected HRF (e.g., the double gamma function) is fitted to the fMRI data. Importantly, this approach produces static maps of the fit between the expected HRF and the data. By contrast, model-free approaches such as FIR or slice-based methods extract the fMRI signal directly from the data without making a priori assumptions about the expected shape of the signal. These approaches do not produce static maps but instead are capable of extracting the whole-brain dynamics during the execution of a task (event-related dynamics). These data-driven approaches (FIR, Slice-Based, etc.) are therefore a necessary first step in the analyses of the dynamics of brain activity during a task. The subsequent step involves the analyses of these complex event-related brain dynamics. In the current paper we suggest that a straightforward way to do this is to use ICA which produces spatial maps of voxels with similar time courses, and hence, yields insights into the temporal dynamics of whole-brain fMRI networks. As we mentioned above, combining ICA with a high temporal resolution data-driven signal is new and there are many new avenues for research in this burgeoning new field.

      (3) Human emotion should be inherently fast to ensure survival, as shown in many electrophysiology and MEG studies. For example, the dynamics of a fearful face can occur within 100ms in subcortical regions (Méndez-Bértolo et al., 2016), and general valence and arousal effects can occur as early as 200ms (e.g., Grootswagers et al., 2020; Bo et al., 2022). In contrast, the time-to-peak or onset timing in the BOLD time series spans a much larger time range due to the hemodynamic delay. fMRI findings indeed add spatial precision to our understanding of the temporal dynamics of emotion, but could the authors comment on how the current temporal dynamics supplement those electrophysiology studies that operate on much finer temporal scales?

      We really like this point. One way that EEG and fMRI are typically discussed is that these two approaches are said to be complementary. While EEG is able to provide information on temporal dynamics, but not spatial localization of brain activity, fMRI cannot provide information on the temporal dynamics, but can provide insights into spatial localization. Our study most directly challenges the latter part of this statement. We believe that by using tasks that highlight “slow” cognition, fMRI can be used to reveal not only spatial but also temporal information of brain activity. The movie task that we used presumably relies on a kind of “slow” cognition that takes place on longer time scales (e.g., the construction of the meaning of the scene). Our results show that with such tasks, whole-brain networks with different temporal dynamics can be separated by ICA, at odds with the claim that fMRI is only good for spatial information. One avenue of future research would be to attempt such “slow” tasks directly with EEG and try to find the electrical correlates of the networks detected in the current study.

      We hope to have answered the concerns of the reviewer.

      (4) The response network shows activation as late as 15 to 20s, which is surprising. Could the authors discuss further why it takes so long for participants to generate an emotional response in the brain?

      We thank the reviewer for this question. Our study design was such that there was an initial movie clip that lasted 12.5s, which was then followed by a two-alternative forced-choice decision task (including a button press, 2.5s), and finally followed by a 10s rest period. We extracted the fMRI signal across this entire 25s period (actually 28s because we also took into account some uncertainty in BOLD signal duration). Network discovery using ICA then showed various networks with distinct time courses (across the 25s period), including one network (IC2 response) that showed a peak around 21s (see Figure 3). Given the properties of the spatial map (e.g., activity in primary motor areas, Figure 4), as well as the temporal properties of its timecourse (e.g., peak close to the response stage of the task), we interpreted this network as related to generating the manual response in the two-alternative forced-choice decision task. Further analyses showed that this aspect of the task (e.g., deciding the emotion of the character in the movie clip) was also sensitive to the emotional content of the earlier movie clip (Figures 6 and 7).

      We have further clarified this aspect of our results (see pages 16-17). We thank the reviewer for pointing this out.

      (5) Related to 4. In many theories, the emotion processing stages-including perception, valuation, and response-are usually considered iterative processes (e.g., Gross, 2015), especially in real-world scenarios. The advantage of the current paradigm is that it incorporates more dynamic elements of emotional stimuli and is closer to reality. Therefore, one might expect some degree of dynamic fluctuation within the tested brain networks to reflect those potential iterative processes (input, meaning, response). However, we still do not observe much brain dynamics in the data. In Figure 5, after the initial onset, most network activations remain sustained for an extended period of time. Does this suggest that emotion processing is less dynamic in the brain than we thought, or could it be related to limitations in temporal resolution? It could also be that the dynamics of each individual trial differ, and averaging them eliminates these variations. I would like to hear the authors' comments on this topic.

      We thank the reviewer for this interesting question. We are assuming the reviewer is referring to Figure 3 and not Figure 5. Indeed, what Figure 3 shows is the average time course of each detected network across all subjects and trial types. This figure therefore does not directly show the difference in dynamics between the different emotions. However, as we show in further analyses that examine how emotion modulates specific aspects of the fMRI signal dynamics (time to peak, peak value, duration) of different networks, there are differences in the dynamics of these networks depending on the emotion (Figures 6 and 7). Thus, our results show that different emotions evoked by movie clips differ in their dynamics. Obviously, generalizing this to say that in general, different emotions have different brain dynamics is not straightforward and would require further study (probably using other tasks, and other emotions). We have updated the discussion section as well as the caption of Figure 3 to better explain this issue (see also comments by reviewer 2).
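
      For readers unfamiliar with such temporal metrics, the following minimal sketch shows one way that time to peak, peak value and duration could be read off a single network time course. It is not the procedure used in the paper: the function name is hypothetical, the example values are invented, and in particular the definition of duration as the contiguous span around the peak that stays at or above half of the peak value is an assumption of this sketch (it also presumes a positive peak).

```python
def timecourse_metrics(signal, dt_s, t0_s=0.0, fraction=0.5):
    """Return (time_to_peak_s, peak_value, duration_s) for one time course.

    Hypothetical helper: duration is the contiguous span around the peak in
    which the signal stays at or above `fraction` of the peak value.
    """
    if not signal:
        raise ValueError("signal must contain at least one sample")
    peak_index = max(range(len(signal)), key=lambda i: signal[i])
    peak_value = signal[peak_index]
    time_to_peak = t0_s + peak_index * dt_s

    threshold = fraction * peak_value
    left = peak_index
    while left > 0 and signal[left - 1] >= threshold:
        left -= 1
    right = peak_index
    while right < len(signal) - 1 and signal[right + 1] >= threshold:
        right += 1
    duration = (right - left + 1) * dt_s
    return time_to_peak, peak_value, duration


# Invented time course sampled every 2 s, for illustration only.
example = [0.0, 0.2, 0.6, 1.0, 0.7, 0.3, 0.1]
metrics = timecourse_metrics(example, dt_s=2.0)
```

      Computing such metrics per emotion condition and per network would give values that could then be compared across conditions.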

      (6) The activation of the default mode network (DMN), although relatively late, is very interesting. Generally, one would expect a deactivation of this network during ongoing external stimulation. Could this suggest that participants are mind-wandering during the later portion of the task?

      Very good point. Indeed this is in line with our interpretation. The late activity of the default mode network could reflect some further processing of the previous emotional experience. More work is required to clarify this further in terms of reflective, mind-wandering or regulatory processing. We have updated our discussion section to better highlight this issue (see page 19).

      We thank the reviewer for their really insightful comments and suggestions!

      Reviewer #2 (Public review):

      Summary:

      This manuscript examined the neural correlates of the temporal-spatial dynamics of emotional processing while participants were watching short movie clips (each 12.5 s long) from the movie "Forrest Gump". Participants not only watched each film clip, but also gave emotional responses, followed by a brief resting period. Employing fMRI to track the BOLD responses during these stages of emotional processing, the authors found four large-scale brain networks (labeled as IC0,1,2,4) were differentially involved in emotional processing. Overall, this work provides valuable information on the neurodynamics of emotional processing.

      Strengths:

      This work employs a naturalistic movie watching paradigm to elicit emotional experiences. The authors used a slice-based fMRI method to examine the temporal dynamics of BOLD responses. Compared to previous emotional research that uses static images, this work provides some new data and insights into how the brain supports emotional processing from a temporal dynamics view.

      Thank you!

      Weaknesses:

      Some major conclusions are unwarranted and do not have relevant evidence. For example, the authors seemed to interpret some neuroimaging results to be related to emotion regulation. However, there were no explicit instructions about emotional regulation, and there was no evidence suggesting participants regulated their emotions. How to best interpret the corresponding results thus requires caution.

      We thank the reviewer for pointing this out. We have updated the limitations section of our Discussion section (page 20) to better qualify our interpretations.

      Relatedly, the authors argued that "In turn, our findings underscore the utility of examining temporal metrics to capture subtle nuances of emotional processing that may remain undetectable using standard static analyses." While this sentence makes sense and is reasonable, it remains unclear how the results here support this argument. In particular, there were only three emotional categories: sad, happy, and fear. These three emotional categories are highly different from each other. Thus, how exactly the temporal metrics captured the "subtle nuances of emotional processing" shall be further elaborated.

      This is an important point. We also discuss this limitation in the “limitations” section of our Discussion (page 20). We again thank the reviewer for pointing this out.

      The writing also contained many claims about the study's clinical utility. However, the authors did not develop their reasoning nor elaborate on the clinical relevance. While examining emotional processing certainly could have clinical relevance, please unpack the argument and provide more information on how the results obtained here can be used in clinical settings.

      We very much appreciate this comment. Note that we did not intend to motivate our study directly from a clinical perspective (because we did not test our approach on a clinical population). Instead, our point is that some researchers (e.g., Kuppens & Verduyn 2017; Waugh et al., 2015) have conceptualized emotional disorders as frequently having a temporal component (e.g., dwelling abnormally long on negative thoughts) and that our technique could be used to examine if temporal dynamics of networks are affected in such disorders. However, as we pointed out, this should be verified in future work. We have updated our final paragraph (page 22) to more clearly highlight this issue. We thank the reviewer for pointing this out.

      Importantly, how are the temporal dynamics of BOLD responses and subjective feelings related? The authors showed that "the time-to-peak differences in IC2 ("response") align closely with response latency results, with sad trials showing faster response latencies and earlier peak times". Does this mean that people typically experience sad feelings faster than happy or fear? Yet this is inconsistent with ideas such that fear detection is often rapid, while sadness can be more sustained. Understandably, the study uses movie clips, which can be very different from previous work, mostly using static images (e.g., a fearful or a sad face). But the authors shall explicitly discuss what these temporal dynamics mean for subjective feelings.

      Excellent point! Our results indeed showed that sad trials had faster reaction times compared to happy and fearful trials, and that this result was reflected in the extracted time-to-peak measures of the fMRI data (see Figure 8D). To us, this primarily demonstrates, as shown in other studies (e.g., Menon et al., 1997), that gross differences detected in behavioral measures can be directly recovered from temporal measures in fMRI data, which is not trivial. However, we do not think we are allowed to make interpretations of the sort suggested by the reviewer (and to be clear: we do not make such interpretations in the paper). Specifically, the faster reaction times on sad trials likely reflect some audio/visual aspect of the movie clips that results in faster reaction times instead of a generalized temporal difference in the subjective experience of sad vs happy/fearful emotions. Presumably the speed with which emotional stimuli influence the brain depends on the context. Perhaps future studies that examine emotional responses while controlling for the audio/visual experience could shed further light on this issue. We have updated the discussion section to address the reviewer’s concern.

      We thank the reviewer for the interesting points which have certainly improved our manuscript!

      Reviewer #1 (Recommendations for the authors):

      Minor:

      (1) Please add the unit to the y-axis in Figure 7, if applicable.

      Done. We have added units.

      (2) Adding a note in the legend of Figure 3 regarding the meaning of the amplitude of the timeseries would be helpful.

      Done. We have added a sentence further explaining the meaning of the timecourse fluctuations.

      Related references:

      (1) Lindquist, M. A., Loh, J. M., Atlas, L. Y., & Wager, T. D. (2009). Modeling the hemodynamic response function in fMRI: efficiency, bias, and mis-modeling. Neuroimage, 45(1), S187-S198.

      (2) Méndez-Bértolo, C., Moratti, S., Toledano, R., Lopez-Sosa, F., Martínez-Alvarez, R., Mah, Y. H., ... & Strange, B. A. (2016). A fast pathway for fear in human amygdala. Nature neuroscience, 19(8), 1041-1049.

      (3) Bo, K., Cui, L., Yin, S., Hu, Z., Hong, X., Kim, S., ... & Ding, M. (2022). Decoding the temporal dynamics of affective scene processing. NeuroImage, 261, 119532.

      (4) Grootswagers, T., Kennedy, B. L., Most, S. B., & Carlson, T. A. (2020). Neural signatures of dynamic emotion constructs in the human brain. Neuropsychologia, 145, 106535.

      (5) Gross, J. J. (2015). The extended process model of emotion regulation: Elaborations, applications, and future directions. Psychological inquiry, 26(1), 130-137.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      “Ejdrup, Gether, and colleagues present a sophisticated simulation of dopamine (DA) dynamics based on a substantial volume of striatum with many DA release sites. The key observation is that a reduced DA uptake rate in the ventral striatum (VS) compared to the dorsal striatum (DS) can produce an appreciable "tonic" level of DA in VS and not DS. In both areas they find that a large proportion of D2 receptors are occupied at "baseline"; this proportion increases with simulated DA cell phasic bursts but has little sensitivity to simulated DA cell pauses. They also examine, in a separate model, the effects of clustering dopamine transporters (DAT) into nanoclusters and say this may be a way of regulating tonic DA levels in VS. I found this work of interest and I think it will be useful to the community. At the same time, there are a number of weaknesses that should be addressed, and the authors need to more carefully explain how their conclusions are distinct from those based on prior models.

      We appreciate that the reviewer finds our work interesting and useful to the community. However, we acknowledge that it is important to discuss how our conclusions differ from those reached with previous models. The original version of the manuscript already discussed our findings in relation to earlier models; this discussion has now been expanded. In particular, we would argue that our simulations, which include updated parameters, portray in vivo conditions more accurately, as is now specifically stated in lines 466-487. Compared to previous models, our data highlight the critical importance of different DAT expression across striatal subregions as a key determinant of differential DA dynamics and differential tonic levels in DS compared to VS. We find that these conclusions are already highlighted in the Abstract and Discussion.

      “(1) The conclusion that even an unrealistically long (1s) and complete pause in DA firing has little effect on DA receptor occupancy is potentially important. The ability to respond to DA pauses has been thought to be a key reason why D2 receptors (may) have high affinity. This simulation instead finds evidence that DA pauses may be useless. This result should be highlighted in the abstract and discussed more.”

      This is an interesting point. We have accordingly carried out new simulations across a range of D2R affinities to assess how this affects the finding that even a long pause in DA firing has little effect on D2 receptor occupancy. Interestingly, the simulations demonstrate that this finding is indeed robust across an order of magnitude in affinity, although the sensitivity to a one-second pause goes up as the affinity reaches 20 nM. The data are shown in a revised Figure S1H. For a description of the results, please see revised text lines 195-197. The topic is now mentioned in the abstract and is commented on further in the Discussion in lines 500-504.

      “(2) The claim of "DAT nanoclustering as a way to shape tonic levels of DA" is not very well supported at present. None of the panels in Figure 4 simply show mean steady-state extracellular DA as a function of clustering. Perhaps mean DA is not the relevant measure, but then the authors need to better define what is and why. This issue may be linked to the fact that DAT clustering is modeled separately (Figure 4) to the main model of DA dynamics (Figures 1-3) which per the Methods assumes even distribution of uptake. Presumably, this is because the spatial resolution of the main model is too coarse to incorporate DAT nanoclusters, but it is still a limitation.”

      We agree with the reviewer that steady-state extracellular DA as a function of DAT clustering is a useful measure. We have therefore simulated the effects of different nanoclustering scenarios on this measure. We found that the extracellular concentrations went from approximately 15 nM for unclustered DAT to more than 30 nM in the densest clustering scenario. These results are shown in revised Figure 4F and described in the revised text in lines 337-349.

      Further, we fully agree that the spatial resolution of the main model is a limitation and, ideally, that the nanoclustering should be combined with the large-scale release simulations. Unfortunately, this would require many orders of magnitude more computational power than currently available.

      “As it stands it is convincing (but too obvious) that DAT clustering will increase DA away from clusters, while decreasing it near clusters. I.e. clustering increases heterogeneity, but how this could be relevant to striatal function is not made clear, especially given the different spatial scales of the models.”

      Thank you for raising this important point. While it is true that DAT clustering increases heterogeneity in DA distribution at the microscopic level, the diffusion rate is, in most circumstances, too fast to permit concentration differences on a spatial scale relevant for nearby receptors. Accordingly, we propose that the primary effect of DAT nanoclustering is to decrease the overall uptake capacity, which in turn increases overall extracellular DA concentrations. Thus, homogeneous changes in extracellular DA concentrations can arise from regulating heterogeneous DAT distribution. An exception to this would be the circumstance where the receptor is located directly next to a dense cluster – i.e. within nanometers. In such cases, local DA availability may be more directly influenced by clustering effects. Please see revised text in lines 354-362 for discussion of this matter.

      “(3) I question how reasonable the "12/40" simulated burst firing condition is, since to my knowledge this is well outside the range of firing patterns actually observed for dopamine cells. It would be better to base key results on more realistic values (in particular, fewer action potentials than 12).”

      We fully agree that this typically is outside the physiological range. The values are included in addition to more realistic values (3/10 and 6/20) to showcase what extreme situations would look like. 

      “(4) There is a need to better explain why "focality" is important, and justify the measure used.”

      We have expanded on the intention of this measure in the revised manuscript (please see lines 266-268).  Thank you for pointing out this lack of clarification.  

      “(5) Line 191: " D1 receptors (-Rs) were assumed to have a half maximal effective concentration (EC50) of 1000 nM" The assumptions about receptor EC50s are critical to this work and need to be better justified. It would also be good to show what happens if these EC50 numbers are changed by an order of magnitude up or down.”

      We agree that these assumptions are critical. Simulations of effective off-rates across a range of EC50 values have now been included in the revised version in Figure 1I and are referred to in lines 188-189.

      “(6) Line 459: "we based our receptor kinetics on newer pharmacological experiments in live cells (Agren et al., 2021) and properties of the recently developed DA receptor-based biosensors (Labouesse & Patriarchi, 2021). Indeed, these sensors are mutated receptors but only on the intracellular domains with no changes of the binding site (Labouesse & Patriarchi, 2021)" 

      This argument is diminished by the observation that different sensors based on the same binding site have different affinities (e.g. in Patriarchi et al. 2018, dLight1.1 has Kd of 330nM while dlight1.3b has Kd of 1600nM).”

      We sincerely thank the reviewer for highlighting this important point. We fully recognize the fundamental importance of absolute and relative DA receptor kinetics for modeling DA actions and acknowledge that differences in affinity estimates from sensor-based measurements highlight the inherent uncertainty in selecting receptor kinetics parameters. While we have based our modeling decisions on what we believe to be the most relevant available data, we acknowledge that the choice of receptor kinetics is a topic of ongoing debate. Importantly, we are making our model available to the research community, allowing others to test their own estimates of receptor kinetics and assess their impact on the model’s behavior. In the revised manuscript, we have further elaborated the rationale behind our parameter choices. Please see revised text in lines 177-178 of the Results section and in lines 481-486 of the Discussion.

      “(7) Estimates of Vmax for DA uptake are entirely based on prior fast-scan voltammetry studies (Table S2). But FSCV likely produces distorted measures of uptake rate due to the kinetics of DA adsorption and release on the carbon fiber surface.”

      We fully agree that this is a limitation of FSCV. However, most of the cited papers attempt to correct for this by way of fitting the output to a multi-parameter model for DA kinetics. If newer literature brings the Vmax values estimated into question, we have made the model publicly available to rerun the simulations with new parameters.

      “(8) It is assumed that tortuosity is the same in DS and VS - is this a safe assumption?”

      The original paper cited does not specify which region the values are measured in. However, a separate paper estimates the rat cerebellum has a comparable tortuosity index (Nicholson and Phillips, J Physiol. 1981), suggesting it may be a rather uniform value across brain regions. This is now mentioned in lines 98-99 and the reference has been included. 

      “(9) More discussion is needed about how the conclusions derived from this more elaborate model of DA dynamics are the same, and different, to conclusions drawn from prior relevant models (including those cited, e.g. from Hunger et al. 2020, etc)”.

      As part of our revision, we have expanded the current discussion of our finding in the context of previous models in the manuscript in lines 466-487.

      Reviewer #2 (Public review): 

      The work presents a model of dopamine release, diffusion, and reuptake in a small (100 micrometers^2 maximum) volume of striatum. This extends previous work by this group and others by comparing dopamine dynamics in the dorsal and ventral striatum and by using a model of immediate dopamine-receptor activation inferred from recent dopamine sensor data. From their simulations, the authors report two main conclusions. The first is that the dorsal striatum does not appear to have a sustained, relatively uniform concentration of dopamine driven by the constant 4Hz firing of dopamine neurons; rather that constant firing appears to create hotspots of dopamine. By contrast, the lower density of release sites and lower rate of reuptake in the ventral striatum creates a sustained concentration of dopamine. The second main conclusion is that D1 receptor (D1R) activation is able to track dopamine concentration changes at short delays but D2 receptor activation cannot. 

      The simulations of the dorsal striatum will be of interest to dopamine aficionados as they throw some doubt on the classic model of "tonic" and "phasic" dopamine actions, further show the disconnect between dopamine neuron firing and consequent release, and thus raise issues for the reward-prediction error theory of dopamine. 

      There is some careful work here checking the dependence of results on the spatial volume and its discretisation. The simulations of dopamine concentration are checked over a range of values for key parameters. The model is good, the simulations are well done, and the evidence for robust differences between dorsal and ventral striatum dopamine concentration is good. 

      “However, the main weakness here is that neither of the main conclusions is strongly evidenced as yet. The claim that the dorsal striatum has no "tonic" dopamine concentration is based on the single example simulation of Figure 1 not the extensive simulations over a range of parameters. Some of those later simulations seem to show that the dorsal striatum can have a "tonic" dopamine concentration, though the measurement of this is indirect. It is not clear why the reader should believe the example simulation over those in the robustness checks, for example by identifying which range of parameter values is more realistic.”

      We appreciate that the reviewer finds our work interesting and carefully performed. The reviewer is correct that DA dynamics, including the presence and level of tonic DA, are parameter-dependent in both the dorsal striatum (DS) and ventral striatum (VS). Indeed, our simulations across a broad range of biological parameters were intended to help readers understand how such variation would impact the model’s outcomes, particularly since many of the parameters remain contested. Naturally, altering these parameters results in changes to the observed dynamics. However, to derive possible conclusions, we selected a subset of parameters that we believe best reflect the physiological conditions, as elaborated in the manuscript. In response to the reviewer’s comment, we have placed greater emphasis on clarifying which parameter values we believe best reflect the physiological conditions (see lines 155-157 and 254-255). Additionally, we have underscored that the distinction between tonic and non-tonic states is not a binary outcome but a parameter-dependent continuum (lines 222-225), one that our model now allows researchers to explore systematically. Finally, we have highlighted how our simulations across parameter space not only capture this continuum but also identify the regimes that produce the most heterogeneous DA signaling, both within and across striatal regions (lines 266-268).

      “The claim that D1Rs can track rapid changes in dopamine is not well supported. It is based on a single simulation in Figure 1 (DS) and 2 (VS) by visual inspection of simulated dopamine concentration traces - and even then it is unclear that D1Rs actually track dynamics because they clearly do not track rapid changes in dopamine that are almost as large as those driven by bursts (cf Figure 1i).”

      We would like to draw attention to Figure 1I, where the claim that D1Rs track rapid changes is supported in more depth (Figure S1 in original manuscript - moved to main figure to highlight this in the revised manuscript). According to this figure, upon coordinated burst firing, the D1R occupancy rapidly increased as diffusion no longer equilibrated the extracellular concentrations on a timescale faster than the receptors, and D1R occupancy closely tracked extracellular DA with a delay on the order of tens of milliseconds. Note that the increases in [DA] from uncoordinated stochastic release events during tonic firing in Figure 1H are too brief to drive D1 signaling, as the DA diffuses into the remaining extracellular space on a timescale of 1-5 ms. This is faster than the receptors' response rate and does not lead to any downstream signaling according to our simulations. This means D1 kinetics are rapid enough to track coordinated signaling on a ~50 ms timescale and slower, but not fast enough to respond to individual release events from tonic activity.

      “The claim also depends on two things that are poorly explained. First, the model of binding here is missing from the text. It seems to be a simple bound-fraction model, simulating a single D1 or D2 receptor. It is unclear whether more complex models would show the same thing.”

      We realize that this is not made clear in the methods and, accordingly, we have updated the method section to elaborate on how we model receptor binding. The model simulates occupied fraction of D1R and D2R in every single voxel of the simulation space. Please see lines 546-555.
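
      As a rough illustration of what a per-voxel occupied-fraction model can look like, the sketch below integrates a first-order binding equation, dB/dt = k_on·[DA]·(1 − B) − k_off·B, with k_on = k_off / Kd. This is only an assumed minimal form, not necessarily the model described in the revised Methods. The function name, the forward-Euler scheme, the time step, and both off-rates are hypothetical placeholders; only the approximate affinities (about 1000 nM for D1R and about 7 nM for D2R) come from the response text.

```python
import numpy as np

def simulate_occupancy(da_nM, dt_s, kd_nM, k_off_per_s, occ0=0.0):
    """Forward-Euler integration of dB/dt = k_on*[DA]*(1 - B) - k_off*B.

    da_nM        : 1D array of DA concentration (nM) in one voxel over time
    dt_s         : time step in seconds (must be small relative to 1/k_off)
    kd_nM        : receptor affinity (nM); k_on is derived as k_off / Kd
    k_off_per_s  : unbinding rate (1/s); HYPOTHETICAL value in the demo below
    """
    k_on = k_off_per_s / kd_nM  # units: 1 / (nM * s)
    occ = np.empty(len(da_nM))
    b = occ0
    for i, c in enumerate(da_nM):
        b += dt_s * (k_on * c * (1.0 - b) - k_off_per_s * b)
        occ[i] = b
    return occ

# Illustrative trace: 20 nM baseline with a 200 ms step to 500 nM.
dt = 0.001
t = np.arange(0.0, 3.0, dt)
da = np.full(t.size, 20.0)
da[(t >= 1.0) & (t < 1.2)] = 500.0

# Affinities follow the text; the off-rates are placeholders, not fitted values.
d1_occ = simulate_occupancy(da, dt, kd_nM=1000.0, k_off_per_s=20.0)
d2_occ = simulate_occupancy(da, dt, kd_nM=7.0, k_off_per_s=0.2)
```

      Applying the same update to every voxel of a simulated concentration field would give an occupancy map per receptor type. With these placeholder rates the low-affinity receptor relaxes quickly after the step while the high-affinity receptor decays slowly, which is the qualitative distinction the response describes.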

      “Second, crucial to the receptor model here is the inference that D1 receptor unbinding is rapid; but this inference is made based on the kinetics of dopamine sensors and is superficially explained - it is unclear why sensor kinetics should let us extrapolate to receptor kinetics, and unclear how safe is the extrapolation of the linear regression by an order of magnitude to get the D1 unbinding rate.”

      We chose to use the sensors because it was possible to estimate precise affinities/off-rates from the fluorescence measurements. Although there might be some variation in affinities that could be attributable to the mutations introduced in the sensors, the data clearly separated D1R and D2R with a D1R affinity of ~1000 nM and a D2R affinity of ~7 nM (Labouesse & Patriarchi, 2021), consistent with earlier predictions of receptor affinities. From our assessment of the literature, we found that this was the most reasonable way to estimate affinities and thereby off-rates. Importantly, the model has been made publicly available, so should new measurements arise, the simulations can be rerun with tweaks to the input parameters. To address the concern, we have also expanded on the logic applied in the updated manuscript (please see lines 177-178).

      Reviewing editor Comments : 

      “The paper could benefit from a critical confrontation not only with existing modeling work as mentioned by the reviewers, but also with existing empirical data on pauses, D2 MSN excitability, and plasticity/learning.”

      We thank both the editor and the reviewers for their suggestions on how to improve the manuscript. We have incorporated further modelling on D1R and D2R response to pauses and bursts and expanded our discussion of the results in relation to existing evidence (please see our responses to the reviewers above and the revised text in the manuscript).

      Reviewer #1 (Recommendations for the authors): 

      “(1) Many figure panels are too small to read clearly - e.g. "cross-section over time" plots.”

      We agree with the reviewer and have increased the size of panels in several of the figures.

      “(2) Supplementary Videos of the model in action might be useful (and fun to watch).”

      Great idea. We have generated videos of both bursts in the 3D projections and the resulting D1R and D2R occupancy in 2D. The videos are included as supplementary material as Videos S1 and S2 and referred to in the text of the revised manuscript.

      “(3) Line 305: " Further, the cusp-like behaviour of Vmax in VS was independent of both Q and R%..."

      It is not clear what the "cusp" refers to here.”

      We agree this is a confusing sentence. We have rewritten and eliminated the use of the vague “cusp” terminology in the manuscript.

      “(4) Line 311: "We therefore reanalysed data from our previously published comparison of fibre photometry and microdialysis and found evidence of natural variations in the release-uptake balance of the mice (Figure 5F,G)" This figure seems to be missing altogether.”

      The mentioned sentence was missing an “S” to indicate a supplementary figure. We apologize for the confusion and have corrected the text.

      “(5) Figure 1:

      1b: need numbers on the color scale.”

      We have added numbers in the updated manuscript.

      “1c: adding an earlier line (e.g. 2ms) could be helpful?”

      We have added a 2 ms line to aid the readers.

      “1d: do the colors show DA concentration on the visible surfaces of the cube or some form of projection?”

      The colors show concentrations on the surface. We have expanded the text to clarify this.

      “1e: is this "cross-section" a randomly-selected line (i.e. 1D) through the cube?”

      The cross-section is midway through the cube. We have clarified this in the text.

      “1f: "density" misspelled.”

      We thank the reviewer for the keen eye. The error has been corrected.

      “1g: color bars indicating stimulation time would be improved if they showed the individual stimulation pulses instead.”

      The burst is simulated as a Poisson distribution and individual pulses may therefore be misleading.

      “Why does the burst simulation include all release sites in a 10x10x10µm cube? Please justify this parameter choice.

      1h: "1/10" - the "10" is meaningless for a single pulse, right?”

      Yes, we agree. 

      “1i: is this the concentration for a single voxel? Or the average of voxels that are all 1µm from one specific release site?”

      Thank you for pointing out the confusing language. The figure is for a voxel containing a release site (with a voxel size of 1 µm in diameter).

      The legend seems a bit different from the description in the main text ("within 1µm"). As it stands, I also can't tell whether the small DA peaks are related to that particular release site, or to others. 

      We have updated the text to clear up the confusing language.

      “(6) Figure 2:

      2h: I'm not sure that the "relative occupancy" normalized measure is the most helpful here.”

      We believe the figure helps illustrate that the sphere of influence on receptors from a single burst is greater in VS than in DS, suggesting that DS can process information with tighter spatial control. Using a relative measure allows for a more accessible comparison of the sphere of influence in a single figure.

      “(7) Figure 3:

      The schematics need improvement.

      3a – would be more useful if it corresponded better to the actual simulation (e.g. we had a spatial scale shown). 

      3d – is this really useful, given the number of molecules shown is so much lower than in the simulation? 

      3h, 3j – need more explanation, e.g. axis labels.”

      The schematics are intended to quickly inform the readers what parameters are tuned in the following figures, and not to be exact representations. However, we agree Figures 3h and 3j need axis labels, and we have accordingly added these.

      (8) Figure 4: 

      4m, n were not clearly explained. 

      We agree and have elaborated the explanation of these figures in the manuscript (lines 374-377).

      “(9) From Figure S1 it appears that the definition of "DS" and "VS" used is above and below the anterior commissure, respectively. This doesn't seem reasonable - many if not most studies of "VS" have examined the nucleus accumbens core, which extends above the anterior commissure. Instead, it seems like the DAT expression difference observed is primarily a difference between accumbens Shell and the rest of the striatum, rather than DS vs VS.”

      We assume that the reviewer refers to Figure S3 and not S1. First, we would like to highlight that we had mislabeled VMAT2 and DAT in Figure S3C (now corrected). Apologies for the confusion. Second, as for striatal subregions, we have intentionally not distinguished between different subregions of the ventral striatum. The majority of the literature we base our parameters on does not distinguish between e.g., NAcC vs. NAcS or DLS vs. DMS. The four slices we examined in Figure 3A-C were not perfectly aligned in the accumbal region, and we therefore do not believe we can draw any conclusions about differences between core and shell.

      Reviewer #2 (Recommendations for the authors): 

      “(1) Modelling assumptions:

      The burst activity simulations seem conceptually flawed. How were release sites assigned to the 150 neurons? The burst activity simulations such as Figure 1g show a spatially localised release, but this means either (1) the release sites for one DA neuron are all locally clustered, or (2) only some release sites for each DA neuron are receiving a burst of APs, those release sites are close together, and the DA neurons' other release sites are not receiving the burst. Either way, this is not plausible.”

      We apologize for the confusion; however, we disagree that the simulations seem conceptually flawed. It is important to note that the burst simulation is spatially restricted to investigate local DA dynamics and how well different parts of the striatum can gate spill-over and receptor activation. The conditions may mimic local action potentials generated by nicotinic receptor activation (see e.g. Liu et al., Science 2022 or Matityahu et al., Nature Comm 2023). We have accordingly expanded on this in the manuscript in lines 148-151.

      “(2) Data and its reporting:

      Comparison to May and Wightman data: if we're meant to compare DS and VS concentrations, then plot them together; what were the experimental results (just says "closely resembled the earlier findings")?”

      Unfortunately, the quantitative values of the May and Wightman (1989) data are not publicly available. We are therefore limited to visual comparison and cannot replot the values.

      “Figures S3b and c do not agree: Figure S3b shows DAT staining dropping considerably in VS; Fig 3c does not, and neither do the quoted statistics.”

      We had accidentally mixed up the labels in Figure S3c. Thank you for spotting this. We have corrected this in the updated manuscript.

      “How robust are the results of simulations of the same parameter set? Figures S3D and E imply 5 simulations per burst paradigm, but these are not described.”

      The bursts are simulated with a Poisson distribution as described in Methods under Three-dimensional finite difference model. This induces a stochastic variation in the simulations that mimics the empirical observations (see Dreyer et al., J. Neurosci., 2010).
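
      To make the source of this run-to-run variation concrete, the sketch below draws Poisson-distributed spike counts per time step for a small set of neurons. It is only an assumed illustration of Poisson firing in general, not the authors' implementation; the function name, firing rate, duration, time step, and neuron count are all hypothetical.

```python
import numpy as np

def poisson_spike_counts(rate_hz, duration_s, dt_s, n_neurons, seed=None):
    """Draw spike counts per neuron and time step from a Poisson distribution.

    The expected count per step is rate_hz * dt_s, so two runs with identical
    parameters but different seeds give different spike patterns.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(duration_s / dt_s))
    return rng.poisson(rate_hz * dt_s, size=(n_neurons, n_steps))

# Hypothetical burst: 20 Hz mean rate for 0.3 s, 1 ms steps, 5 neurons.
run_a = poisson_spike_counts(20.0, 0.3, 0.001, n_neurons=5, seed=1)
run_b = poisson_spike_counts(20.0, 0.3, 0.001, n_neurons=5, seed=2)

# Total spikes per neuron vary around the expected value of 20 * 0.3 = 6.
spikes_per_neuron_a = run_a.sum(axis=1)
spikes_per_neuron_b = run_b.sum(axis=1)
```

      Repeating such draws for a fixed parameter set is one way to obtain a handful of stochastic replicates per burst paradigm.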

      “I found it rather odd that the robustness of the receptor binding results is not checked across the changes in model parameters. This seems necessary because most of the changes, such as increasing the quantal release or the number of sites, will obviously increase dopamine concentration, but they do not necessarily meaningfully increase receptor activation because of saturation (and, in more complex receptor binding models, because of the number of available receptors).”

      This is an excellent point. However, we decided not to address this in the present study as we would argue that such additional simulations are not a necessity for our main conclusions. Instead, we decided in the revised version to focus on simulations mirroring a range of different receptor affinities as described in detail above. 

      “Figure 4H: how can unclustered simulations have a different concentration at the centre of a "cluster" than outside, when the uptake is homogenous? Why is clustering of DAT "efficient"? [line 359]”

      This is a great observation. The drop is compared to the average of the simulation space. Despite no clusters, the uniform scenario still has a concentration gradient towards the surface of the varicosity. We have elaborated on this in the manuscript on lines 346-349.

      “The Discussion conclusions about what D1Rs and D2Rs cannot track are not tested in the paper (e.g. ramps). Either test them or make clear what is speculation.”

      This is an excellent point: some of the claims in the discussion were not fully supported. We have added a simulation with a chain of burst firings to highlight how the temporal integration differs between the two receptors and updated the wording in the discussion to exclude ramps, as this was not explicitly tested. See lines 191-193 and Figure S1G.

      “(3) Organisation of paper:

      Consistency of terminology. These terms seem to be used to describe the same thing, but it is unclear if they are: release sites, active terminals (Table 1), varicosity density. Likewise: release probability, release fraction.”

      Thank you for pointing this out. We have revised the manuscript and cleared up terminology on release sites. However, release probability and release-capable fraction of varicosities are two separate concepts.

      “The references to the supplementary figure are not in sequence, and the panels assigned to the supplemental figures seem arbitrary in what is assigned to each figure and their ordering. As Figures 1 and 2 are to be directly compared, so plot the same results in each. Figure S1F is discussed as a key result, but is in a supplemental figure.”

      Thank you for identifying this. We have updated the figure references and moved Figure S1F into the main figures, as we agree this is a main finding.

      “The paper frequently reads as a loose collection of observations of simulations. For example, why look at the competitive inhibition of DA by cocaine [Fig 3H-I]? The nanoclustering of DAT (Figure 4) seems to be partial work from a different paper - it is unclear why the Vmax results warrant that detailed treatment here, especially as no rationale is offered for why we would want Vmax to change.”

      We apologize if the paper reads as a loose collection of observations of simulations. This is certainly not the intention. As for the cocaine competition, we used it because it modulates the Km value for DA and because we wanted to examine how dependent the dopamine dynamics are on changes to different parameters in the model (Km in this case). We noticed that Vmax had different effects in DS and VS. Accordingly, we gave it particular focus because it is a physiological parameter that can be modified and, if modified, can have a potentially large impact on striatal DA dynamics. Importantly, it is well known that the DA transporter (DAT) is subject to cellular regulation of its surface expression, e.g. by internalization/recycling, and thereby of uptake capacity (Vmax). Furthermore, we present evidence in this study that uptake capacity can be modulated on a much faster time scale by nanoclustering, which suggests a potentially novel type of synaptic plasticity. We find this rather interesting and therefore decided to focus on it in the manuscript.
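
      The relationship between competitive inhibition and Km referred to here can be sketched with the standard Michaelis-Menten rate law, in which a competitive inhibitor leaves Vmax unchanged and raises the apparent Km to Km·(1 + [I]/Ki). The snippet below is a generic textbook form under that assumption, not the uptake term of the authors' model; the function name and all numerical values are hypothetical.

```python
def uptake_rate(da, vmax, km, inhibitor=0.0, ki=1.0):
    """Michaelis-Menten uptake with an optional competitive inhibitor.

    da        : extracellular DA concentration
    vmax      : maximal uptake rate (same concentration unit per second)
    km        : Michaelis constant of the transporter for DA
    inhibitor : inhibitor concentration (e.g. a competitive blocker)
    ki        : inhibition constant of that inhibitor
    """
    km_apparent = km * (1.0 + inhibitor / ki)  # competitive inhibition scales Km only
    return vmax * da / (km_apparent + da)

# Hypothetical values in arbitrary units: the inhibitor slows uptake at a
# given DA concentration, while the saturating rate (vmax) stays the same.
rate_control = uptake_rate(da=1.0, vmax=4.0, km=1.0)
rate_inhibited = uptake_rate(da=1.0, vmax=4.0, km=1.0, inhibitor=0.5, ki=0.5)
```

      Changing `vmax` instead rescales uptake at every concentration, which is why the two parameters can be varied separately when probing the model.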

      “What are the axes in Figure 3H and Figure 3J?”

      We have updated the figures to include axis labels. Thank you for pointing out this omission.

      “Much is made of the sensitivity to Vmax in VS versus DS, but this was hard work to understand. It took me a while to work out that Figure 3K was meant to indicate the range of Vmax that would be changed in VS and DS respectively. "Cusp-like behaviour" (line 305) is unclear.”

      We agree that the original language was unclear, including the term “cusp-like behaviour”. We have updated the description and removed the confusing terminology. See line 366.

      “The treatment of highly relevant prior work, especially that of Hunger et al 2020 and Dreyer et al (2010, 2014), is poor, being dismissed in a single paragraph late in the Discussion rather than explicating how the current paper's results fit into the context of that work. The authors may also want to discuss the anticipation of their conclusions by Wickens and colleagues, including dopamine hotspots (https://doi.org/10.1016/j.tins.2006.12.003) and differences between DS and VS dopamine release (https://doi.org/10.1196/annals.1390.016).”

      We thank the reviewer for the suggested discussion points and have included and discussed references to the work by Wickens and colleagues (see lines 407-411 and 418-420).

      “(4) Methods:

      Clarify the FSCV simulations: the function I_FSCV was convolved with the simulated [DA] signal?”

      Yes. We have clarified this in the method section on lines 593-594.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review): 

      Summary:

      The authors of this study sought to define a role for IgM in responses to house dust mites in the lung. 

      Strengths: 

      Unexpected observation about IgM biology 

      Combination of experiments to elucidate function 

      Weaknesses: 

      Would love more connection to human disease 

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is, however, some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency, with other studies reporting rates as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear, as these patients generally have normal IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

      Reviewer #2 (Public Review): 

      Summary: 

      The manuscript by Hadebe and colleagues describes a striking reduction in airway hyperresponsiveness in Igm-deficient mice in response to HDM, OVA and papain across the B6 and BALB-c backgrounds. The authors suggest that the deficit is not due to improper type 2 immune responses, nor an aberrant B cell response, despite a lack of class switching in these mice. Through RNA-Seq approaches, the authors identify few differences between the lungs of WT and Igm-deficient mice, but see that two genes involved in actin regulation are greatly reduced in IgM-deficient mice. The authors target these genes by CRISPR-Cas9 in in vitro assays of smooth muscle cells to show that these may regulate cell contraction. While the study is conceptually interesting, there are a number of limitations, which stop us from drawing meaningful conclusions.

      Strengths:

      Fig. 1. The authors clearly show that IgM-KO mice have strikingly reduced AHR in the HDM model, despite the presence of a good cellular B cell response.

      Weaknesses: 

      Fig. 2. The authors characterize the CD4 T cell response to HDM in IgM-KO mice. They have restimulated medLN cells with anti-CD3 for 5 days to look for IL-4 and IL-13, and find no discernible difference between WT and KO mice. The absence of PBS-treated WT and KO mice in this analysis means it is unclear if HDM-challenged mice are showing IL-4 or IL-13 levels above that seen at baseline in this assay.

      We thank the Reviewer for this comment. We would like to mention that only a very minimal level of IL-4 and IL-13 was detected in PBS mice. We have added a dotted line to Figure 2B to show cytokine levels in unstimulated or naïve mice. Please see Author response image 1 below, from anti-CD3-stimulated cytokine ELISA data. The levels of these cytokines are very low (not detectable) and do not differ between control WT and IgM-KO mice challenged with PBS; this is also true for PMA/ionomycin-stimulated cells.

      Author response image 1.

      The choice of 5 days is strange, given that the response the authors want to see is in already primed cells. A 1-2 day assay would have been better. 

      We agree with the reviewer that a shorter stimulation period would work. Over the years we have settled on a 5-day re-stimulation for both anti-CD3 and HDM. We have tried other time points, but we consistently get better secretion of cytokines after 5 days.

      It is concerning that the authors state that HDM restimulation did not induce cytokine production from medLN cells, since countless studies have shown that restimulation of medLN would induce IL-13, IL-5 and IL-10 production from medLN. This indicates that the sensitization and challenge model used by the authors is not working as it should. 

      We thank the reviewer for this observation. In our recent paper showing how antigen load affects B cell function, we used very low levels of HDM to sensitise and challenge mice (1 µg and 3 µg respectively); see the article below, Hadebe et al., 2021 JACI. We did so because labs that have used these low HDM levels also suggested that antigen load impacts B cell function, especially their role in germinal centres. We believe the reason we see low or undetectable levels of cytokines is this low antigen load at sensitisation and challenge. In other manuscripts we have published or are about to publish, we have shown that normal HDM sensitisation load (1 µg or 100 µg) and challenge (10 µg) do induce cytokine release upon restimulation with HDM. See the article below by Khumalo et al., 2020 JCI Insight (Figure 4A).

      Sabelo Hadebe*, Jermaine Khumalo, Sandisiwe Mangali, Nontobeko Mthembu, Hlumani Ndlovu, Amkele Ngomti, Martyna Scibiorek, Frank Kirstein, Frank Brombacher*. Deletion of IL-4Ra signalling on B cells limits hyperresponsiveness depending on antigen load. doi.org/10.1016/j.jaci.2020.12.635

      Jermaine Khumalo, Frank Kirstein, Sabelo Hadebe*, Frank Brombacher*. IL-4Rα signalling in regulatory T cells is required for dampening allergic airway inflammation through inhibition of IL-33 by type 2 innate lymphoid cells. JCI Insight. 2020 Oct 15;5(20):e136206. doi: 10.1172/jci.insight.136206

      The IL-13 staining shown in panel c is also not definitive. One should be able to optimize their assays to achieve a better level of staining, to my mind. 

      We agree with the reviewer that a much higher frequency of IL-13-producing CD4 T cells should be observed. We don’t think this is a technical glitch or a non-optimal set-up, as we see much higher levels of IL-13-producing CD4 T cells when using higher doses of HDM to sensitise and challenge, say between 7-20% in WT mice (see Author response image 2 of lung stimulated with PMA/ionomycin+Monensin; please note this is for illustration purposes only and is not linked to the current manuscript, it merely demonstrates a point from other experiments we have conducted in the lab).

      Author response image 2.

      In d-f, the authors perform a serum transfer, but they only do this once. The half life of IgM is quite short. The authors should perform multiple naïve serum transfers to see if this is enough to induce FULL AHR. 

      We thank the reviewer for this comment. We apologise if this was not clear enough in the Figure legend and methods. We did transfer serum 3x: a day before sensitisation, on the day of sensitisation and a day before the challenge, to circumvent the short half-life of IgM. In our subsequent experiments, we have now used busulfan to deplete all bone marrow in IgM-deficient mice and replace it with WT bone marrow, and this method restores AHR (Figure 3B).

      This now appears in line 515 to 519 and reads

      Adoptive transfer of naïve serum

      Naïve wild-type mice were euthanised and blood was collected via cardiac puncture before being spun down (5500rpm, 10min, RT) to collect serum. Serum (200µL) was injected intraperitoneally into IgM-deficient mice. Serum was injected intraperitoneally at day -1, 0, and a day before the challenge with HDM (day 10).

      The presence of negative values of total IgE in panel F would indicate some errors in calculation of serum IgE concentrations. 

      We thank the reviewer for this observation. For better clarity, we have now indicated these values as undetected in Figure 2F, as they were below our detection limit.

      Overall, it is hard to be convinced that IgM-deficiency does not lead to a reduction in Th2 inflammation, since the assays appear suboptimal. 

      We disagree with the reviewer in this instance, because we have shown in 3 different models, in 2 different strains and at 2 doses of HDM (high and low) that, no matter what you do, Th2 remains intact. Our reason for choosing low dose HDM was based on our previous work and that of others, which showed that depending on antigen load, B cells can either be redundant or have functional roles. Since our interest was to tease out the role of B cells and specifically IgM, it was important that we look at a scenario where B cells are known to have a function (low antigen load). We obtained similar findings at high dose of HDM load: effects on AHR were not as strong, but Th2 was not changed, and in some instances Th2 was in fact higher in IgM-deficient mice.

      Fig. 3. Gene expression differences between WT and KO mice in PBS and HDM challenged settings are shown. PCA analysis does not show clear differences between all four groups, but genes are certainly up and downregulated, in particular when comparing PBS to HDM challenged mice. In both PBS and HDM challenged settings, three genes stand out as being upregulated in WT v KO mice. These are Baiap2l1, Erdr1 and Chil1.

      Noted

      Fig. 4. The authors attempt to quantify BAIAP2L1 in mouse lungs. It is difficult to know if the antibody used really detects the correct protein. A BAIAP2L1-KO is not used as a control for staining, and I am not sure if competitive assays for BAIAP2L1 can be set up. The flow data is not convincing. The immunohistochemistry shows BAIAP2L1 (in red) in many, many cells, essentially throughout the section. There is also no discernible difference between WT and KO mice, which one might have expected based on the RNA-Seq data. So, from my perspective, it is hard to say if/where this protein is located, and whether there truly exists a difference in expression between WT and KO mice.

      We thank the reviewer for this comment. We are certain that the antibody does detect BAIAP2L1. We have used it in 3 assays, which we admit may show varying specificities since it is a polyclonal antibody. However, in our western blot (Figure 5A), the antibody detects a band at 56.7 kDa, apart from what we think are isoforms. We agree that BAIAP2L1 is expressed by many cell types, including CD45+ cells and alpha smooth muscle negative cells, and we show this in our Figure 5 – figure supplement 1A and B. Where we think there is a difference in expression between WT and IgM-deficient mice is in alpha-smooth muscle-positive cells. We have tested antibodies from different companies (Proteintech and Abcam), and we obtain similar findings. We do not have access to BAIAP2L1 KO mice. To test specificity, we have instead used single-stain controls with or without secondary antibody and an isotype control, which show no binding in western blot and immunofluorescence assays, and fluorescence-minus-one controls in flow cytometry. On that basis, we are convinced that the signal we are seeing is specific to BAIAP2L1.

      Here we have also added additional Flow cytometry images using anti-BAIAP2L1 (clone 25692-1-AP) from Proteintech

      Author response image 3.

      Figure similar to Figure 5C and Figure 5 -figure supplement 1A and B.

      Fig. 5 and 6. The authors use a single cell contractility assay to measure whether BAIAP2L1 and ERDR1 impact on bronchial smooth muscle cell contractility. I am not familiar with the assay, but it looks like an interesting way of analysing contractility at the single cell level.

      The authors state that targeting these two genes with Cas9gRNA reduces smooth muscle cell contractility, and the data presented for contractility supports this observation. However, the efficiency of Cas9-mediated deletion is very unclear. The authors present a PCR in supp fig 9c as evidence of gene deletion, but it is entirely unclear with what efficiency the gene has been deleted. One should use sequencing to confirm deletion. Moreover, if the antibody was truly working, one should be able to use the antibody used in Fig 4 to detect BAIAP2L1 levels in these cells. The authors do not appear to have tried this.

      We thank the reviewer for these observations. We are in the process of optimising this using new polyclonal BAIAP2L1 antibodies from other companies, since the one we have tried doesn’t seem to work well on human cells via western blot. We hope that in our new version we will be able to demonstrate this by immunofluorescence or western blot.

      Other impressions: 

      The paper is lacking a link between the deficiency of IgM and the effects on smooth muscle cell contraction.

      The levels of IL-13 and TNF in lavage of WT and IGMKO mice could be analysed. 

      We have measured the Th2 cytokine IL-13 in BAL fluid and found no differences between IgM-deficient mice and WT mice challenged with HDM (Author response image 4 below). We could not detect TNF-alpha in the BAL fluid; it was below the detection limit.

      Figure legend. IL-13 levels are not changed in IgM-deficient mice in the lung. Bronchoalveolar lavage fluid in WT or IgM-deficient mice sensitised and challenged with HDM. TNF-a levels were below the detection limit.

      Author response image 4.

      Moreover, what is the impact of IgM itself on smooth muscle cells? In the Fig. 7 schematic, are the authors proposing a direct role for IgM on smooth muscle cells? Does IgM in cell culture media induce contraction of SMC? This could be tested and would be interesting, to my mind. 

      We thank the Reviewer for these comments. We are still trying to test this. Unfortunately, we have experienced delays in getting reagents such as human IgM to South Africa. We hope that we will be able to add this in subsequent versions of the article. We agree it is an interesting experiment to do, if not for this manuscript then for our general understanding of this interaction, at least in an in vitro system.

      Reviewer #3 (Public Review): 

      Summary: 

      This paper by Sabelo et al. describes a new pathway by which lack of IgM in the mouse lowers bronchial hyperresponsiveness (BHR) in response to methacholine in several mouse models of allergic airway inflammation in Balb/c mice and C57/Bl6 mice. Strikingly, loss of IgM does not lead to less eosinophilic airway inflammation, Th2 cytokine production or mucus metaplasia, but to a selective loss of BHR. This occurs irrespective of the dose of allergen used. This was important to address since several prior models of HDM allergy have shown that the contribution of B cells to airway inflammation and BHR is dose dependent.

      After a description of the phenotype, the authors try to elucidate the mechanisms. There is no loss of B cells in these mice. However, there is a lack of class switching to IgE and IgG1, with a concomitant increase in IgD. Restoring immunoglobulins with transfer of naïve serum in IgM deficient mice leads to restoration of allergen-specific IgE and IgG1 responses, although the paper does not really explain how this might work. There is also no restoration of IgM responses, and concomitantly, the phenotype of reduced BHR still holds when serum is given, leading the authors to conclude that the mechanism is IgE and IgG1 independent. Wild type B cell transfer also does not restore IgM responses, due to lack of engraftment of the B cells. Next, the authors do whole lung RNA sequencing and pinpoint reduced BAIAP2L1 mRNA as the culprit of the phenotype of IgM-/- mice. However, this cannot be validated fully on protein levels and immunohistology since differences between WT and IgM KO are not statistically significant, and B cell and IgM restoration are impossible. The histology and flow cytometry seem to suggest that expression is mainly found in alpha smooth muscle positive cells, which could still be smooth muscle cells or myofibroblasts. Next, therefore, the authors move to CRISPR knock down of BAIAP2L1 in a human smooth muscle cell line, and show that loss leads to less contraction of these cells in vitro in a microscopic FLECS assay, in which smooth muscle cells bind to elastomeric contractible surfaces.

      Strengths: 

      (1) There is a strong reduction in BHR in IgM-deficient mice, without alterations in B cell number, disconnected from effects on eosinophilia or Th2 cytokine production.

      (2) BAIAP2L1 has never been linked to asthma in mice or humans 

      Weaknesses: 

      (1) While the observations of reduced BHR in IgM deficient mice are strong, there is insufficient mechanistic underpinning on how loss of IgM could lead to reduced expression of BAIAP2L1. Since it is impossible to restore IgM levels by either serum or B cell transfer and since protein levels of BAIAP2L1 are not significantly reduced, there is a lack of a causal relationship that this is the explanation for the lack of BHR in IgM-deficient mice. The reader is unclear if there is a fundamental (maybe developmental) difference in non-hematopoietic cells in these IgM-deficient mice (which might have accumulated another genetic mutation over the years). In this regard, it would be important to know if littermates were newly generated, or historically bred along with the KO line.

      We thank the reviewer for asking this question and getting us to think of this in a different way. This prompted us to use a different method to try to restore IgM function and, since our animal facility no longer allows irradiation, we opted for busulfan. We present this as new data in Figure 3. We had to go back and breed this strain and then generate bone marrow chimeras. What we have now shown with chimeras is that we can deplete bone marrow from IgM-deficient mice and replace it with congenic WT bone marrow, allowing these mice to rest for 2 months before challenge with HDM (Figure 3 -figure supplement 1A-C). We also show that AHR (resistance and elastance) is partially restored in this way (Figure 3A and B): mice that receive congenic WT bone marrow after chemical irradiation can mount AHR, whereas those that receive IgM-deficient bone marrow cannot mount AHR upon challenge with HDM. If the mice had accumulated an unknown genetic mutation in non-hematopoietic cells, the transfer of WT bone marrow would not make a difference. So, we don’t believe the colony could have gained a mutation that we are unaware of. We have also shipped these mice to other groups and, in their hands, this strain still behaves only as an IgM-only knockout mouse. See their publication below.

      Mark Noviski, James L Mueller, Anne Satterthwaite, Lee Ann Garrett-Sinha, Frank Brombacher, Julie Zikherman 2018. IgM and IgD B cell receptors differentially respond to endogenous antigens and control B cell fate. eLife 2018;7:e35074. DOI: https://doi.org/10.7554/eLife.35074

      We have also added methods for bone marrow chimaeras, as well as results sections and new Figures related to these methods.

      Methods appear in line 521-532 of the untracked version of the article.

      Busulfan Bone marrow chimeras

      WT (CD45.2) and IgM<sup>-/-</sup> (CD45.2) congenic mice were treated with 25 mg/kg busulfan (Sigma-Aldrich, Aston Manor, South Africa) per day for 3 consecutive days (75 mg/kg in total) dissolved in 10% DMSO and Phosphate buffered saline (0.2mL, intraperitoneally) to ablate bone marrow cells. Twenty-four hours after last administration of busulfan, mice were injected intravenously with fresh bone marrow (10x10<sup>6</sup> cells, 100µL) isolated from hind leg femurs of either WT (CD45.1) or IgM<sup>-/-</sup> mice [33]. Animals were then allowed to complement their haematopoietic cells for 8 weeks. In some experiments the level of bone marrow ablation was assessed 4 days post-busulfan treatment in mice that did not receive donor cells. At the end of experiment level of complemented cells were also assessed in WT and IgM<sup>-/-</sup> mice that received WT (CD45.1) bone marrow.

      Results appear in line 198-228 of the untracked version of the article

      Replacement of IgM-deficient mice with functional hematopoietic cells in busulfan chimeric mice restores airway hyperresponsiveness.

      We then generated bone marrow chimeras by chemical radiation using busulfan (Montecino-Rodriguez and Dorshkind, 2020). We treated mice three times with busulfan for 3 consecutive days and after 24 hrs transferred naïve bone marrow from congenic CD45.1 WT mice or CD45.2 IgM KO mice (Figure 3A and Figure 3 -figure supplement 1A). We showed that recipient mice that did not receive donor bone marrow after 4 days post-treatment had significantly reduced lineage markers (CD45<sup>+</sup>Sca-1<sup>+</sup>) or lineage negative (Lin<sup>-</sup>) cells in the bone marrow when compared to untreated or vehicle (10% DMSO) treated mice (Figure 3 -figure supplements 1B-C). We allowed mice to reconstitute bone marrow for 8 weeks before sensitisation and challenge with low dose HDM (Figure 3A). We showed that WT (CD45.2) recipient mice that received WT (CD45.1) donor bone marrow had higher airway resistance and elastance and this was comparable to IgM KO (CD45.2) recipient mice that received donor WT (CD45.1) bone marrow (Figure 3B). As expected, IgM KO (CD45.2) recipient mice that received donor IgM KO (CD45.2) bone marrow had significantly lower AHR compared to WT (CD45.2) or IgM KO (CD45.2) recipient mice that received WT (CD45.1) bone marrow (Figure 3B). We confirmed that the differences observed were not due to differences in bone marrow reconstitution as we saw similar frequencies of CD45.1 cells within the lymphocyte populations in the lungs and other tissues (Figure 3 -figure supplement 1D). We observed no significant changes in the lung neutrophils, eosinophils, inflammatory macrophages, CD4 T cells or B cells in WT or IgM KO (CD45.2) recipient mice that received donor WT (CD45.1/CD45.2) or IgM KO (CD45.2) bone marrow when sensitised and challenged with low dose HDM (Figure 3C).

      Restoring IgM function through adoptive reconstitution with congenic CD45.1 bone marrow in non-chemically irradiated recipient mice or sorted B cells into IgM KO mice (Figure 2 -figure supplement 1A) did not replenish IgM B cells to levels observed in WT mice and as a result did not restore AHR, total IgE and IgM in these mice (Figure 2 -figure supplements 1B-C). 

      The 2 new figures are Figure 3, which moved the rest of the Figures down, and Figure 3 - figure supplement 1A-D, which also moved the rest of the supplementary figures down.

      Discussion appears in lines 410-419 of the untracked version of the article.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma ray irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      (2) There is no mention of the potential role of complement in activation of AHR, which might be altered in IgM-deficient mice   

      We thank the reviewer for this comment. We have not directly looked at complement in this instance. However, in our previous work, C3 knockout mice showed AHR comparable to WT mice under HDM challenge.

      (3) What is the contribution of elevated IgD to the phenotype of the IgM-deficient mice? It has been described by this group that IgD levels are clearly elevated.

      We thank the reviewer for this question. We believe that IgD is essentially what drives partial class switching to IgG. We have certainly shown, in the case of VSV virus, Trypanosoma congolense and Trypanosoma brucei brucei, that elevated IgD drives delayed but effective IgG in the absence of IgM (Lutz et al, 2001, Nature). This is also supported by the Noviski et al., 2018 eLife study, which shows that IgM and IgD do share some endogenous antigens, so it’s likely that external antigens can activate IgD in a similar manner to prompt class switching.

      (4) How can transfer of naïve serum in class switching deficient IgM KO mice lead to restoration of allergen specific IgE and IgG1? 

      We thank the Reviewer for these comments. We believe that naïve sera transferred to IgM-deficient mice are able to bind to the surface of B cells via IgM receptors (FcμR / Fcα/μR), which are still present on B cells, and that this is sufficient to facilitate class switching. Our IgM KO mouse lacks both membrane-bound and secreted IgM, and transferred serum contains at least secreted IgM, which can bind to surfaces via its Fc portion. We measured HDM-specific IgE and found very low levels, but these were not different between WT and IgM KO mice adoptively transferred with WT serum. We also detected HDM-specific IgG1 in IgM KO mice transferred with WT sera at the same level as in WT mice, confirming possible class switching; of course, we can’t rule out that transferred sera also contain some IgG1. We also can’t rule out that elevated IgD levels are partially responsible for class-switched IgG1, as discussed above.

      In the discussion line 463-464, we also added the following

      “We speculate that IgM can directly activate smooth muscle cells by binding a number of its surface receptors including FcμR, Fcα/μR and pIgR (Liu et al., 2019; Nguyen et al., 2017b; Shibuya et al., 2000). IgM binds to FcμR strictly, but shares Fcα/μR and pIgR with IgA (Liu et al., 2019; Michaud et al., 2020; Nguyen et al., 2017b). Both Fcα/μR and pIgR can be expressed by non-structural cells at mucosal sites (Kim et al., 2014; Liu et al., 2019). We would not rule out that the mechanisms of muscle contraction might be through one of these IgM receptors, especially the ones expressed on smooth muscle cells (Kim et al., 2014; Liu et al., 2019). Certainly, our future studies will be directed towards characterizing the mechanism by which IgM potentially activates the smooth muscle.”

      We have discussed this section under Discussion section, line 731 to 757. In addition, since we have now performed bone marrow chimaeras we have further added the following in our discussion in line 410-419.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma ray irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      We removed the following lines, after performing bone marrow chimaeras since this changed some aspects. 

      Our efforts to adoptively transfer wild-type bone marrow or sorted B cells into IgM-deficient mice were also largely unsuccessful partly due to poor engraftment of wild-type B cells into secondary lymphoid tissues. Natural secreted IgM is mainly produced by B1 cells in the peritoneal cavity, and it is likely that any transfer of B cells via bone marrow transfer would not be sufficient to restore soluble levels of IgM<sup>3,10</sup>.

      (5) Alpha smooth muscle antigen is also expressed by myofibroblasts. This is insufficiently worked out. The histology mentions "expression in cells in close contact with smooth muscle". This needs more detail since it is a very vague term. Is it in smooth muscle or in myofibroblasts?

      We appreciate that alpha-smooth muscle actin-positive cells are a small fraction in the lung and even within CD45 negative cells, but their contribution to airway hyperresponsiveness is major. We also concede that by immunofluorescence BAIAP2L1 seems to be expressed by cells adjacent to alpha-smooth muscle actin (Figure 5B), however, we know that cells close to smooth muscle (such as extracellular matrix and myofibroblasts) contribute to its hypertrophy in allergic asthma.

      James AL, Elliot JG, Jones RL, Carroll ML, Mauad T, Bai TR, et al. Airway Smooth Muscle Hypertrophy and Hyperplasia in Asthma. Am J Respir Crit Care Med [Internet]. 2012; 185:1058–64. Available from: https://doi.org/10.1164/rccm.201110-1849OC

      (6) Have polymorphisms in BAIAP2L1 ever been linked to human asthma? 

      No, we have looked in asthma GWAS studies, at least summary statistics and we have not seen any SNPs that could be associated with human asthma.

      (7) IgM deficient patients are at increased risk for asthma. This paper suggests the opposite. So the translational potential is unclear 

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency, as the reviewer correctly points out; other studies have reported rates as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear as, generally, these patients have normal or higher IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors trained a variational autoencoder (VAE) to create a high-dimensional "voice latent space" (VLS) using extensive voice samples, and analyzed how this space corresponds to brain activity through fMRI studies focusing on the temporal voice areas (TVAs). Their analyses included encoding and decoding techniques, as well as representational similarity analysis (RSA), which showed that the VLS could effectively map onto and predict brain activity patterns, allowing for the reconstruction of voice stimuli that preserve key aspects of speaker identity.

      Strengths:

      This paper is well-written and easy to follow. Most of the methods and results were clearly described. The authors combined a variety of analytical methods in neuroimaging studies, including encoding, decoding, and RSA. In addition to commonly used DNN encoding analysis, the authors performed DNN decoding and resynthesized the stimuli using VAE decoders. Furthermore, in addition to machine learning classifiers, the authors also included human behavioral tests to evaluate the reconstruction performance.

      Weaknesses:

      This manuscript presents a variational autoencoder (VAE) to evaluate voice identity representations from brain recordings. However, the study's scope is limited by testing only one model, leaving unclear how generalizable or impactful the findings are. The preservation of identity-related information in the voice latent space (VLS) is expected, given the VAE model's design to reconstruct original vocal stimuli. Nonetheless, the study lacks a deeper investigation into what specific aspects of auditory coding these latent dimensions represent. The results in Figure 1c-e merely tested a very limited set of speech features. Moreover, there is no analysis of how these features and the whole VAE model perform in standard speech tasks like speech recognition or phoneme recognition. It is not clear what kind of computations the VAE model presented in this work is capable of. Inclusion of comparisons with state-of-the-art unsupervised or self-supervised speech models known for their alignment with auditory cortical responses, such as Wav2Vec2, HuBERT, and Whisper, would strengthen the validation of the VAE model and provide insights into its relative capabilities and limitations.

      The claim that the VLS outperforms a linear model (LIN) in decoding tasks does not significantly advance our understanding of the underlying brain representations. Given the complexity of auditory processing, it is unsurprising that a nonlinear model would outperform a simpler linear counterpart. The study could be improved by incorporating a comparative analysis with alternative models that differ in architecture, computational strategies, or training methods. Such comparisons could elucidate specific features or capabilities of the VLS, offering a more nuanced understanding of its effectiveness and the computational principles it embodies. This approach would allow the authors to test specific hypotheses about how different aspects of the model contribute to its performance, providing a clearer picture of the shared coding in VLS and the brain.

      The manuscript overlooks some crucial alternative explanations for the discriminant representation of vocal identity. For instance, the discriminant representation of vocal identity can be either a higher-level abstract representation or a lower-level coding of pitch height. Prior studies using fMRI and ECoG have identified both types of representation within the superior temporal gyrus (STG) (e.g., Tang et al., Science 2017; Feng et al., NeuroImage 2021). Additionally, the methodology does not clarify whether the stimuli from different speakers contained identical speech content. If the speech content varied across speakers, the approach of averaging trials to obtain a mean vector for each speaker (the "identity-based analysis") may not adequately control for confounding acoustic-phonetic features. Notably, the principal component 2 (PC2) in Figure 1b appears to correlate with absolute pitch height, suggesting that some aspects of the model's effectiveness might be attributed to simpler acoustic properties rather than complex identity-specific information.

      Methodologically, there are issues that warrant attention. In characterizing the autoencoder latent space, the authors initialized logistic regression classifiers 100 times and calculated the t-statistics using degrees of freedom (df) of 99. Given that logistic regression is a convex optimization problem typically converging to a global optimum, these multiple initializations of the classifier were likely not entirely independent. Consequently, the reported degrees of freedom and the effect size estimates might not accurately reflect the true variability and independence of the classifier outcomes. A more careful evaluation of these aspects is necessary to ensure the statistical robustness of the results.

      We thank Reviewer #1 for their thoughtful and constructive comments. Below, we address the key points raised:

      New comparative models. We agree that many open questions remain about the structure of the VLS and the specific aspects of auditory coding that its latent dimensions represent. The features tested in Figure 1c-e are not speech features but aspects of speaker identity: age, gender and unique identity. Nevertheless, we agree that the VLS could be compared to recent speech models (not available when we started this project): we have now included comparisons with Wav2Vec and HuBERT in the encoding section (new Figure 2-S3). The comparison of encoding results based on LIN, the VLS, Wav2Vec and HuBERT indicates no clear superiority of one model over the others; rather, different sets of voxels are better explained by different models. Interestingly, all four models yielded their best encoding results for the mTVA and aTVA, indicating some consistency across models.

      On decoding directly from spectrograms. We have now added decoding results obtained directly from spectrograms, as requested in the private review. These are presented in the revised Figure 4 and allow for comparison with the LIN- and VLS-based reconstructions. As noted, spectrogram-based reconstructions sounded less voice-like and less faithful to the original, confirming that the latent spaces capture more abstract voice representations that are closer to cerebral ones.

      On the number and length of stimuli. The rationale for using a large number of brief, randomly spliced speech excerpts from different languages was to extract identity features independent of specific linguistic cues. Indeed, PC2 could very well correlate with pitch; however, we were not able to extract reliable f0 information from the thousands of brief stimuli, many of which are largely inharmonic (e.g., fricatives), so this assumption could not be tested empirically. Such a correlation would nonetheless be meaningful: although the average fundamental frequency of phonation is not a linguistic cue, it is a major acoustical feature differentiating speaker identities.

      Statistics correction. To address the issue of potential dependence between multiple runs of logistic regression, we replaced our previous analysis with a Wilcoxon signed-rank test comparing decoding accuracies to chance. The results remain significant across classifications, and the revised figure and text reflect this change.

      Reviewer #2 (Public Review):

      Summary:

      Lamothe et al. collected fMRI responses to many voice stimuli in 3 subjects. The authors trained two different autoencoders on voice audio samples and predicted latent space embeddings from the fMRI responses, allowing the voice spectrograms to be reconstructed. The degree to which reconstructions from different auditory ROIs correctly represented speaker identity, gender, or age was assessed by machine classification and human listener evaluations. Complementing this, the representational content was also assessed using representational similarity analysis. The results broadly concur with the notion that temporal voice areas are sensitive to different types of categorical voice information.

      Strengths:

      The single-subject approach that allows thousands of responses to unique stimuli to be recorded and analyzed is powerful. The idea of using this approach to probe cortical voice representations is strong and the experiment is technically solid.

      Weaknesses:

      The paper could benefit from more discussion of the assumptions behind the reconstruction analyses and the conclusions it allows. The authors write that reconstruction of a stimulus from brain responses represents 'a robust test of the adequacy of models of brain activity' (L138). I concur that stimulus reconstruction is useful for evaluating the nature of representations, but the notion that they can test the adequacy of the specific autoencoder presented here as a model of brain activity should be discussed at more length. Natural sounds are correlated in many feature dimensions and can therefore be summarized in several ways, and similar information can be read out from different model representations. Models trained to reconstruct natural stimuli can exploit many correlated features and it is quite possible that very different models based on different features can be used for similar reconstructions. Reconstructability does not by itself imply that the model is an accurate brain model. Non-linear networks trained on natural stimuli are arguably not tested in the same rigorous manner as models built to explicitly account for computations (they can generate predictions and experiments can be designed to test those predictions). While it is true that there is increasing evidence that neural network embeddings can predict brain data well, it is still a matter of debate whether good predictability by itself qualifies DNNs as 'plausible computational models for investigating brain processes' (L72). This concern is amplified in the context of decoding and naturalistic stimuli where many correlated features can be represented in many ways. It is unclear how much the results hinge on the specificities of the specific autoencoder architectures used. For instance, it would be useful to know the motivations for why the specific VAE used here should constitute a good model for probing neural voice representations.

      Relatedly, it is not clear how VAEs as generative models are motivated as computational models of voice representations in the brain. The task of voice areas in the brain is not to generate voice stimuli but to discriminate and extract information. The task of reconstructing an input spectrogram is perhaps useful for probing information content, but discriminative models, e.g., trained on the task of discriminating voices, would seem more obvious candidates. Why not include discriminatively trained models for comparison?

      The autoencoder learns a mapping from latent space to well-formed voice spectrograms. Regularized regression then learns a mapping between this latent space and activity space. All reconstructions might sound 'natural', which simply means that the autoencoder works. It would be good to have a stronger test of how close the reconstructions are to the original stimulus. For instance, is the reconstruction the closest stimulus to the original in latent space coordinates out of using the experimental stimuli, or where does it rank? How do small changes in beta amplitudes impact the reconstruction? The effective dimensionality of the activity space could be estimated, e.g. by PCA of the voice samples' contrast maps, and it could then be estimated how the main directions in the activity space map to differences in latent space. It would be good to get a better grasp of the granularity of information that can be decoded/ reconstructed.

      What can we make of the apparent trend that LIN is higher than VLS for identity classification (at least VLS does not outperform LIN)? A general argument of the paper seems to be that VLS is a better model of voice representations compared to LIN as a 'control' model. Then we would expect VLS to perform better on identity classification. The age and gender of a voice can likely be classified from many acoustic features that may not require dedicated voice processing.

      The RDM results reported are significant only for some subjects and in some ROIs. This presumably means that results are not significant in the other subjects. Yet, the authors assert general conclusions (e.g. the VLS better explains RDM in TVA than LIN). An assumption typically made in single-subject studies (with large amounts of data in individual subjects) is that the effects observed and reported in papers are robust in individual subjects. More than one subject is usually included to hint that this is the case. This is an intriguing approach. However, reports of effects that are statistically significant in some subjects and some ROIs are difficult to interpret. This, in my view, runs contrary to the logic and leverage of the single-subject approach. Reporting results that are only significant in 1 out of 3 subjects and inferring general conclusions from this seems less convincing.

      The first main finding is stated as being that '128 dimensions are sufficient to explain a sizeable portion of the brain activity' (L379). What qualifies this? From my understanding, only models of that dimensionality were tested. They explain a sizeable portion of brain activity, but it is difficult to follow what 'sizable' is without baseline models that estimate a prediction floor and ceiling. For instance, would autoencoders that reconstruct any spectrogram (not just voice) also predict a sizable portion of the measured activity? What happens to reconstruction results as the dimensionality is varied?

      A second main finding is stated as being that the 'VLS outperforms the LIN space' (L381). It seems correct that the VAE yields more natural-sounding reconstructions, but this is a technical feature of the chosen autoencoding approach. That the VLS yields a 'more brain-like representational space' I assume refers to the RDM results where the RDM correlations were mainly significant in one subject. For classification, the performance of features from the reconstructions (age/ gender/ identity) gives results that seem more mixed, and it seems difficult to draw a general conclusion about the VLS being better. It is not clear that this general claim is well supported.

      It is not clear why the RDM was not formed based on the 'stimulus GLM' betas. The 'identity GLM' is already biased towards identity and it would be stronger to show associations at the stimulus level.

      Multiple comparisons were performed across ROIs, models, subjects, and features in the classification analyses, but it is not clear how correction for these multiple comparisons was implemented in the statistical tests on classification accuracies.

      Risks of overfitting and bias are a recurrent challenge in stimulus reconstruction with fMRI. It would be good with more control analyses to ensure that this was not the case. For instance, how were the repeated test stimuli presented? Were they intermingled with the other stimuli used for training or presented in separate runs? If intermingled, then the training and test data would have been preprocessed together, which could compromise the test set. The reconstructions could be performed on responses from independent runs, preprocessed separately, as a control. This should include all preprocessing, for instance, estimating stimulus/identity GLMs on separately processed run pairs rather than across all runs. Also, it would be good to avoid detrending before GLM denoising (or at least testing its effects) as these can interact.

      We appreciate Reviewer #2’s careful reading and numerous suggestions for improving clarity and presentation. We have implemented the suggested text edits, corrected ambiguities, and clarified methodological details throughout the manuscript. In particular, we have toned down several sentences that we agree were making strong claims (L72, L118, L378, L380-381).

      Clarifications, corrections and additional information:

      We streamlined the introduction by reducing overly specific details and better framing the VLS concept before presenting specifics.

      We clarified the motivation for the age classification split and corrected several inaccuracies and ambiguities in the methods, including the hearing thresholds, balancing of category levels, and stimulus energy selection procedure.

      We provided additional information on the temporal structure of runs and experimental stimuli selection.

      We corrected the description of technical issues affecting one participant and ensured all acronyms are properly defined in the text and figure legends.

      We confirmed that audiograms were performed repeatedly to monitor hearing thresholds and clarified our use of robust scaling and normalization procedures.

      Regarding the test of RDM correlations, we clarified in the text that multiple comparisons were corrected using a permutation-based framework.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, Lamothe et al. sought to identify the neural substrates of voice identity in the human brain by correlating fMRI recordings with the latent space of a variational autoencoder (VAE) trained on voice spectrograms. They used encoding and decoding models, and showed that the "voice" latent space (VLS) of the VAE performs, in general, (slightly) better than a linear autoencoder's latent space. Additionally, they showed dissociations in the encoding of voice identity across the temporal voice areas.

      Strengths:

      The geometry of the neural representations of voice identity has not been studied so far. Previous studies on the content of speech and faces in vision suggest that such geometry could exist. This study demonstrates this point systematically, leveraging a specifically trained variational autoencoder. 

      The size of the voice dataset and the length of the fMRI recordings ensure that the findings are robust.

      Weaknesses:

      Overall, the VLS is often only marginally better than the linear model across analysis, raising the question of whether the observed performance improvements are due to the higher number of parameters trained in the VAE, rather than the non-linearity itself. A fair comparison would necessitate that the number of parameters be maintained consistently across both models, at least as an additional verification step.

      The encoding and RSM results are quite different. This is unexpected, as similar embedding geometries between the VLS and the brain activations should be reflected by higher correlation values of the encoding model.

      The consistency across participants is not particularly high, for instance, S1 seemed to have demonstrated excellent performances, while S2 showed poor performance.

      An important control analysis would be to compare the decoding results with those obtained by a decoder operating directly on the latent spaces, in order to further highlight the interest of the non-linear transformations of the decoder model. Currently, it is unclear whether the non-linearity of the decoder improves the decoding performance, considering the poor resemblance between the VLS and brain-reconstructed spectrograms.

      We thank Reviewer #3 for their comments. In response:

      Code and preprocessed data are now available as indicated in the revised manuscript.

      While we appreciate the suggestion to display supplementary analyses as boxplots split by hemisphere, we opted to retain the current format as we do not have hypotheses regarding hemispheric lateralization, and the small sample size per hemisphere would preclude robust conclusions.

      We confirmed that the identities in Figure 3a are indeed ordered by age and have clarified this in the legend.

      The higher variance observed in correlations for the aTVA in Figure 3b reflects the small number of data points (3 participants × 2 hemispheres), and this is now explained.

      Regarding the cerebral encoding of gender and age, we acknowledge this interesting pattern. Prior work (e.g., Charest et al., 2013) found overlapping processing regions for voice gender without clear subregional differences in the TVAs. Evidence on voice age encoding remains sparse, and we highlight this novel finding in our discussion.

      We again thank the reviewers for their insightful comments, which have greatly improved the quality and clarity of our work.

      Reviewer #1 (Recommendations For The Authors):

      (1) A set of recent advances has shown that embeddings of unsupervised/self-supervised speech models align with auditory responses to speech in the temporal cortex (e.g., Wav2Vec2: Millet et al., NeurIPS 2022; HuBERT: Li et al., Nat Neurosci 2023; Whisper: Goldstein et al., bioRxiv 2023). These models are known to preserve a variety of speech information (phonetics, linguistic information, emotions, speaker identity, etc.) and perform well in a variety of downstream tasks. These other models should be evaluated or at least discussed in the study.

      We fully agree - the pace of progress in this area of voice technology has been incredible. Many of these models were not yet available at the time this work started so we could not use them in our comparison with cerebral representations.

      We have now implemented Reviewer #1’s suggestion and evaluated Wav2Vec and HuBERT. The results are presented in supplementary Figure 2-S3. Correlations between the activity predicted by each model and the measured activity were globally comparable with those obtained with the LIN and VLS models. Interestingly, both HuBERT and Wav2Vec yielded their highest correlations in the mTVA and, to a lesser extent, the aTVA, as did the LIN and VLS models.

      (2) The test statistics of the results in Fig 1c-e need to be revised. Given that logistic regression is a convex optimization problem typically converging to a global optimum, these multiple initializations of the classifier were likely not entirely independent. Consequently, the reported degrees of freedom and the effect size estimates might not accurately reflect the true variability and independence of the classifier outcomes. A more careful evaluation of these aspects is necessary to ensure the statistical robustness of the results. 

      We thank Reviewer #1 for pointing out this important issue regarding the potential dependence between multiple runs of the logistic regression model. To address this concern, we have revised our analyses and used a Wilcoxon signed-rank test to compare the decoding accuracy to chance level. The results showed that the accuracy was significantly above chance for all classifications (Wilcoxon signed-rank test, all W=15, p=0.03125). We updated Figure 1c-e and the corresponding text (L154-L155) to reflect the revised analysis. Because the focus of this section is to probe the informational content of the autoencoder’s latent spaces, and since there are only 5 decoding accuracy values per model, we dropped the inter-model statistical test.
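
      The reported values can be checked by hand: with five accuracies all above chance, the signed-rank sum is W = 1+2+3+4+5 = 15, and only one of the 2^5 equally likely sign patterns reaches it, giving p = 1/32 = 0.03125. The sketch below reproduces this arithmetic. It assumes a one-sided exact test with no ties, and the accuracy values are hypothetical placeholders, not the study's data.

```python
from itertools import product

def exact_wilcoxon_greater(diffs):
    """Exact one-sided Wilcoxon signed-rank test (H1: median > 0).

    Assumes no zero differences and no tied absolute values.
    Returns (W_plus, p_value).
    """
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 each rank is positive or negative with probability 1/2,
    # so enumerate all 2**n sign patterns to build the null distribution.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if w >= w_plus:
            count += 1
    return w_plus, count / 2 ** n

# Hypothetical accuracies for 5 evaluations of a binary task (chance = 0.5).
accuracies = [0.91, 0.88, 0.93, 0.90, 0.87]
w, p = exact_wilcoxon_greater([a - 0.5 for a in accuracies])
print(w, p)
```

      With only five values, 0.03125 is the smallest p-value this one-sided test can return, whatever the size of the effect.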

      (3) In Line 198, the authors discuss the number of dimensions used in their models. To provide a comprehensive comparison, it would be informative to include direct decoding results from the original spectrograms alongside those from the VLS and LIN models. Given the vast diversity in vocal speech characteristics, it is plausible that the speaker identities might correlate with specific speech-related features also represented in both the auditory cortex and the VLS. Therefore, a clearer understanding of the original distribution of voice identities in the untransformed auditory space would be beneficial. This addition would help ascertain the extent to which transformations applied by the VLS or LIN models might be capturing or obscuring relevant auditory information.

      We have now implemented Reviewer #1’s suggestion. The graphs on the right of panel b in the revised Figure 4 now show decoding results obtained from regression performed directly on the spectrograms, rather than on latent representations of them, for our two example test stimuli. These reconstructions can be listened to and compared with the LIN- and VLS-based reconstructions in Supplementary Audio 2. Compared with LIN and VLS, the SPEC-based reconstructions sounded much less voice-like and less similar to the original, indicating that the latent spaces indeed capture more abstract voice representations, more similar to cerebral ones.

      Reviewer #2 (Recommendations For The Authors): 

      L31: 'in voice' > consider rewording (from a voice?).

      L33: consider splitting sentence (after interactions). 

      L39: 'brain' after parentheses. 

      L45-: certainly DNNs 'as a powerful tool' extend to audio (not just image and video) beyond their use in brain models. 

      L52: listened to / heard. 

      L63: use second/s consistently. 

      L64: the reference to Figure 5D is maybe a bit confusing here in the introduction. 

      We thank Reviewer #2 for these recommendations, which we have implemented.

      L79-88: this section is formulated in a way that is too detailed for the introduction text (confusing to read). Consider a more general introduction to the VLS concept here and the details of this study later. 

      L99-: again, I think the experimental details are best saved for later. It's good to provide a feel for the analysis pipeline here, but some of the details provided (number of averages, denoising, preprocessing), are anyway too unspecific to allow the reader to fully follow the analysis. 

      Again, thank you for these suggestions for improving readability: we have modified the text accordingly.

      L159: what was the motivation for classifying age as a 2-class classification problem? Rather than more classes or continuous prediction? How did you choose the age split? 

      The motivation for using two age classes was to match the binary gender classification task and allow a more direct comparison. The cutoff (30 years) was not driven by any scientific consideration but by a practical one: it is the median age in our stimulus set. This is now clarified in the manuscript (L149).

      L263: Is the test of RDM correlation>0 corrected for multiple comparisons across ROIs, subjects, and models?

      The test of RDM correlation>0 was indeed corrected for multiple comparisons across models using the permutation-based ‘maximum statistics’ framework (described in Giordano et al., 2023 and Maris & Oostenveld, 2007). This framework was applied separately for each ROI and subject. It was described in the Methods (L745) but not clearly enough in the main text; we thank Reviewer #2 and have clarified it there (L246, L260-L261).
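
      For readers unfamiliar with the maximum-statistic approach, the idea can be sketched as follows. This is an illustration of the general technique, not the analysis code used in the study: the function name, the correlation values, and the "+1" convention for the p-value are all assumptions made for the example.

```python
def max_stat_pvalues(observed, null_stats):
    """Family-wise corrected p-values via the maximum-statistic method.

    observed:   list of observed statistics, one per model.
    null_stats: list of permutations; each entry holds one statistic
                per model, computed on the same permuted data.
    """
    # For every permutation, keep only the largest statistic across models.
    null_max = [max(perm) for perm in null_stats]
    n_perm = len(null_max)
    # Corrected p-value: how often the null maximum reaches the observed
    # value (the +1 counts the observed data as one permutation).
    return [
        (1 + sum(m >= obs for m in null_max)) / (1 + n_perm)
        for obs in observed
    ]

# Hypothetical RDM correlations for two models (e.g. LIN, VLS) in one ROI.
observed = [0.05, 0.30]
# Hypothetical null correlations from 4 permutations (2 models each).
null_stats = [[0.02, 0.10], [0.08, -0.01], [-0.03, 0.04], [0.12, 0.06]]
p_corrected = max_stat_pvalues(observed, null_stats)
print(p_corrected)
```

      Because every model is compared against the same distribution of maxima, the family-wise error rate is controlled across models within a given ROI and subject. A real analysis would use thousands of permutations rather than four.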

      L379: 'these stimuli' - weren't the experimental stimuli different from those used to train the V/AE? 

      We thank Reviewer #2 for spotting this issue. Indeed, the experimental stimuli are different from those used to train the models. We corrected the text to reflect this distinction (L84-L85).

      L443: what are 'technical issues' that prevented subject 3 from participating in 48 runs?? 

      We thank Reviewer #2 for pointing out the ambiguity in our previous statement. Participant 3 actually experienced personal health concerns that prevented them from completing the whole number of runs. We corrected this to provide a more accurate description (L442-L443).

      L444: participants were instructed to 'stay in the scanner'!? Do you mean 'stay still', or something? 

      We thank the Reviewer for spotting this forgotten word. We have corrected the passage (L444).

      L463: Hearing thresholds of 15 dB: do you mean that all had thresholds lower than 15 dB at all frequencies and at all repeated audiogram measurements? 

      We thank Reviewer #2 for spotting this error: we meant thresholds below 15 dB HL. This has been corrected (L463). Indeed, participants underwent several audiograms between fMRI sessions to ensure that the scanner noise in these repeated sessions caused no hearing loss.

      L472: were the 4 category levels balanced across the dataset (in number of occurrences of each category combination)? 

      The dataset was fully balanced, with an equal number of samples for each combination of language, gender, age, and identity. Furthermore, to minimize potential adaptation effects, the stimuli were also balanced within each run according to these categories, and identity was balanced across sessions. We made this clearer in the Main voice stimuli section (L492-L496).

      L482: the test stimuli were selected as having high energy by the amplitude envelope. It is unclear what this means (how is the envelope extracted, what feature of it is used to measure 'high energy'?) 

      The selection of sounds with high energy was based on analyzing the amplitude envelope of each signal, which was extracted using the Hilbert transform and then filtered to refine the envelope. This envelope, which represents the signal's intensity over time, was used to measure the energy of each stimulus, and those that exceeded an arbitrary threshold were selected. From this pool of high-energy stimuli, likely including vowels, we selected six stimuli to be repeated during the scanning session, then reconstructed via decoding. This has been clarified in the text (L483-L484). 
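
      The selection procedure described above can be illustrated with a minimal sketch. It is not the authors' code: the moving-average smoothing stands in for the unspecified envelope filter, the energy measure (mean squared envelope) and the threshold value are assumptions, and the two test tones are hypothetical stimuli. The analytic signal is computed with a plain DFT so that the example needs only the standard library.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal via a plain DFT (O(n^2), fine for short examples)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    half = (n + 1) // 2
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue  # keep the Nyquist bin unchanged
        # Double positive frequencies, zero negative ones.
        spectrum[k] *= 2 if k < half else 0
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def smoothed_envelope(x, window=5):
    """Magnitude of the analytic signal, smoothed by a moving average."""
    env = [abs(z) for z in analytic_signal(x)]
    half = window // 2
    out = []
    for t in range(len(env)):
        chunk = env[max(0, t - half): t + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def energy(x):
    """Mean squared envelope, used here as the 'energy' of a stimulus."""
    env = smoothed_envelope(x)
    return sum(e * e for e in env) / len(env)

def tone(amplitude, n=64, cycles=4):
    """Pure tone with a whole number of cycles in the window."""
    return [amplitude * math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

# Two hypothetical stimuli: a loud and a quiet tone.
stimuli = {"loud": tone(1.0), "quiet": tone(0.2)}
threshold = 0.5  # arbitrary cutoff; this particular value is an assumption
selected = [name for name, x in stimuli.items() if energy(x) > threshold]
print(selected)
```

      For real recordings, a library routine such as an FFT-based Hilbert transform followed by a low-pass filter would be the practical choice; the DFT here only keeps the sketch dependency-free.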

      L500 was the audio filtered to account for the transfer function of the Sensimetrics headphones? 

      We did not perform any filtering, as the transfer function of the Sensimetrics is already very satisfactory as is. This has been clarified in the text (L503).

      L500: what does 'comfortable level' correspond to and was it set per session (i.e. did it vary across sessions)? 

      By comfortable we mean around 85 dB SPL. The audio settings were kept similar across sessions. This has been added to the text (L504).

      L526- does the normalization imply that the reconstructed spectrograms are normalized? Were the reconstructions then scaled to undo the normalization before inversion? 

      The paragraph on spectrogram standardization was not well placed, which caused confusion. We have moved it to a more suitable location, in the Deep learning section (L545-L550).

      L606: does the identity GLM model the denoised betas from the first GLM or simply the BOLD data? The text indicates the latter, but I suspect the former. 

      Indeed: this has been clarified (L601-L602).

      L704: could you unpack this a bit more? It is not easy to see why you specify the summing in the objective. Shouldn't this just be the ridge objective for a given voxel/ROI? Then you could just state it in matrix notation. 

      Thanks for pointing this out: we kept the formula unchanged but clarified the text, in particular by specifying that i indexes the voxel (L695).

      L716: you used robust scaling for the classifications in latent space but haven't mentioned scaling here. Are we to assume that the same applies?  

      Indeed, we also used robust scaling here; this is now made clear (L710-L711).

      L720: Pearson correlation as a performance metric and its variance will depend on the choice of test/train split sizes. Can you show that the results generalize beyond your specific choices? Maybe report the explained variance as well to get a better idea of performance.

      We used a standard 80/20 split. We think it is beyond the scope of this study to examine the different possible choices of splits, and prefer not to spend additional time on this point which we think is relatively minor.

      Could you specify (somewhere) the stimulus timing in a run? ISI and stimulus duration are mentioned in different places, but it would be nice to have a summary of the temporal structure of runs.

      This is now clarified at the beginning of the Methods section (L437-441).

      Reviewer #3 (Recommendations For The Authors):

      Code and data are not currently available. 

      Code and preprocessed data are now available (L826-827).

      In the supplementary material, it would be beneficial to present the different analyses as boxplots, as in the main text, but with the ROIs in the left and right hemispheres separated, to better show potential hemispheric effect. Although this information is available in the Supplementary Tables, it is currently quite tedious to access it. 

      Although we provide the complete data split by hemisphere in the Tables, we do not believe it is relevant to illustrate left/right differences, as we do not have any hypotheses regarding hemispheric lateralization, and we would in any case be underpowered to test them with only three data points per hemisphere.

      In Figure 3a, it might be beneficial to order the identities by age for each gender in order to more clearly illustrate the structure of the RDMs.

      The identities are indeed already ordered by increasing age: we now make this clear.

      In Figure 3b, the variance for the correlations for the aTVA is higher than in other regions, why? 

      Please note that the error bar indicates variance across only 6 data points (3 subjects x 2 hemispheres) such that some fluctuations are to be expected.

      Please make sure that all acronyms are defined, and that they are redefined in the figure legends. 

      This has been done.

      Gender and age are primarily encoded by different brain regions (Figure 5, pTVA vs aTVA). How does this finding compare with existing literature?

      This interesting finding was not expected. The cerebral processing of voice gender has been investigated by several groups including ours (Charest et al., 2013, Cerebral Cortex). Using an fMRI-adaptation design optimized using a continuous carry-over protocol and voice gender continua generated by morphing, we found that regions dealing with acoustical differences between voices of varying gender largely overlapped with the TVAs, without clear differentiation between the different subparts. Evidence for the role of the different TVAs in voice age processing remains scarce.

    1. Author response:

      Reviewer #1 (Public review):

      (1) It might be good to further discuss potential molecular mechanisms for increasing the TF off rate (what happens at the mechanistic level). 

      This is now expanded in the Discussion.

      (2) To improve readability, it would be good to make consistent font sizes on all figures to make sure that the smallest font sizes are readable. 

      We have normalised figure text as much as is feasible.

      (3) upDARs and downDARs - these abbreviations are defined in the figure legend but not in the main text. 

      We have removed references to these terms from the text and included a definition in the figure legend. 

      (4) Figure 3B - the on-figure legend is a bit unclear; the text legend does not mention the meaning of "DEG". 

      We have removed this panel as it was confusing and did not demonstrate any robust conclusion. 

      (5) The values of apparent dissociation rates shown in Figure 5 are a bit different from values previously reported in literature (e.g., see Okamoto et al., 2023, PMC10505915). Perhaps the authors could comment on this. Also, it would be helpful to add the actual equation that was used for the curve fitting to determine these values to the Methods section.

      We have included an explanation of the curve fitting equation in the Methods as suggested.

      The apparent dissociation rate observed is a sum of multiple rates of decay: the true dissociation rate (k<sub>off</sub>), signal loss caused by photobleaching (k<sub>pb</sub>), and signal loss caused by defocusing/tracking error (k<sub>tl</sub>).

      k<sub>off</sub><sup>app</sup> = k<sub>off</sub> + k<sub>pb</sub> + k<sub>tl</sub>

      We are making conclusions about relative changes in k<sub>off</sub><sup>app</sup> upon CHD4 depletion, not about the absolute magnitude of true k<sub>off</sub> or TF residence times. Our conclusions extend to true k<sub>off</sub> based on the assumption that k<sub>pb</sub> and k<sub>tl</sub> are equal across all samples imaged due to identical experimental conditions and analysis.

      k<sub>pb</sub> and k<sub>tl</sub> vary hugely across experimental set-ups, especially with different laser powers, so other k<sub>off</sub> or k<sub>off</sub><sup>app</sup> values reported in the literature would be expected to differ from ours. Time-lapse experiments or independent determination of k<sub>pb</sub> (and k<sub>tl</sub>) would be required to make any statements about absolute values of k<sub>off</sub>.
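      The argument above can be illustrated with a minimal numerical sketch. All rate constants below are invented for illustration and are not values from the study; the function name `apparent_off_rate` is likewise hypothetical. The sketch only shows that, if the photobleaching and tracking-loss terms are shared across conditions, they cancel in the difference between two apparent rates but not in their ratio.

```python
# Illustrative rate constants in 1/s; every number here is made up.
K_PB = 0.05   # photobleaching rate (assumed identical across samples)
K_TL = 0.02   # tracking-loss / defocusing rate (assumed identical across samples)

def apparent_off_rate(k_off, k_pb=K_PB, k_tl=K_TL):
    """Apparent dissociation rate as the sum of the three decay rates."""
    return k_off + k_pb + k_tl

k_off_control = 0.10    # hypothetical true off-rate before depletion
k_off_depleted = 0.06   # hypothetical true off-rate after depletion

app_control = apparent_off_rate(k_off_control)
app_depleted = apparent_off_rate(k_off_depleted)

# The shared terms cancel in the difference, so the shift in the apparent
# rate equals the shift in the true off-rate.
delta_app = app_depleted - app_control
delta_true = k_off_depleted - k_off_control

# The ratio does not cancel: the apparent fold-change is closer to 1 than
# the true fold-change, so it understates the size of the effect.
ratio_app = app_depleted / app_control
ratio_true = k_off_depleted / k_off_control
```

      Under these assumptions the direction of a change in the apparent rate matches the direction of the change in the true rate, which is the sense in which relative comparisons remain valid without knowing k<sub>pb</sub> or k<sub>tl</sub>.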

      (6) Regarding the discussion about the functionality of low-affinity sites/low accessibility regions, the authors may wish to mention the recent debates on this (https://www.nature.com/articles/s41586-025-08916-0; https://www.biorxiv.org/content/10.1101/2025.10.12.681120v1). 

      We have now included a discussion of this point and referenced both papers.

      (7) It may be worth expanding figure legends a bit, because the definitions of some of the terms mentioned on the figures are not very easy to find in the text. 

      We have endeavoured to define all relevant terms in the figure legends. 

      Reviewer #2 (Public review): 

      (1) Figure 2 shows heat maps of RNA-seq results following a time course of CHD4 depletion (0, 1, 2 hours...). Usually, the red/blue colour scale is used to visualise differential expression (fold-difference). Here, genes are coloured in red or blue even at the 0-hour time point. This confused me initially until I discovered that instead of fold-difference, a z-score is plotted. I do not quite understand what it means when a gene that is coloured blue at the 0-hour time point changes to red at a later time point. Does this always represent an upregulation? I think this figure requires a better explanation. 

      The heatmap displays z-scores, meaning expression for each gene has been centred and scaled across the entire time course. As a result, time zero is not a true baseline; it simply shows whether the gene’s expression at that moment is above or below its own mean. A transition from blue to red therefore indicates that the gene increases relative to its overall average, which typically corresponds to upregulation, but it does not directly represent fold-change from the 0-hour time point. We have now included a brief explanation of this in the figure legend to make this point clear.
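      A short sketch may help show why the 0-hour column is coloured at all. The expression values and the helper `zscore_row` are hypothetical, and the use of the population standard deviation is an assumption (the actual pipeline may use the sample standard deviation or a different scaling).

```python
from statistics import mean, pstdev

def zscore_row(values):
    """Centre and scale one gene's expression across its own time course.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical expression of one steadily upregulated gene at 0, 1, 2, 3, 4 h.
expression = [10.0, 12.0, 18.0, 24.0, 26.0]
z = zscore_row(expression)

# z[0] is negative ("blue") only because time zero lies below the gene's own
# mean across the time course, not because the gene is repressed at 0 h.
```

      For this monotonically increasing gene the z-scores run from negative to positive, so a blue-to-red transition corresponds to upregulation even though no value is expressed as a fold-change relative to 0 hours.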

      (2) Figure 5D: NANOG, SOX2 binding at the KLF4 locus. The authors state that the enhancers 68, 57, and 55 show a gain in NANOG and SOX2 enrichment "from 30 minutes of CHD4 depletion". This is not obvious to me from looking at the figure. I can see an increase in signal from "WT" (I am assuming this corresponds to the 0 hours time point) to "30m", but then the signals seem to go down again towards the 4h time point. Can this be quantified? Can the authors discuss why TF binding seems to increase only temporarily (if this is the case)? 

      We have edited the text to more accurately reflect what is going on in the screen shot. We have also replaced “WT” with “0” as this more accurately reflects the status of these cells. 

      (3) There is no real discussion of HOW CHD4/NuRD counteracts TF binding (i.e. by what molecular mechanism). I understand that the data does not really inform us on this. Still, I believe it would be worthwhile for the authors to discuss some ideas, e.g., local nucleosome sliding vs. a direct (ATP-dependent?) action on the TF itself. 

      We now include more speculation on this point in the Discussion.

      Reviewer #3 (Public review): 

      The main weakness can be summarised as relating to the fact that authors interpret all rapid changes following CHD4 degradation as being a direct effect of the loss of CHD4 activity. The possibility that rapid indirect effects arise does not appear to have been given sufficient consideration. This is especially pertinent where effects are reported at sites where CHD4 occupancy is initially low. 

      We acknowledge that we cannot definitively say any effect is a direct consequence of CHD4 depletion and have tempered the relevant statements in the Results and Discussion. 

      Reviewing Editor Comments: 

      I am pleased to say all three experts had very complementary and complimentary comments on your paper - congratulations. Reviewer 3 does suggest toning down a few interpretations, which I suggest would help focus the manuscript on its greater strengths. I encourage a quick revision to this point, which will not go back to reviewers, before you request a version of record. I would also like to take this opportunity to thank all three reviewers for excellent feedback on this paper. 

      As advised, we have addressed the points raised by the reviewers.

    1. I also realized that if design was problem solving, then we all design to some degree. When you rearrange your room to better access your clothes, you’re doing interior design. When you create a sign to remind your roommates about their chores, you’re doing information design. When you make a poster or a sign for a club, you’re doing graphic design. We may not do any of these things particularly well or with great expertise, but each of these is a design enterprise that has the capacity for expertise and skill

      I like how this reading reframed design as problem-solving rather than just visuals, because I used to think design was mostly about how things look. I also agree with the idea that everyone designs in some way, even if it isn’t professional, because it makes design feel less exclusive and more like a skill anyone can grow. The discussion about power and design justice stood out to me, and it made me think more about who gets left out when only certain people make decisions for everyone else.

    1. Harassment can also be done through crowds. Crowd harassment has also always been a part of culture, such as riots, mob violence, revolts, revolution, government persecution, etc. Social media then allows new ways for crowd harassment to occur. Crowd harassment includes all the forms of individual harassment we already mentioned (like bullying, stalking, etc.), but done by a group of people. Additionally, we can consider the following forms of crowd harassment:

      I've seen many instances where people on social media band together to harass individuals or businesses. Often it's because these people or businesses did something to provoke it (such as doing something offensive or offending a customer), but sometimes it's purely because someone posted a video, story, tweet, etc. telling people to go harass that person, and people then jump on the bandwagon. For example, an influencer may see that someone has posted a critique of them online and send their fans to harass that person. While this usually isn't a situation where one side is completely in the right, I think online harassment as a whole is morally wrong.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary

      This work provides important new evidence of the cognitive and neural mechanisms that give rise to feelings of shame and guilt, as well as their transformation into compensatory behavior. The authors use a well-designed interpersonal task to manipulate responsibility and harm, eliciting varying levels of shame and guilt in participants. The study combines behavioral, computational, and neuroimaging approaches to offer a comprehensive account of how these emotions are experienced and acted upon. Notably, the findings reveal distinct patterns in how harm and responsibility contribute to guilt and shame and how these factors are integrated into compensatory decision-making.

      Strengths

      (1) Investigating both guilt and shame in a single experimental framework allows for a direct comparison of their behavioral and neural effects while minimizing confounds.

      (2) The study provides a novel contribution to the literature by exploring the neural bases underlying the conversion of shame into behavior.

      (3) The task is creative and ecologically valid, simulating a realistic social situation while retaining experimental control.

      (4) Computational modeling and fMRI analysis yield converging evidence for a quotient-based integration of harm and responsibility in guiding compensatory behavior.

      We are grateful for your thoughtful summary of our work’s strengths and greatly appreciate these positive words.

      We would like to note that, in accordance with the journal’s requirements, we have uploaded both a clean version of the revised manuscript and a version with all modifications highlighted in blue.

      Weakness

      (1) Post-experimental self-reports rely both on memory and on the understanding of the conceptual difference between the two emotions. Additionally, it is unclear whether the 16 scenarios were presented in random order; sequential presentation could have introduced contrast effects or demand characteristics.

      Thank you for pointing out the two limitations of the experimental paradigm. We fully agree with your point. Participants recalled and reported their feelings of guilt and shame immediately after completing the task, which likely ensured reasonably accurate state reports. We acknowledge, however, that in-task assessments might provide greater precision. We opted against them to examine altruistic decision-making in a more natural context, as in-task assessments could have heightened participants’ awareness of guilt and shame and biased their altruistic decisions. Post-task assessments also reduced fMRI scanning time, minimizing discomfort from prolonged immobility and thereby preserving data quality.

      In the present study, assessing guilt and shame required participants to distinguish conceptually between the two emotions. Most research with adult participants has adopted this approach, relying on direct self-reports of emotional intensity under the assumption that adults can differentiate between guilt and shame (Michl et al., 2014; Wagner et al., 2011; Zhu et al., 2019). However, we acknowledge that this approach may be less suitable for studies involving children, who may not yet have a clear understanding of the distinction between guilt and shame.

      The limitations have been added into the Discussion section (Page 47): “This research has several limitations. First, post-task assessments of guilt and shame, unlike in-task assessments, rely on memory and may thus be less precise, although in-task assessments could have heightened participants’ awareness of these emotions and biased their decisions. Second, our measures of guilt and shame depend on participants’ conceptual understanding of the two emotions. While this is common practice in studies with adult participants (Michl et al., 2014; Wagner et al., 2011; Zhu et al., 2019), it may be less appropriate for research involving children.”

      We apologize for the confusion. The 16 scenarios were presented in a random order. We have clarified this in the revised manuscript (Page 13): “After the interpersonal game, the outcomes of the experimental trials were re-presented in a random order.”

      (2) In the neural analysis of emotion sensitivity, the authors identify brain regions correlated with responsibility-driven shame sensitivity and then use those brain regions as masks to test whether they were more involved in the responsibility-driven shame sensitivity than the other types of emotion sensitivity. I wonder if this is biasing the results. Would it be better to use a cross-validation approach? A similar issue might arise in "Activation analysis (neural basis of compensatory sensitivity)." 

      Thank you for this valuable comment. We replaced the original analyses with a leave-one-subject-out (LOSO) cross-validation approach, which minimizes bias in secondary tests due to non-independence (Esterman et al., 2010). The findings were largely consistent with the original results, except that two previously significant effects became marginally significant (one effect changed from P = 0.012 to P = 0.053; the other from P = 0.044 to P = 0.062). Although we believe the new results do not alter our main conclusions, marginally significant findings should be interpreted with caution. We have noted this point in the Discussion section (Page 48): “… marginally significant results should be viewed cautiously and warrant further examination in future studies with larger sample sizes.”

      In the revised manuscript, we have described the cross-validation procedure in detail and reported the corresponding results. Please see the Method section, Page 23: “The results showed that the neural responses in the temporoparietal junction/superior temporal sulcus (TPJ/STS) and precentral cortex/postcentral cortex/supplementary motor area (PRC/POC/SMA) were negatively correlated with the responsibility-driven shame sensitivity. To test whether these regions were more involved in responsibility-driven shame sensitivity than in other types of emotion sensitivity, we implemented a leave-one-subject-out (LOSO) cross-validation procedure (e.g., Esterman et al., 2010). In each fold, clusters in the TPJ/STS and PRC/POC/SMA showing significant correlations with responsibility-driven shame sensitivity were identified at the group level based on N-1 participants. These clusters, defined as regions of interest (ROI), were then applied to the left-out participant, from whom we extracted the mean parameter estimates (i.e., neural response values). If, in a given fold, no suprathreshold cluster was detected within the TPJ/STS or PRC/POC/SMA after correction, or if the two regions merged into a single cluster that could not be separated, the corresponding value was coded as missing. Repeating this procedure across all folds yielded an independent set of ROI-based estimates for each participant. In the LOSO cross-validation procedure, the TPJ/STS and PRC/POC/SMA merged into a single inseparable cluster in two folds, and no suprathreshold cluster was detected within the TPJ/STS in one fold. These instances were coded as missing, resulting in valid data from 39 participants for the TPJ/STS and 40 participants for the PRC/POC/SMA.
      We then correlated these estimates with all four types of emotion sensitivities and compared the correlation with responsibility-driven shame sensitivity against those with the other sensitivities using Z tests (Pearson and Filon's Z).” and Page 24: “To directly test whether these regions were more involved in one of the two types of compensatory sensitivity, we applied the same LOSO cross-validation procedure described above. In this procedure, no suprathreshold cluster was detected within the LPFC in one fold and within the TP in 27 folds. These cases were coded as missing, resulting in valid data from 42 participants for the bilateral IPL, 41 participants for the LPFC, and 15 participants for the TP. The limited sample size for the TP likely reflects that its effect was only marginally above the correction threshold, such that the reduced power in cross-validation often rendered it nonsignificant. Because the sample size for the TP was too small and the results may therefore be unreliable, we did not pursue further analyses for this region. The independent ROI-based estimates were then correlated with both guilt-driven and shame-driven compensatory sensitivities, and the strength of the correlations was compared using Z tests (Pearson and Filon's Z).”
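      The logic of the LOSO procedure described above can be sketched in a few lines. This is a toy illustration, not the analysis pipeline used in the study: the data are invented, and the helpers `select_roi` and `loso_estimates` are hypothetical stand-ins. In particular, the ROI here is defined by a simple mean threshold, whereas the manuscript defines it from corrected group-level correlations with a sensitivity measure.

```python
from statistics import mean

def select_roi(training, threshold):
    """Toy group-level selection: keep the voxel indices whose mean value
    across the training participants exceeds the threshold."""
    n_voxels = len(training[0])
    return [v for v in range(n_voxels)
            if mean(p[v] for p in training) > threshold]

def loso_estimates(data, threshold):
    """For each participant, define the ROI from the other N-1 participants
    and extract the left-out participant's mean value within that ROI."""
    estimates = []
    for i, left_out in enumerate(data):
        training = data[:i] + data[i + 1:]
        roi = select_roi(training, threshold)
        if not roi:
            # No suprathreshold voxel in this fold: code the value as missing.
            estimates.append(None)
        else:
            estimates.append(mean(left_out[v] for v in roi))
    return estimates

# Hypothetical data: 4 participants x 3 voxels.
data = [
    [0.9, 0.1, 0.8],
    [1.1, 0.0, 0.7],
    [1.0, 0.2, 0.9],
    [0.8, 0.1, 1.0],
]
estimates = loso_estimates(data, threshold=0.5)
valid = [e for e in estimates if e is not None]
```

      Because each participant's estimate comes from an ROI defined without that participant's data, the subsequent correlations are not biased by the selection step; folds coded as missing simply reduce the effective sample size, as reported for the TP.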

      Please see the Results section, Pages 34 and 35: “To assess whether these brain regions were specifically involved in responsibility-driven shame sensitivity, we compared the Pearson correlations between their activity and all types of emotion sensitivities. The results demonstrated the domain specificity of these regions, by revealing that the TPJ/STS cluster had significantly stronger negative responses to responsibility-driven shame sensitivity than to responsibility-driven guilt sensitivity (Z = 2.44, P = 0.015) and harm-driven shame sensitivity (Z = 3.38, P < 0.001), and a marginally stronger negative response to harm-driven guilt sensitivity (Z = 1.87, P = 0.062) (Figure 4C; Supplementary Table 14). In addition, the sensorimotor areas (i.e., precentral cortex (PRC), postcentral cortex (POC), and supplementary motor area (SMA)) exhibited a similar activation pattern to the TPJ/STS (Figure 4B and 4C; Supplementary Tables 13 and 14).” and Page 35: “The results revealed that the left LPFC was more engaged in shame-driven compensatory sensitivity (Z = 1.93, P = 0.053), as its activity showed a marginally stronger positive correlation with shame-driven sensitivity than with guilt-driven sensitivity (Figure 5C). No significant difference was found in the Pearson correlations between the activity of the bilateral IPL and the two types of sensitivities (Supplementary Table 16). For the TP, the effective sample size was too small to yield reliable results (see Methods).”
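      For readers unfamiliar with the test, the following sketch shows one common form of Pearson and Filon's Z for two dependent correlations that share a variable (here, the neural estimate j correlated with sensitivities k and h). It is an assumption that this overlapping-correlations variant is the one the authors applied; the function names and the example inputs are hypothetical and the inputs are not values from the study.

```python
import math

def pearson_filon_z(r_jk, r_jh, r_kh, n):
    """Z statistic comparing r_jk with r_jh when both share variable j.

    r_kh is the correlation between the two non-shared variables and n is
    the number of participants contributing to all three correlations.
    """
    k = (r_kh * (1 - r_jk**2 - r_jh**2)
         - 0.5 * r_jk * r_jh * (1 - r_jk**2 - r_jh**2 - r_kh**2))
    denom = math.sqrt((1 - r_jk**2)**2 + (1 - r_jh**2)**2 - 2 * k)
    return math.sqrt(n) * (r_jk - r_jh) / denom

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical inputs: a region correlates at -0.5 with one sensitivity and
# at -0.1 with another; the two sensitivities correlate at 0.3; n = 39.
z = pearson_filon_z(r_jk=-0.5, r_jh=-0.1, r_kh=0.3, n=39)
p = two_sided_p(z)
```

      The test accounts for the fact that the two correlations are computed on the same participants and share one variable, which is why it needs the third correlation r_kh in addition to the two being compared.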

      (1) Regarding the traits of guilt and shame, I appreciate using the scores from the subscales (evaluations and action tendencies) separately for the analyses (instead of a composite score). An issue with using the actions subscales when measuring guilt and shame proneness is that the behavioral tendencies for each emotion get conflated with their definitions, risking circularity. It is reassuring that the behavior evaluation subscale was significantly correlated with compensatory behavior (not only the action tendencies subscale). However, the absence of significant neural correlates for the behavior evaluation subscale raises questions: Do the authors have thoughts on why this might be the case, and any implications?

      We are grateful for this important comment. According to the Guilt and Shame Proneness Scale, trait guilt comprises two dimensions: negative behavior evaluations and repair action tendencies (Cohen et al., 2011). Behaviorally, both dimensions were significantly correlated with participants’ compensatory behavior (negative behavior evaluations: R = 0.39, P = 0.010; repair action tendencies: R = 0.33, P = 0.030). Neurally, while repair action tendencies were significantly associated with activity in the aMCC and other brain areas, negative behavior evaluations showed no significant neural correlates. The absence of significant neural correlates for negative behavior evaluations may be due to several factors. In addition to common explanations (e.g., limited sample size reducing the power to detect weak neural correlates or subtle effects obscured by fMRI noise), another possibility is that this dimension influences neural responses indirectly through intermediate processes not captured in our study (e.g., specific motivational states). We have added a discussion of the non-significant result to the revised manuscript (Page 47): “However, the neural correlates of negative behavior evaluations (another dimension of trait guilt) were absent. The reasons underlying the non-significant neural finding may be multifaceted. One possibility is that negative behavior evaluations influence neural responses indirectly through intermediate processes not captured in our study (e.g., specific motivational states).”

      In addition, to avoid misunderstanding, the revised manuscript specifies at the appropriate places that the neural findings pertain to repair action tendencies rather than to trait guilt in general. For instance, see Pages 46 and 47: “Furthermore, we found neural responses in the aMCC mediated the relationship between repair action tendencies (one dimension of trait guilt) and compensation… Accordingly, our fMRI findings suggest that individuals with stronger tendency to engage in compensation across various moral violation scenarios (indicated by their repair action tendencies) are more sensitive to the severity of the violation and therefore engage in greater compensatory behavior.”

      (2) Regarding the computational model finding that participants seem to disregard self-interest, do the authors believe it may reflect the relatively small endowment at stake? Do the authors believe this behavior would persist if the stakes were higher?

      Additionally, might the type of harm inflicted (e.g., electric shock vs. less stigmatized/less ethically charged harm like placing a hand in ice-cold water) influence the weight of self-interest in decision-making?

      Taken together, the conclusions of the paper are well supported by the data. It would be valuable for future studies to validate these findings using alternative tasks or paradigms to ensure the robustness and generalizability of the observed behavioral and neural mechanisms.

      Thank you for these important questions. As you suggested, we believe that the relatively small personal stakes in our task (a maximum loss of 5 Chinese yuan) likely explain why the computational model indicated that participants disregarded self-interest. We also agree that when the harm to others is less morally charged, people may be more inclined to consider self-interest in compensatory decision-making. Overall, the more stigmatized the harm and the smaller the personal stakes, the more likely individuals are to disregard self-interest and focus solely on making appropriate compensation.

      We have added the following passage to the Discussion section (Page 42): “Notably, in many computational models of social decision-making, self-interest plays a crucial role (e.g., Wu et al., 2024). However, our computational findings suggest that participants disregarded self-interest during compensatory decision-making. A possible explanation is that the personal stakes in our task were relatively small (a maximum loss of 5 Chinese yuan), whereas the harm inflicted on the receiver was highly stigmatized (i.e., an electric shock). Under conditions where the harm is highly salient and the cost of compensation is low, participants may be inclined to disregard self-interest and focus solely on making appropriate compensation.”

      Reviewer #2 (Public review):

      Summary

      The authors combined behavioral experiments, computational modeling, and functional magnetic resonance imaging (fMRI) to investigate the psychological and neural mechanisms underlying guilt, shame, and the altruistic behaviors driven by these emotions. The results revealed that guilt is more strongly associated with harm, whereas shame is more closely linked to responsibility. Compared to shame, guilt elicited a higher level of altruistic behavior. Computational modeling demonstrated how individuals integrate information about harm and responsibility. The fMRI findings identified a set of brain regions involved in representing harm and responsibility, transforming responsibility into feelings of shame, converting guilt and shame into altruistic actions, and mediating the effect of trait guilt on compensatory behavior.

      Strengths

      This study offers a significant contribution to the literature on social emotions by moving beyond prior research that typically focused on isolated aspects of guilt and shame. The study presents a comprehensive examination of these emotions, encompassing their cognitive antecedents, affective experiences, behavioral consequences, trait-level characteristics, and neural correlates. The authors have introduced a novel experimental task that enables such a systematic investigation and holds strong potential for future research applications. The computational modeling procedures were implemented in accordance with current field standards. The findings are rich and offer meaningful theoretical insights. The manuscript is well written, and the results are clearly and logically presented.

      We are thankful for your considerate acknowledgment of our work’s strengths and truly value your positive comments.

      We would like to note that, in accordance with the journal’s requirements, we have uploaded both a clean version of the revised manuscript and a version with all modifications highlighted in blue.

      Weakness

      In this study, participants' feelings of guilt and shame were assessed retrospectively, after they had completed all altruistic decision-making tasks. This reliance on memory-based self-reports may introduce recall bias, potentially compromising the accuracy of the emotion measurements.

      Thank you for this crucial comment. We fully agree that measuring guilt and shame after the task may affect accuracy to some extent. However, because participants reported their emotions immediately after completing the task, we believe their recollections were reasonably accurate. In designing the experiment, we considered in-task assessments, but this approach risked heightening participants’ awareness of guilt and shame and thereby interfering with compensatory decisions. After careful consideration, we ultimately chose post-task assessments of these emotions. A similar approach has been adopted in prior research on gratitude, where post-task assessments were also used (Yu et al., 2018).

      In the revised manuscript, we have specified the limitations of both post-task and in-task assessments of guilt and shame (Page 47): “… post-task assessments of guilt and shame, unlike in-task assessments, rely on memory and may thus be less precise, although in-task assessments could have heightened participants’ awareness of these emotions and biased their decisions.”

      In many behavioral economic models, self-interest plays a central role in shaping individual decision-making, including moral decisions. However, the model comparison results in this study suggest that models without a self-interest component (such as Model 1.3) outperform those that incorporate it (such as Model 1.1 and Model 1.2). The authors have not provided a satisfactory explanation for this counterintuitive finding. 

      Thank you for this important comment. In the revised manuscript, we have provided a possible explanation (Page 42): “Notably, in many computational models of social decision-making, self-interest plays a crucial role (e.g., Wu et al., 2024). However, our computational findings suggest that participants disregarded self-interest during compensatory decision-making. A possible explanation is that the personal stakes in our task were relatively small (a maximum loss of 5 Chinese yuan), whereas the harm inflicted on the receiver was highly stigmatized (i.e., an electric shock). Under conditions where the harm is highly salient and the cost of compensation is low, participants may be inclined to disregard self-interest and focus solely on making appropriate compensation.”

      The phrases "individuals integrate harm and responsibility in the form of a quotient" and "harm and responsibility are integrated in the form of a quotient" appear in the Abstract and Discussion sections. However, based on the results of the computational modeling, it is more accurate to state that "harm and the number of wrongdoers are integrated in the form of a quotient." The current phrasing misleadingly suggests that participants represent information as harm divided by responsibility, which does not align with the modeling results. This potentially confusing expression should be revised for clarity and accuracy.

      We sincerely thank you for this helpful suggestion and apologize for the confusion caused. We have removed expressions such as “harm and responsibility are integrated in the form of a quotient” from the manuscript. Instead, we now state more precisely that “harm and the number of wrongdoers are integrated in the form of a quotient.”

      However, in certain contexts we continue to discuss harm and responsibility. Introducing “the number of wrongdoers” in these places would appear abrupt, so we have opted for alternative phrasing. For example, on Page 3, we now write:

      “Computational modeling results indicated that the integration of harm and responsibility by individuals is consistent with the phenomenon of responsibility diffusion.” Similarly, on Page 49, we state: “Notably, harm and responsibility are integrated in a manner consistent with responsibility diffusion prior to influencing guilt-driven and shame-driven compensation.”
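      The quotient form mentioned above can be made concrete with a minimal sketch. This is not the authors' computational model, which is described in the manuscript and includes fitted parameters not shown here; the function `harm_share` and the example values are hypothetical and only illustrate how dividing harm by the number of wrongdoers produces responsibility diffusion.

```python
def harm_share(harm, n_wrongdoers):
    """Individual share of the harm when it is split across wrongdoers."""
    if n_wrongdoers < 1:
        raise ValueError("at least one wrongdoer is required")
    return harm / n_wrongdoers

# Hypothetical harm level of 4 units caused by 1, 2, or 4 wrongdoers.
shares = [harm_share(4.0, n) for n in (1, 2, 4)]
```

      With the harm held constant, the share attributed to each individual falls as the number of wrongdoers rises, which is the pattern the phrase "responsibility diffusion" refers to.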

      In the Discussion, the authors state: "Since no brain region associated with social cognition showed significant responses to harm or responsibility, it appears that the human brain encodes a unified measure integrating harm and responsibility (i.e., the quotient) rather than processing them as separate entities when both are relevant to subsequent emotional experience and decision-making." However, this interpretation overstates the implications of the null fMRI findings. The absence of significant activation in response to harm or responsibility does not necessarily imply that the brain does not represent these dimensions separately. Null results can arise from various factors, including limitations in the sensitivity of fMRI. It is possible that more fine-grained techniques, such as intracranial electrophysiological recordings, could reveal distinct neural representations of harm and responsibility. The interpretation of these null findings should be made with greater caution.

      Thank you for this reminder. In the revised manuscript, we have provided a more cautious interpretation of the results (Page 43): “Although the fMRI findings revealed that no brain region associated with social cognition showed significant responses to harm or responsibility, this does not suggest that the human brain encodes only a unified measure integrating harm and responsibility and does not process them as separate entities. Using more fine-grained techniques, such as intracranial electrophysiological recordings, it may still be possible to observe independent neural representations of harm and responsibility.”

      Reviewer #3 (Public review):

      Summary

      Zhu et al. set out to elucidate how the moral emotions of guilt and shame emerge from specific cognitive antecedents - harm and responsibility - and how these emotions subsequently drive compensatory behavior. Consistent with their prediction derived from functionalist theories of emotion, their behavioral findings indicate that guilt is more influenced by harm, whereas shame is more influenced by responsibility. In line with previous research, their results also demonstrate that guilt has a stronger facilitating effect on compensatory behavior than shame. Furthermore, computational modeling and neuroimaging results suggest that individuals integrate harm and responsibility information into a composite representation of the individual's share of the harm caused. Brain areas such as the striatum, insula, temporoparietal junction, lateral prefrontal cortex, and cingulate cortex were implicated in distinct stages of the processing of guilt and/or shame. In general, this work makes an important contribution to the field of moral emotions. Its impact could be further enhanced by clarifying methodological details, offering a more nuanced interpretation of the findings, and discussing their potential practical implications in greater depth.

      Strengths

      First, this work conceptualizes guilt and shame as processes unfolding across distinct stages (cognitive appraisal, emotional experience, and behavioral response) and investigates the psychological and neural characteristics associated with their transitions from one stage to the next.

      Second, the well-designed experiment effectively manipulates harm and responsibility - two critical antecedents of guilt and shame.

      Third, the findings deepen our understanding of the mechanisms underlying guilt and shame beyond what has been established in previous research.

      We truly appreciate your acknowledgment of our work’s strengths and your encouraging feedback.

      We would like to note that, in accordance with the journal’s requirements, we have uploaded both a clean version of the revised manuscript and a version with all modifications highlighted in blue.

      Weakness

      Over the course of the task, participants may gradually become aware of their high error rate in the dot estimation task. This could lead them to discount their own judgments and become inclined to rely on the choices of other deciders. It is unclear whether participants in the experiment had the opportunity to observe or inquire about others' choices. This point is important, as the compensatory decision-making process may differ depending on whether choices are made independently or influenced by external input.

      Thank you for pointing this out. We apologize for not making the experimental procedure sufficiently clear. Participants (as deciders) were informed that each decider performed the dot estimation independently and was unaware of the estimations made by the other deciders. We have now clarified this point in the revised manuscript (Pages 10 and 11): “Each decider indicated whether the number of dots was more than or less than 20 based on their own estimation by pressing a corresponding button (dots estimation period, < 2.5 s) and was unaware of the estimations made by other deciders”.

      Given the inherent complexity of human decision-making, it is crucial to acknowledge that, although the authors compared eight candidate models, other plausible alternatives may exist. As such, caution is warranted when interpreting the computational modeling results.

      Thank you for this comment. We fully agree with your opinion. Although we tried to build a conceptually comprehensive model space based on prior research and our own understanding, we did not include all plausible models, nor would it be feasible to do so. We acknowledge it as a limitation in the revised manuscript (Page 47): “... although we aimed to construct a conceptually comprehensive computational model space informed by prior research and our own understanding, it does not encompass all plausible models. Future research is encouraged to explore additional possibilities.”

      I do not agree with the authors' claim that "computational modeling results indicated that individuals integrate harm and responsibility in the form of a quotient" (i.e., harm/responsibility). Rather, the findings appear to suggest that individuals may form a composite representation of the harm attributable to each individual (i.e., harm/the number of people involved). The explanation of the modeling results ought to be precise.

      We appreciate your comment and apologize for the imprecise description. In the revised manuscript, we now use the expressions “… integrate harm and the number of wrongdoers in the form of a quotient.” and “… the integration of harm and responsibility by individuals is consistent with the phenomenon of responsibility diffusion.” For example, on Page 19, we state: “It assumes that individuals neglect their self-interest, have a compensatory baseline, and integrate harm and the number of wrongdoers in the form of a quotient.” On Page 3, we state: “Computational modeling results indicated that the integration of harm and responsibility by individuals is consistent with the phenomenon of responsibility diffusion.”
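As an illustration only, the following Python sketch shows what integrating harm and the number of wrongdoers "in the form of a quotient" could look like. The function names, the linear mapping, and the `baseline`, `weight`, and `endowment` parameters are hypothetical; this is not the fitted model reported in the manuscript.

```python
def harm_share(harm: float, n_wrongdoers: int) -> float:
    """Quotient integration: total harm divided by the number of wrongdoers."""
    if n_wrongdoers < 1:
        raise ValueError("n_wrongdoers must be at least 1")
    return harm / n_wrongdoers


def predicted_compensation(harm: float, n_wrongdoers: int,
                           baseline: float = 0.0, weight: float = 1.0,
                           endowment: float = 20.0) -> float:
    """Hypothetical linear mapping from harm share to compensation.

    `baseline`, `weight`, and `endowment` are illustrative values (assumptions),
    and the result is clipped to the range [0, endowment].
    """
    raw = baseline + weight * harm_share(harm, n_wrongdoers)
    return max(0.0, min(endowment, raw))
```

Under this sketch, the same total harm yields a smaller per-person share, and hence lower predicted compensation, as the number of wrongdoers grows. This is the pattern that responsibility diffusion would predict.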

      Many studies have reported positive associations between trait gratitude, social value orientation, and altruistic behavior. It would be helpful if the authors could provide an explanation about why this study failed to replicate these associations.

      Thanks a lot for this important comment. We have now added an explanation into the revised manuscript (Page 47): “Although previous research has found that trait gratitude and SVO are significantly associated with altruistic behavior in contexts such as donation (Van Lange et al., 2007; Yost-Dubrow & Dunham, 2018) and reciprocity (Ma et al., 2017; Yost-Dubrow & Dunham, 2018), their associations with compensatory decisions in the present study were not significant. This suggests that the effects of trait gratitude and SVO on altruistic behavior are context-dependent and may not predict all forms of altruistic behavior.”

      As the authors noted, guilt and shame are closely linked to various psychiatric disorders. It would be valuable to discuss whether this study has any implications for understanding or even informing the treatment of these disorders.

      We are grateful for this advice. Although our study did not directly examine patients with psychological disorders, the findings offer insights into the regulation of guilt and shame. As these emotions are closely linked to various disorders, improving their regulation may help alleviate related symptoms. Accordingly, we have added a paragraph highlighting the potential clinical relevance (Pages 48 and 49): “Our study has potential practical implications. The behavioral findings may help counselors understand how cognitive interventions targeting perceptions of harm and responsibility could influence experiences of guilt and shame. The neural findings highlight specific brain regions (e.g., TPJ) as potential intervention targets for regulating these emotions. Given the close links between guilt, shame, and various psychological disorders (e.g., Kim et al., 2011; Lee et al., 2001; Schuster et al., 2021), strategies to regulate these emotions may contribute to symptom alleviation. Nevertheless, because this study was conducted with healthy adults, caution is warranted when considering applications to other populations.”

      Reviewer #1 (Recommendations for the authors):

      (1) Would it be interesting to explore other categories of behavior apart from compensatory behavior?

      Thanks a lot for this insightful question. We focused on a classic form of altruistic behavior, compensation. Future studies are encouraged to adapt our paradigm to examine other behaviors associated with guilt and/or shame, such as donation (Xu, 2022), avoidance (Shen et al., 2023), or aggression (Velotti et al., 2014). Please see Page 48: “Future research could combine this paradigm with other cognitive neuroscience methods, such as electroencephalography (EEG) or magnetoencephalography (MEG), and adapt it to investigate additional behaviors linked to guilt and shame, including donation (Xu, 2022), avoidance (Shen et al., 2023), and aggression (Velotti et al., 2014).”

      (2) Did the computational model account for the position of the block (slider) at the start of each decision-making response (when participants had to decide how to divide the endowment)? Or are anchoring effects not relevant/ not a concern?

      Thank you for this interesting question. In our task, the initial position of the slider was randomized across trials, and participants were explicitly informed of this in the instructions. This design minimized stable anchoring effects across trials, as participants could not rely on a consistent starting point. Although anchoring might still have influenced individual trial responses, we believe it is unlikely that such effects systematically biased our results, since randomization would tend to cancel them out across trials. Additionally, prior research has shown that when multiple anchors are presented, anchoring effects are reduced if the anchors contradict each other (Switzer III & Sniezek, 1991). Therefore, we did not attempt to model potential anchoring effects. Nevertheless, future research could systematically manipulate slider starting positions to directly examine possible anchoring influences. In the revised manuscript, we have added a brief clarification (Page 11): “The initial position of the block was randomized across trials, which helped minimize stable anchoring effects across trials.”

      (3) Was there a real receiver who experienced the shocks and received compensation? I think it is not completely clear in the paper.

      We are sorry for not making this clear enough. The receiver was fictitious and did not actually exist. We have supplemented the Methods section with the following description (Page 12): “We told the participant a cover story that the receiver was played by another college student who was not present in the laboratory at the time. … In fact, the receiver did not actually exist.”

      (4) What was the rationale behind not having participants meet the receiver?

      Thank you for this question. Having participants meet the receiver (i.e., the victim), played by a confederate, might have intensified their guilt and shame and produced a ceiling effect. In addition, the current approach simplified the experimental procedure and removed the need to recruit an additional confederate. These reasons have been added to the Methods section (Page 12): “Not having participants meet the receiver helped prevent excessive guilt and shame that might produce a ceiling effect, while also eliminating the need to recruit an additional confederate.”

      Minor edits:

      (1) Line 49: "the cognitive assessment triggers them", I think a word is missing.

      (2) Line 227: says 'Slide' instead of 'Slider'.

      (3) Lines 867/868: "No brain response showed significant correlation with responsibility-driven guilt sensitivity, harm-driven shame sensitivity, or responsibility-driven shame sensitivity." I think it should be harm-driven guilt sensitivity, responsibility-driven guilt sensitivity, and harm-driven shame sensitivity.

      (4) Supplementary Information Line 12: I think there is a typo ( 'severs' instead of 'serves')

      We sincerely thank you for patiently pointing out these typos. We have corrected them accordingly. 

      (1) “the cognitive assessment triggers them” has been revised to “the cognitive antecedents that trigger them” (Page 2).

      (2) “SVO Slide Measure” has been revised to “SVO Slider Measure” (Page 8).

      (3) “No brain response showed significant correlation with responsibility-driven guilt sensitivity, harm-driven shame sensitivity, or responsibility-driven shame sensitivity.” has been revised to “No brain response showed significant correlation with harm-driven guilt sensitivity, responsibility-driven guilt sensitivity, and harm-driven shame sensitivity.” (Page 35).

      (4) “severs” has been revised to “serves” (see Supplementary Information). In addition, we have carefully checked the entire manuscript to correct any remaining typographical errors.

      Reviewer #2 (Recommendations for the authors):

      The statement that trait gratitude and SVO were measured "for exploratory purposes" would benefit from further clarification regarding the specific questions being explored.

      Thank you for this valuable suggestion. In the revised manuscript, we have illustrated the exploratory purposes (Page 9): “We measured trait gratitude and SVO for exploratory purposes. Previous research has shown that both are linked to altruistic behavior, particularly in donation contexts (Van Lange et al., 2007; Yost-Dubrow & Dunham, 2018) and reciprocity contexts (Ma et al., 2017; Yost-Dubrow & Dunham, 2018). Here, we explored whether they also exert significant effects in a compensatory context.”

      In the Methods section, the authors state: "To confirm the relationships between κ and guilt-driven and shame-driven compensatory sensitivities, we calculated the Pearson correlations between them." However, the Results section reports linear regression results rather than Pearson correlation coefficients, suggesting a possible inconsistency. The authors are advised to carefully check and clarify the analysis approach used.

      We thank you for the careful review and apologize for this mistake. We used a linear mixed-effects regression instead of Pearson correlations for the analysis. The mistake has been corrected (Page 25): “To confirm the relationships between κ and guilt-driven and shame-driven compensatory sensitivities, we conducted a linear mixed-effects regression. κ was regressed onto guilt-driven and shame-driven compensatory sensitivities, with participant-specific random intercepts and random slopes for each fixed effect included as random effects.”

      A more detailed discussion of how the current findings inform the regulation of guilt and shame would further strengthen the contribution of this study.

      Thank you for this suggestion. We have added a paragraph discussing the implications for the regulation of guilt and shame (Pages 48 and 49): “Our study has potential practical implications. The behavioral findings may help counselors understand how cognitive interventions targeting perceptions of harm and responsibility could influence experiences of guilt and shame. The neural findings highlight specific brain regions (e.g., TPJ) as potential intervention targets for regulating these emotions. Given the close links between guilt, shame, and various psychological disorders (e.g., Kim et al., 2011; Lee et al., 2001; Schuster et al., 2021), strategies to regulate these emotions may contribute to symptom alleviation. Nevertheless, because this study was conducted with healthy adults, caution is warranted when considering applications to other populations.”

      As fMRI provides only correlational evidence, establishing a causal link between neural activity and guilt- or shame-related cognition and behavior would require brain stimulation or other intervention-based methods. This may represent a promising direction for future research.

      Thank you for this advice. We also agree that it is important for future research to establish the causal relationships between the observed brain activity, psychological processes, and behavior. We have added a corresponding discussion in the revised manuscript (Pages 47 and 48): “… fMRI cannot establish causality. Future studies using brain stimulation techniques (e.g., transcranial magnetic stimulation) are needed to clarify the causal role of brain regions in guilt-driven and shame-driven altruistic behavior.”

      Reviewer #3 (Recommendations for the authors):

      It was mentioned that emotions beyond guilt and shame, such as indebtedness, may also drive compensation. Were any additional types of emotion measured in the study?

      Thank you for this question. We did not explicitly measure emotions other than guilt and shame. However, the parameter κ from our winning computational model captures the combined influence of various psychological processes on compensation, which may reflect the impact of emotions beyond guilt and shame (e.g., indebtedness). We acknowledge that measuring other emotions similar to guilt and shame may help to better understand their distinct contributions. This point has been added into the revised manuscript (Page 48): “… we did not explicitly measure emotions similar to guilt and shame (e.g., indebtedness), which would have been helpful for understanding their distinct contributions.”

      The experimental task is complicated, raising the question of whether participants fully understood the instructions. For instance, one participant's compensation amount was zero. Could this reflect a misunderstanding of the task instructions?

      Thanks a lot for this question. In our study, after reading the instructions, participants were required to complete a comprehension test on the experimental rules. If they made any mistakes, the experimenter provided additional explanations. Only after participants fully understood the rules and correctly answered all comprehension questions did they proceed to the main experimental task. We have clarified this procedure in the revised manuscript (Page 13): “Participants did not proceed to the interpersonal game until they had fully understood the experimental rules and passed a comprehension test.”

      Making identical choices across different trials does not necessarily indicate that participants misunderstood the rules. Similar patterns, where participants made the same choices across trials, have also been observed in previous studies (Zhong et al., 2016; Zhu et al., 2021).

      References

      Cohen, T. R., Wolf, S. T., Panter, A. T., & Insko, C. A. (2011). Introducing the GASP scale: a new measure of guilt and shame proneness. Journal of Personality and Social Psychology, 100(5), 947–966. https://doi.org/10.1037/a0022641

      Esterman, M., Tamber-Rosenau, B. J., Chiu, Y. C., & Yantis, S. (2010). Avoiding nonindependence in fMRI data analysis: Leave one subject out. NeuroImage, 50(2), 572–576. https://doi.org/10.1016/j.neuroimage.2009.10.092

      Kim, S., Thibodeau, R., & Jorgensen, R. S. (2011). Shame, guilt, and depressive symptoms: A meta-analytic review. Psychological Bulletin, 137(1), 68. https://doi.org/10.1037/a0021466

      Lee, D. A., Scragg, P., & Turner, S. (2001). The role of shame and guilt in traumatic events: A clinical model of shame-based and guilt-based PTSD. British Journal of Medical Psychology, 74(4), 451–466. https://doi.org/10.1348/000711201161109

      Ma, L. K., Tunney, R. J., & Ferguson, E. (2017). Does gratitude enhance prosociality?: A meta-analytic review. Psychological Bulletin, 143(6), 601–635. https://doi.org/10.1037/bul0000103

      Michl, P., Meindl, T., Meister, F., Born, C., Engel, R. R., Reiser, M., & Hennig-Fast, K. (2014). Neurobiological underpinnings of shame and guilt: A pilot fMRI study. Social Cognitive and Affective Neuroscience, 9(2), 150–157.

      Schuster, P., Beutel, M. E., Hoyer, J., Leibing, E., Nolting, B., Salzer, S., Strauss, B., Wiltink, J., Steinert, C., & Leichsenring, F. (2021). The role of shame and guilt in social anxiety disorder. Journal of Affective Disorders Reports, 6, 100208. https://doi.org/10.1016/j.jadr.2021.100208

      Shen, B., Chen, Y., He, Z., Li, W., Yu, H., & Zhou, X. (2023). The competition dynamics of approach and avoidance motivations following interpersonal transgression. Proceedings of the National Academy of Sciences, 120(40), e2302484120. https://doi.org/10.1073/pnas.2302484120

      Switzer III, F. S., & Sniezek, J. A. (1991). Judgment processes in motivation: Anchoring and adjustment effects on judgment and behavior. Organizational Behavior and Human Decision Processes, 49(2), 208–229. https://doi.org/10.1016/0749-5978(91)90049-Y

      Van Lange, P. A. M., Bekkers, R., Schuyt, T. N. M., & Van Vugt, M. (2007). From games to giving: Social value orientation predicts donations to noble causes. Basic and Applied Social Psychology, 29(4), 375–384. https://doi.org/10.1080/01973530701665223

      Velotti, P., Elison, J., & Garofalo, C. (2014). Shame and aggression: Different trajectories and implications. Aggression and Violent Behavior, 19(4), 454–461. https://doi.org/10.1016/j.avb.2014.04.011

      Wagner, U., N’Diaye, K., Ethofer, T., & Vuilleumier, P. (2011). Guilt-specific processing in the prefrontal cortex. Cerebral Cortex, 21(11), 2461–2470. https://doi.org/10.1093/cercor/bhr016

      Wu, X., Ren, X., Liu, C., & Zhang, H. (2024). The motive cocktail in altruistic behaviors. Nature Computational Science, 4, 659–676. https://doi.org/10.1038/s43588-024-00685-6

      Xu, J. (2022). The impact of guilt and shame in charity advertising: The role of self-construal. Journal of Philanthropy and Marketing, 27(1). https://doi.org/10.1002/nvsm.1709

      Yost-Dubrow, R., & Dunham, Y. (2018). Evidence for a relationship between trait gratitude and prosocial behaviour. Cognition and Emotion, 32(2), 397–403. https://doi.org/10.1080/02699931.2017.1289153

      Yu, H., Gao, X., Zhou, Y., & Zhou, X. (2018). Decomposing gratitude: Representation and integration of cognitive antecedents of gratitude in the brain. Journal of Neuroscience, 38(21), 4886–4898. https://doi.org/10.1523/JNEUROSCI.2944-17.2018

      Zhong, S., Chark, R., Hsu, M., & Chew, S. H. (2016). Computational substrates of social norm enforcement by unaffected third parties. NeuroImage, 129, 95–104. https://doi.org/10.1016/j.neuroimage.2016.01.040

      Zhu, R., Feng, C., Zhang, S., Mai, X., & Liu, C. (2019). Differentiating guilt and shame in an interpersonal context with univariate activation and multivariate pattern analyses. NeuroImage, 186, 476–486. https://doi.org/10.1016/j.neuroimage.2018.11.012

      Zhu, R., Xu, Z., Su, S., Feng, C., Luo, Y., Tang, H., Zhang, S., Wu, X., Mai, X., & Liu, C. (2021). From gratitude to injustice: Neurocomputational mechanisms of gratitude-induced injustice. NeuroImage, 245, 118730. https://doi.org/10.1016/j.neuroimage.2021.118730

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The work used open peer reviews and followed them through a succession of reviews and author revisions. It assessed whether a reviewer had requested the author include additional citations and references to the reviewers' work. It then assessed whether the author had followed these suggestions and what the probability of acceptance was based on the author's decision.

      Strengths and weaknesses:

      The work's strengths are the in-depth and thorough statistical analysis it contains and the very large dataset it uses. The methods are robust and reported in detail. However, this is also a weakness of the work. Such thorough analysis makes it very hard to read! It's a very interesting paper with some excellent and thought provoking references but it needs to be careful not to overstate the results and improve the readability so it can be disseminated widely. It should also discuss more alternative explanations for the findings and, where possible, dismiss them.

      I have toned down the language including a more neutral title. To help focus on the main results, I have moved four paragraphs from the methods to the supplement. These are the sample size, the two sensitivity analyses on including co-reviewers and confounding by reviewers’ characteristics, and the analysis examining potential bias for the reviewers with no OpenAlex record.

      Reviewer #2 (Public review):

      Summary:

      This article examines reviewer coercion in the form of requesting citations to the reviewer's own work as a possible trade for acceptance and shows that, under certain conditions, this happens.

      Strengths:

      The methods are well done and the results support the conclusions that some reviewers "request" self-citations and may be making acceptance decisions based on whether an author fulfills that request.

      Weaknesses:

      The author needs to be more clear on the fact that, in some instances, requests for self-citations by reviewers are important and valuable.

      This is a key point. I have included a new text analysis to examine this issue and have addressed this in the updated discussion.

      Reviewer #3 (Public review):

      Summary:

      In this article, Barnett examines a pressing question regarding citing behavior of authors during the peer review process. In particular, the author studies the interaction between reviewers and authors, focusing on the odds of acceptance, and how this may be affected by whether or not the authors cited the reviewers' prior work, whether the reviewer requested such citations be added, and whether the authors complied/how that affected the reviewer decision-making.

      Strengths:

      The author uses a clever analytical design, examining four journals that use the same open peer review system, in which the identities of the authors and reviewers are both available and linkable to structured data. Categorical information about the approval is also available as structured data. This design allows a large scale investigation of this question.

      Weaknesses:

      My concerns pertain to the interpretability of the data as presented and the overly terse writing style.

      Regarding interpretability, it is often unclear what subset of the data are being used both in the prose and figures. For example, the descriptive statistics show many more Version 1 articles than Version 2+. How are the data subset among the different possible methods?

      I have now included the number of articles and reviews in the legends of each plot. There are more version 1 articles because some are “approved” at this stage and hence a second version is never submitted (I’ve now specifically mentioned this in the discussion).

      Likewise, the methods indicate that a matching procedure was used comparing two reviewers for the same manuscript in order to control for potential confounds. However, the number of reviews is less than double the number of Version 1 articles, making it unclear which data were used in the final analysis. The methods also state that data were stratified by version. This raises a question about which articles/reviews were included in each of the analyses. I suggest spending more space describing how the data are subset and stratified. This should include any conditional subsetting as in the analysis on the 441 reviews where the reviewer was not cited in Version 1 but requested a citation for Version 2. Each of the figures and tables, as well as statistics provided in the text should provide this information, which would make this paper much more accessible to the reader.

      [Note from editor: Please see "Editorial feedback" for more on this]

      The numbers are now given in every figure legend, and show the larger sample size for the first versions.

      The analysis of the 441 reviews was an unplanned analysis that is separate to the planned models. The sample size is much smaller than the main models due to the multiple conditions applied to the reviewers: i) reviewed both versions, ii) not cited in first version, iii) requested a self-citation in their first review.
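The three conditions can be read as a simple filter over review records. The Python sketch below shows that subsetting logic on made-up data; the field names and the example records are hypothetical and do not reflect the actual dataset or analysis code.

```python
# Hypothetical review records; field names are assumptions for illustration only.
reviews = [
    {"reviewer": "A", "reviewed_v1": True, "reviewed_v2": True,
     "cited_in_v1": False, "requested_own_citation_v1": True},
    {"reviewer": "B", "reviewed_v1": True, "reviewed_v2": False,
     "cited_in_v1": False, "requested_own_citation_v1": True},
    {"reviewer": "C", "reviewed_v1": True, "reviewed_v2": True,
     "cited_in_v1": True, "requested_own_citation_v1": True},
    {"reviewer": "D", "reviewed_v1": True, "reviewed_v2": True,
     "cited_in_v1": False, "requested_own_citation_v1": False},
]


def in_conditional_subset(review: dict) -> bool:
    """Apply the three conditions: (i) reviewed both versions,
    (ii) not cited in the first version, (iii) requested a citation
    to their own work in their first review."""
    return (
        review["reviewed_v1"] and review["reviewed_v2"]
        and not review["cited_in_v1"]
        and review["requested_own_citation_v1"]
    )


subset = [r for r in reviews if in_conditional_subset(r)]
```

Because every condition must hold at once, each added condition can only shrink the subset. This is why the sample is much smaller than in the main models.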

      Finally, I would caution against imputing motivations to the reviewers, despite the important findings provided here. This is because the data as presented suggest a more nuanced interpretation is warranted. First, the author observes similar patterns of accept/reject decisions whether the suggested citation is a citation to the reviewer or not (Figs 3 and 4). Second, much of the observed reviewer behavior disappears or has much lower effect sizes depending on whether "Accept with Reservations" is considered an Accept or a Reject. This is acknowledged in the results text, but largely left out of the discussion. The conditional analysis on the 441 reviews mentioned above does support a more cautious version of the conclusion drawn here, especially when considered alongside the specific comments left by reviewers that were mentioned in the results and information in Table S.3. However, I recommend toning the language down to match the strength of the data.

      I have used more cautious language throughout, including a new title. The new text analysis presented in the updated version also supports a more cautious approach.

      Reviewer #4 (Public review):

      Summary:

      This work investigates whether a citation to a referee made by a paper is associated with a more positive evaluation by that referee for that paper. It provides evidence supporting this hypothesis. The work also investigates the role of self citations by referees where the referee would ask authors to cite the referee's paper.

      Strengths:

      This is an important problem: referees for scientific papers must provide their impartial opinions rooted in core scientific principles. Any undue influence due to the role of citations breaks this requirement. This work studies the possible presence and extent of this.

      Barring a few issues discussed below, the methods are solid and well done. The work uses a matched pair design which controls for article-level confounding and further investigates robustness to other potential confounds.

      It is surprising that even in these investigated journals where referee names are public, there is prevalence of such citation-related behaviors.

      Weaknesses:

      Some overall claims are questionable:

      "Reviewers who were cited were more likely to approve the article, but only after version 1" It also appears that referees who were cited were less likely to approve the article in version 1. This null or slightly negative effect undermines the broad claim of citations swaying referees. The paper highlights only the positive results while not including the absence (and even reversal) of the effect in version 1 in its narrative.

      The reversed effect for version 1 is interesting, but the adjusted 99.4% confidence interval includes 1 and hence it’s hard to be confident that this is genuinely in the reverse direction. However, it is certainly far from the strongly positive association for versions 2+.

      "To the best of our knowledge, this is the first analysis to use a matched design when examining reviewer citations" Does not appear to be a valid claim based on the literature reference [18]

      This previous paper used a matched design but then did not use a matched analysis. Hence, I’ve changed the text in my paper to “first analysis to use a matched design and analysis”. This may seem a minor claim of novelty, but not using a matched analysis for matched data could discard much of the benefits of the matching.

      It will be useful to have a control group in the analysis associated to Figure 5 where the control group comprises matched reviews that did not ask for a self citation. This will help demarcate words associated with approval under self citation (as compared to when there is no self citation). The current narrative appears to suggest an association of the use of these words with self citations but without any control.

      Thanks for this useful suggestion. I have added a control group of reviewers who requested citations to articles other than their own. The words requested were very similar to the previous analysis, hence I’ve needed to reinterpret the results from the text analysis as “please” and “need” are not exclusively used by those requesting self-citations. I also fixed a minor error in the text analysis concerning the exclusion of abstracts shorter than 100 characters.

      More discussion on the recommendations will help:

      For the suggestion that "the reviewers initially see a version of the article with all references blinded and no reference list" the paper says "this involves more administrative work and demands more from peer reviewers". I am afraid this can also degrade the quality of peer review, given that the research cannot be contextualized properly by referees. Referees may not revert back to all their thoughts and evaluations when references are released afterwards.

      This is an interesting point, but I don’t think it’s certain that this would happen. For example, revisiting the review may provide a fresh perspective and new ideas; this sometimes happens for me when I review the second version of an article. Ideally an experiment is needed to test this approach, as it is difficult to predict how authors and reviewers will react.

      Recommendations for the Authors:

      Editorial feedback:

      I wonder if the article would benefit from a shorter title, such as the one suggested below. However, please feel free to not change the title if you prefer.

      [i] Are peer reviewers influenced by their work being cited (or not)?

      I like the slightly simpler: “Are peer reviewers influenced by their work being cited?”

      [ii] To better reflect the findings in the article, please revise the abstract along the following lines:

      Peer reviewers for journals sometimes write that one or more of their own articles should have been cited in the article under review. In some cases such comments are justified, but in other cases they are not. Here, using a sample of more than 37000 peer reviews for four journals that use open peer review and make all article versions available, we use a matched study design to explore this and other phenomena related to citations in the peer review process. We find that reviewers who were cited in the article under review were less likely to approve the original version of an article compared with reviewers who were not cited (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03), but were more likely to approve a revised article in which they were cited (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23). Moreover, for all versions of an article, reviewers who asked for their own articles to be cited were much less likely to approve the article compared with reviewers who did not do this (odds ratio = 0.15; adjusted 99.4% CI: 0.08-0.30). However, reviewers who had asked for their own articles to be cited were much more likely to approve a revised article that cited their own articles compared to a revised article that did not (odds ratio = 3.5; 95% CI: 2.0-6.1).

      I have re-written the abstract along the lines suggested. I have not included the finding that cited reviewers were less likely to approve the article due to the adjusted 99.4% interval including 1.

      [iii] The use of the phrase "self-citation" to describe an author citing an article by one of the reviewers is potentially confusing, and I suggest you avoid this phrase if possible.

      I have removed “self-citation” everywhere and instead used “citations to their own articles”.

      [iv] I think the captions for figures 2, 3 and 4 would benefit from rewording to more clearly describe what is being shown in the figure. Please consider revising the caption for figure 2 as follows, and revising the captions for figures 3 and 4 along similar lines. Please also consider replotting some of the panels so that the values on the horizontal axes of the top panel align with the values on the bottom panel.

      I have aligned the odds and probability axes as suggested which better highlights the important differences. I have updated the figure captions as outlined.

      Figure 2: Odds ratios and probabilities for reviewers giving a more or less favourable recommendation depending on whether they were cited in the article.

      Top left: Odds ratios for reviewers giving a more favourable (Approved) or less favourable (Reservations or Not approved) recommendation depending on whether they were cited in the article. Reviewers who were cited in version 1 of the article (green) were less likely to make a favourable recommendation (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03), but they were more likely to make a favourable recommendation (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23) if they were cited in a subsequent version (blue). Top right: Same data as top left displayed in terms of probabilities. From the top, the lines show the probability of a reviewer approving: a version 1 article in which they are not cited (please give mean value and CI); a version 1 article in which they are cited (mean value and CI); a version 2 (or higher) article in which they are not cited (mean value and CI); and a version 2 (or higher) article in which they are cited (mean value and CI).

      Bottom left: Same data as top left except that more favourable is now defined as Approved or Reservations, and less favourable is defined as Not approved. Again, reviewers who were cited in version 1 were less likely to make a favourable recommendation (odds ratio = 0.84; adjusted 99.4% CI: 0.57-1.23), and reviewers who were cited in subsequent versions were more likely to make a favourable recommendation (odds ratio = 1.12; adjusted 99.4% CI: 0.59-2.13).

      Bottom right: Same data as bottom left displayed in terms of probabilities. From the top, the lines show the probability of a reviewer approving: a version 1 article in which they are not cited (please give mean value and CI); a version 1 article in which they are cited (mean value and CI); a version 2 (or higher) article in which they are not cited (mean value and CI); and a version 2 (or higher) article in which they are cited (mean value and CI).

      This figure is based on an analysis of [Please state how many articles, reviewers, reviews etc are included in this analysis].

      In all the panels a dot represents a mean, and a horizontal line represents an adjusted 99.4% confidence interval.
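The link between the odds-ratio panels and the probability panels described in this caption can be illustrated with a small worked example. This is only a sketch of the standard odds-to-probability conversion, not the analysis code used for the figure. The baseline approval probability of 0.50 is a hypothetical value chosen for illustration (the caption leaves the actual means to be filled in), and the function names are made up for this example.

```python
def odds(p):
    """Convert a probability (0 < p < 1) to odds."""
    return p / (1.0 - p)


def probability(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability in the comparison group, given a baseline probability and an odds ratio."""
    return probability(odds(baseline_prob) * odds_ratio)


# Hypothetical baseline: probability of approval when the reviewer is not cited.
BASELINE = 0.50

# Odds ratios quoted in the caption text above.
for label, ratio in [("cited, version 1", 0.84), ("cited, version 2+", 1.61)]:
    prob = apply_odds_ratio(BASELINE, ratio)
    print(f"{label}: odds ratio = {ratio:.2f} -> probability = {prob:.3f}")
```

Because the conversion depends on the baseline, the same odds ratio maps to different probability gaps at different baselines, which is one reason showing both scales side by side is informative.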

      Reviewer #1 (Recommendations for the Authors):

      A big recommendation to the author would be to consider putting a lot of the statistical analysis in an appendix and describing the methods and results in more accessible terms in the main text. This would help more readers see the baby through the bath water.

      I have moved four paragraphs from the methods to the supplement. These are the sample size, the two sensitivity analyses on including co-reviewers and confounding by reviewers’ characteristics, and the analysis examining potential bias for the reviewers with no OpenAlex record.

      One possibility, that may have been accounted for, but it is hard to say given the density of the analysis, is the possibility that an author who follows the recommendations to cite the reviewer has also followed all the other reviewer requests. This could account for the much higher likelihood of acceptance. Conversely an author who has rejected the request to cite the reviewer may be more likely to have rejected many of the other suggestions leading to a rejection. I couldn't discern whether the analysis had accounted for this possibility. If it has, it needs to be said more prominently; if it hasn't, this possibility at least needs to be discussed. It would be good to see other alternative explanations for the results discussed (and if possible dismissed) in the discussion section too.

      This is an interesting idea. It’s also possible that authors more often accept and include any citation requests as it gives them more license to push back on other more involved changes that they would prefer not to make, e.g., running a new analysis. To examine this would require an analysis of the authors’ responses to the reviewers, and I have now added this as a limitation.

      I hope this paper will have an impact on scientific publishing but I fear that it won't. This is no reflection on the paper but more a reflection on the science publishing system.

      I do not have any additional references (written by myself or others!) I would like the author to include

      Thanks. I appreciate that extra thought is needed when peer reviewing papers on peer review. I do not know the reviewers’ names! I have added one additional reference suggested by the reviewers which had relevant results on previous surveys of coercive citations for the section on “Related research”.

      Reviewer #2 (Recommendations for the Authors):

      (1) Would it be possible for the author to control for academic discipline? Some disciplines cite at different rates and have different citation sub-cultures; for example, Wilhite and Fong (2012) show that editorial coercive citation differs among the social science and business disciplines. Is it possible that reviewers from different disciplines just take a totally different view of requesting self-citations?

      Wilhite, A.W., & Fong, E.A. 2012. Coercive citation in academic publishing. Science, 335: 542-543.

      This is an interesting idea, but the number of disciplines would need to be relatively broad to keep a sufficient sample size. The Catch-22 is then whether broad disciplines are different enough to show cultural differences. Overall, this is an idea for future work.

      (2) I would like the author to be much more clear about their results in the discussion section. In line 214, they state that "Reviewers who requested a self-citation were much less likely to approve the article for all versions." Maybe in the discussion some language along the lines of "Although reviewers who requested self-citation were actually much less likely to approve an article, my more detailed analyses show that this was not the case when reviewers requested a self-citation without reason or with the inclusion of coercive language such as 'need' or 'please'." Again, word it as you like, but I think it should be made clear that requests for self-citation alone are not a problem. In fact, I would argue that what the author says in lines 250 to 255 in the discussion reflects that reviewers who request self-citations (maybe for good reasons) are more likely to be the real experts in the area and why those who did not request a self-cite did not notice the omission. It is my understanding that editors are trying to get warm bodies to review and thus reviewers are not all equally qualified. Could it be that requesting self-citations for a good reason is a proxy for someone who actually knows the literature better? I'm not saying this is a fact, but it is a possibility. I get this is said in the abstract, but worth fleshing out in the discussion.

      I have updated the discussion after a new text analysis and have addressed this important question of whether self-citations are different from citations to other articles. The idea that some self-citers are more aware of the relevant literature is interesting, although this is very hard to test because they could also just be more aware of their own work. The question of whether self-citations are justified is a key question and one that I’ve tried to address in an updated discussion.

      Reviewer #3 (Recommendations for the Authors):

      Data and code availability are in good shape. At a high level, I recommend:

      Toning down the interpretation of reviewers' motivation, especially since some of this is mitigated by findings presented in the paper.

      I have reworded the discussion and included a warning on the observational study design.

      Devote more time detailing exactly what data are being presented in each figure/table and results section as described in more detail in the main review (n, selection criteria, conditional subsetting, etc.).

      I agree and have provided more details in each figure legend.

      Reviewer #4 (Recommendations for the Authors):

      A few aspects of the paper are not clear:

      I did not follow Figure 4. Are the "self citation" labels supposed to be "citation to other research"?

      Thanks for picking up this error which has now been fixed.

      I did not understand how to parse the left column of Figure 2

      As per the editor’s suggestion, the figure legend has been updated.

      Table 3: Please use different markers for the different curves so that it is clearly demarcated even in grayscale print

      I presume you meant Figure 3 not Table 3. I’ve varied the symbols in all three odds ratio plots.

      Supplementary S3: Typo "Approvep"

      Fixed, thanks.

      OTHER CHANGES: As well as the four reviews, my paper was reviewed by an AI-reviewer which provided some useful suggestions. I have mentioned this review in the acknowledgements. I have reversed the order of figure 5 to show the probability of “Approved” as this is simpler to interpret.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We thank the reviewers for their detailed comments, which have already helped us improve our manuscript. The responses below detail changes we have already made as part of the Review Commons revision plan, and further changes we expect to make in a longer revision period.


      Reviewer #1

      Major points

      It is mentioned throughout the manuscript that 3 plates were evaluated per line. I believe these are independently differentiated plates. This detail is critical concerning rigor and reproducibility. This should be clearly stated in the Methods section and in the first description of the experimental system in the Results section for Figure 1.

      These experimental details have now been clarified. Unless otherwise stated, all findings were confirmed in three independently differentiated plates from the same line or at least one differentiation from each of three lines.

      For the patient-specific lines - how many lines were derived per patient?

      This has now been clarified in the methods. Microfluidic reprogramming of a small number of amniocytes produces one line per patient representing a pool of clones. Subcloning from individual cells would not be possible within the timeframe of a pregnancy.

      Methods: For patient-specific iPSC lines, one independent iPSC line was obtained per patient following microfluidic mmRNA reprogramming.

      Was the Vangl2 variant introduced by prime editing? Base editing? The details of the methods are sparse.

      We have now expanded these details:

      Methods: VANGL2 knock-in lines were generated using CRISPR-Cas9 homology directed repair editing by Synthego (SO-9291367-1). The guide sequence was AUGAGCGAAGGGUGCGCAAG and the donor sequence was CAATGAGTACTACTATGAGGAGGCTGAGCATGAGCGAAGGGTGTGCAAGAGGAGGGCCAGGTGGGTCCCTGGGGGAGAAGAGGAGAG. Sequence modification was confirmed by Sanger sequencing before delivery of the modified clones, and Sanger sequencing was repeated after expansion of the lines (Supplementary Figure 5) as well as SNP arrays (Illumina iScan, not shown) confirming genomic stability.

      Some additional suggestions for improvement.

      The abstract could be more clearly written to effectively convey the study's importance. Here are some suggestions.

      Line 26: Insert "apicobasal" before "elongation" - the way it is written, I initially interpreted it as anterior-posterior elongation.

      Line 29: Please specify that the lines refer to 3 different established parent iPSC lines with distinct origins and established using different reprogramming methods, plus 2 control patient-derived lines. - The reproducibility of the cell behaviors is impressive, but this is not captured in the abstract.

      Line 32: add that this mutation was introduced by CRISPR-Cas9 base/prime editing.

      The last sentence of the abstract states that the study only links apical constriction to human NTDs, but also reveals that neural differentiation and apical-basal elongation were found.

      The introduction could also use some editing.

      Line 71: insert "that pulls actin filaments together" after "power strokes"

      Line 73: "apically localized," do you mean "mediolaterally" or "radially"?

      Line 75: Can you specify that PCP components promote "mediolaterally orientated" apical constriction

      Line 127: Specify that NE functions including apical-basal elongation and neurodifferentiation are disrupted in patient-derived models

      These text changes have all been made.

      Reviewer #2:

      Major comments:

      1. Figure 1. The authors use F-actin to segment cell areas. Perhaps this could be done more accurately with ZO-1, as F-actin cables can cross the surface of a single cell. In any case, the authors need to show a measure of segmentation precision: segmented image vs. raw image plus a nuclear marker (DAPI, H2B-GFP), so we can check that the number of segmented cells matches the number of nuclei.

      We used ZO-1 to quantify apical areas of the VANGL2-knockin lines in Figure 3. Segmentation of neuroepithelial apical areas based on F-actin staining is commonplace in the field (e.g. Fig 9 of Bogart & Brooks 2025 as a recent example), and is generally robust because the cell junctions are much brighter than any apical fibres not associated with the apical cortex. However, we accept that at earlier stages of differentiation there may be more apical fibres when cells are cuboidal. We have therefore repeated our analysis of apical area using ZO-1 staining as suggested, shown in the new Supplementary Figure 1, analysing a more temporally-detailed time course in one iPSC line. This new analysis confirms our finding of lack of apical area change between days 2-4 of differentiation, then progressive reduction of apical area between days 4-8, further validating our system. Including nuclear images is not helpful because of the high nuclear index of pseudostratified epithelia (e.g. see Supplementary Figure 7) which means that nuclei overlap along the apicobasal axis. Individual nuclei cannot be related to their apical surface in projected images.

      2. Lines 156-166. The authors claim that changes in gene expression precede morphological changes. I am not convinced this is supported by their data. Fig. 1g (epithelial thickness) and Fig. 1k (PAX6 expression) seem to have similar dynamics. The authors can perform a cross-correlation between the two plots to see which Δt gives maximum correlation. If Δt

      We are happy to do this analysis fully in revision. Our initial analysis performing cross-correlation between apical area and CDH2 protein in one line shows the highest cross-correlation at Δt = -1, suggesting neuroepithelial CDH2 increases before apical area decreases. In contrast, the same analysis comparing apical area versus PAX6 shows Δt = 0, suggesting concurrence. This analysis will be expanded to include the other markers we quantified and the manuscript text amended accordingly. We are keen to undertake additional experiments to test whether these cells swap their key cadherins - CDH1 and CDH2 - before they begin to undergo morphological changes (see the response to Reviewer 3's minor comment 1 immediately below).
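The kind of lag analysis discussed in this exchange can be sketched as follows. This is a minimal illustration of finding the lag Δt that maximises the correlation between two equally sampled time series, not the analysis used for the manuscript. The two series are hypothetical numbers, the function names are invented for this example, and the sign convention (a positive lag means the second series follows the first) is an assumption that may differ from the convention behind the Δt values reported above.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(x, y, max_lag):
    """Correlate x[t] with y[t + lag] for every integer lag in [-max_lag, max_lag]."""
    n = len(x)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:n + lag]
        scores[lag] = pearson(a, b)
    return scores


def best_lag(x, y, max_lag):
    """Lag with the highest correlation; positive means y follows x."""
    scores = lagged_correlation(x, y, max_lag)
    return max(scores, key=scores.get)


# Hypothetical daily measurements: 'follower' repeats 'leader' one step later.
leader = [1, 3, 2, 5, 4, 7, 6, 9]
follower = [0, 1, 3, 2, 5, 4, 7, 6]
print(best_lag(leader, follower, max_lag=2))
```

For measurements that move in opposite directions, such as a marker that rises while apical area falls, one would rank lags by the absolute value of the correlation instead. With only a handful of time points, each lag also discards samples, so the result should be read cautiously.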

      3. Figure 2d. The laser ablation experiment in the presence of ROCK inhibitor is clear, as I can easily see the cell outlines before and after the experiment. In the absence of ROCK inhibitor, the cell edges are blurry, and I am not convinced the outline that the authors drew is really the cell boundary. Perhaps the authors can try to ablate a larger cell patch so that the change in area is more defined.

      The outlines on these images are not intended to show cell boundaries, but rather link landmarks visible at both timepoints to calculate cluster (not cell) change in area. This is as previously shown in Galea et al Nat Commun 2021 and Butler et al J Cell Sci 2019. We have now amended the visualisation of retraction in Figure 2 to make representation of differences between conditions more intuitive.

      4. Figure 2d. Do the cells become thicker after recoil?

      This is unlikely because the ablated surface remains in the focal plane. Unfortunately, we are unable to image perpendicularly to the direction of ablation to test whether their apical surface moves in Z even by a very small amount. This has now been clarified in the results:

      Results: The ablated surface remained within the focal plane after ablation, indicating minimal movement along the apical-basal axis.

      5. Figure 3. The authors mention their previous study in which they show that Vangl2 is not cell-autonomously required for neural closure. It will be interesting to study whether this also the case in the present human model by using mosaic cultures.

      We agree with the reviewer that this is one of the exciting potential future applications of our model, which will first require us to generate stable fluorescently-tagged lines (to identify those cells which lack VANGL2). We will also need to extensively analyze controls to validate that mixing fluo-tagged and untagged lines does not alter the homogeneity of differentiation, or apical constriction, independently of VANGL2 deletion. As such, the reviewer is suggesting an altogether new project which carries considerable risk and will require us to secure dedicated funding to undertake.

      6. Lines 403-415. The authors report poor neural induction and neuronal differentiation in GOSB2. As far as I understand, this phenotype does not represent the in vivo situation. Thus, it is not clear to what extent the in vitro 2D model describes the human patient.

      The GOSB2 iPSC line we describe does represent the in vivo situation in Med24 knockout mouse embryos, but is clearly less severe because we are still able to detect MED24 protein expressed in this line. We do not have detailed clinical data of the patient from which this line was obtained to determine whether their neurological development is normal. However, it is well established that some individuals who have spina bifida also have abnormalities in supratentorial brain development. It is therefore likely that abnormalities in neuron differentiation/maturation are concomitant with spina bifida. Our findings in the GOSB2 line complement earlier studies which also identified deficiencies in the ability of patient-derived lines to form neurons, but were unable to functionally assess neuroepithelial cell behaviours we studied. This has now been clarified in the discussion:

      Discussion: Neuroepithelial cells of the GOSB2 line described here, which has partial loss of MED24, similarly produce a thinner neuroepithelium with larger apical areas. Although apical areas were not analysed in mouse models of Med24 deletion, these embryos also have shorter and non-pseudostratified neuroepithelium.

      Our GOSB2 line - which retains readily detectable MED24 protein - is clearly less severe than the mouse global knockout, and the clinical features of the patient from which this line was derived are milder than the phenotype of Med24 knockout embryos [68]. Mouse embryos lacking one of Med24's interaction partners in the mediator complex, Med1, also have thinner neuroepithelium and diminished neuronal differentiation but successfully close their neural tube [85].

      7. The experimental feat to derive cell lines from amniotic fluid and to perform experiments before birth is, in my view, heroic. However, I do not feel I learned much from the in vitro assays. There are many genetic changes that may cause the in vivo phenotype in the patient. The authors focus on MED24, but there is not enough convincing evidence that this is the key gene. I would like to suggest overexpression of MED24 as a rescue experiment, but I am not sure this is a single-gene phenotype. In addition, the fact that one patient line does not differentiate properly leads me to think that the patient lines do not strengthen the manuscript, and that perhaps additional clean mutations might contribute more.

      We thank the reviewer for their praise of our personalised medicine approach and fully agree that neural tube defects are rarely monogenic. The patient lines we studied were not intended to provide mechanistic insight, but rather to demonstrate the future applicability of our approach to patient care. Our vision is that every patient referred for fetal surgery of spina bifida will have amniocytes (collected as part of routine cystocentesis required before surgery) reprogrammed and differentiated into neuroepithelial cells, then neural progenitors, to help stratify their post-natal care. One could also picture these cells becoming an autologous source for future cell-based therapies if they pass our reproducible analysis pipeline as functional quality control. This has now been clarified in the discussion:

      Discussion: The multi-genic nature of neural tube defect susceptibility, compounded by uncontrolled environmental risk factors (including maternal age and parity [102]), means that patient-derived iPSC models are unlikely to provide mechanistic insight. They do provide personalised disease models which we anticipate will enable functional validation of genetic diagnoses for patients and their parents' recurrence risk in future pregnancies, and may eventually stratify patients' postnatal care. We also envision this model will enable quality control of patient-derived cells intended for future autologous cell replacement therapies, as is being developed in post-natal spinal cord injury [103].

      Minor comments:

      1. Figure 1c. Text is cropped at the edge of the image.

      This image has been corrected.

      Reviewer #2 (Significance (Required)):

      ...In addition, the model was unsuccessful in one of the two patient-derived lines, which limits generalizability and weakens claims of patient-specific predictive value.

      We disagree with the reviewer that "the model was unsuccessful in one of the two patient-derived lines". The GOSB1 line demonstrated deficiency of neuron differentiation independently of neuroepithelial biomechanical function, whereas the GOSB2 line showed earlier failure of neuroepithelial function. We also do not, at this stage, make patient-specific predictive claims: this will require longer-term matching of cell model findings with patient phenotypes over the next 5-10 years.

      Reviewer #3:

      Major comments

      1) One of my few concerns with this work is that the relative constriction of the apical surface with respect to the basal surface is not directly quantified for any of the experiments. This worry is slightly compounded by the 3D reconstructions Figure 1h, and the observation that overall cell volume is reduced and cell height increased simultaneously to area loss. Additionally, the net impact of apical constriction in tissues in vivo is to create local or global curvature change, but all the images in the paper suggest that the differentiated neural tissues are an uncurved monolayer even missing local buckles. I understand that these cells are grown on flat adherent surfaces limiting global curvature change, but is there evidence of localized buckling in the monolayer? While I believe, along with the authors, that their phenotypes are likely failures in apical constriction, I think they should work to strengthen this conclusion. I think the easiest way (and hopefully using data they already have) would be to directly compare apical area to basal area on a cell wise basis for some number of cells. Given the heterogeneity of cells, perhaps 30-50 cells per condition/line/mutant would be good? I am open to other approaches; this just seems like it may not require additional experiments.

      As the reviewer observes, our cultures cannot bend because they are adhered on a rigid surface. The apical and basal lengths of the cultures will therefore necessarily be roughly equal in length. Some inwards bending of the epithelium is expected at the edges of the dish, but these cannot be imaged. The live imaging we show in Figure 2 illustrates that, just as happens in vivo, apical constriction is asynchronous. This means not all cells will have 'bottle' shapes in the same culture. We now illustrate the evolution of these shapes in more detail in Supplementary Figure 1 (shown in point 2.1 above).

      Additionally, the reviewer's comment motivated us to investigate local buckles in the apical surface of our cultures when their apical surfaces are dilated by ROCK inhibition. We hypothesised that the very straight apical surface in normal cultures is achieved by a balance of apical cell size and tension with pressure differences at the cell-liquid interface. Consistent with our expectation, the apical surface of ROCK-inhibited cultures becomes wrinkled (new Supplementary Figure 3). The VANGL2-KI lines do not develop this tortuous apical surface (as shown in Figure 3), which is to be expected given their modification is present throughout differentiation unlike the acute dilation caused by ROCK inhibition.

      This new data complements our visualisation of apical constriction in live imaging, apical accumulation of phospho-myosin, and quantification of ROCK-dependent apical tension as independent lines of evidence that our cultures undergo apical constriction.

      2) Another slight experimental concern I have regards the difference in laser ablation experiments detailed in Figure 3h-i from those of Figure 2d-e. It seems like WT recoil values in 3h-i are more variable and of a lower average than the earlier experiments and given that it appears significance is reached mainly by impact of the lower values, can the authors explain if this variability is expected to be due to heterogeneity in the tissue, i.e. some areas have higher local tension? If so, would that correspond with more local apical constriction?

      There is no significant difference in recoil between the control lines in Figures 2 and 3, albeit the data in Figure 3 is more variable (necessitating more replicates: none were excluded). We also showed laser ablation recoil data in Supplementary Figure 10, in which we did identify a graphing error (now corrected, also no significant difference in recoil from the other control groups).

      Minor comments

      1) There seems to be a critical window at day 5 of the differentiation protocol, both in terms of cell morphology and the marker panel presented in Figure 1i. Do the authors have any data spanning the hours from day 5 to 6? If not, I don't think they need to generate any, but I do think this is a very interesting window worthy of further discussion for a couple of reasons. First, several studies of mouse neural tube closure have shown that various aspects of cell remodeling are temporally separable. For example, between Grego-Bessa et al 2016 and Brooks et al 2020 we can infer that apicobasal elongation rapidly increases starting at E8.5, whereas apical surface area reduction and constriction are apparent somewhat earlier at E8.0. I think it would be interesting to see if this separability is conserved in humans. Second, is there a sense of how the temporal correlation between the pluripotent and early neural fate marker data presented here corroborate or contradict the emerging set of temporally resolved RNA seq data sets of mouse development at equivalent early neural stages?

      Cell shape analysis between days 5 and 6 has now been added (see the response to point 2.1 below). As the reviewer predicted, this is a transition point when apical area begins to decrease and apicobasal elongation begins to increase.

      We also thank the reviewer for this prompt to more closely compare our data to the previous mouse publications, which we have added to the discussion. The Grego-Bessa 2016 paper appears to show an increase in thickness between E7.75 and E8.5, but these are not statistically compared. Previous studies showed rapid apicobasal elongation during the period of neural fold elevation, when neuroepithelial cells apically constrict. This has now been added to the discussion:

      Discussion: In mice, neuroepithelial apicobasal thickness is spatially-patterned, with shorter cells at the midline under the influence of SHH signalling [14,77,78]. Apicobasal thickness of the cranial neural folds increases from ~25 µm at E7.75 to ~50 µm at E8.5 [79]: closely paralleling the elongation between days 2 and 8 of differentiation in our protocol. The rate of thickening is non-uniform, with the greatest increase occurring during elevation of the neural folds [80], paralleled in our model by the rapid increase in thickness between days 4-6 as apical areas decrease. Elevation requires neuroepithelial apical constriction and these cells' apical area also decreases between E7.75 and E8.5 in mice [79], but we and others have recently shown that this reduction is both region and sex-specific [14,81]. Specifically, apical constriction occurs in the lateral (future dorsal) neuroepithelium: this corresponds with the identity of the cells generated by the dual SMAD inhibition model we use [56]. More recently, Brooks et al [82] showed that the rapid reduction in apical area from E8-E8.5 is associated with cadherin switching from CDH1 (E-cadherin) to CDH2 (N-cadherin). This is also directly paralleled in our human system, which shows low-level co-expression of CDH1 and CDH2 at day 4 of differentiation, immediately before apical area shrinks and apicobasal thickness increases.

      Prompted by the in vivo data in Brooks et al (2025) [82], we are keen to further explore the timing of CDH1/CDH2 switching versus apical constriction with new experimental data in revisions.

      2) Can the authors elaborate a bit more on what is known regarding apicobasal thickening and pseudo-stratification and how their work fits into the current understanding in the discussion? This is a very interesting and less well studied mechanism critical to closure, which their model is well suited to directly address. I am thinking mainly of the Grego-Bessa et al., 2016 work on PTEN, though interestingly the work of Ohmura et al., 2012 on the NUAK kinases also shows reduced tissue thickening (and apical constriction) and I am sure I have missed others. Given that the authors identify MED24 as a likely candidate for the lack of apicobasal thickening in one of their patient derived lines, is there any evidence that it interacts with any of the known players?

      We have now added further discussion on the mechanisms by which the neuroepithelium undergoes apicobasal elongation. Nuclear compaction is likely to be necessary to allow pseudostratification and apicobasal elongation. The reviewer's comment has led us to realise that diminished chromatin compaction is a potential outcome of MED24 down-regulation in our GOSB2 patient-derived line. Figure 4D suggests the nuclei of our MED24 deficient patient-derived line are less compacted than control equivalents and we propose to quantify nuclear volume in more detail to explore this possibility.

      Additionally, we have already expanded our discussion as suggested by the reviewer:

      Discussion: *Mechanistic separability of apical constriction and apicobasal elongation is consistent with biomechanical modelling of Xenopus neural tube closure showing that both are independently required for tissue bending61. Nonetheless, neuroepithelial apical constriction and apicobasal elongation are co-regulated in mouse models: for example, deletion of Nuak1/283, Cfl184, and Pten79 all produce shorter neuroepithelium with larger apical areas. Neuroepithelial cells of the GOSB2 line described here, which has partial loss of MED24, similarly produce a thinner neuroepithelium with larger apical areas. Although apical areas were not analysed in mouse models of Med24 deletion, these embryos also have shorter and non-pseudostratified neuroepithelium.*

      Our GOSB2 line - which retains readily detectable MED24 protein - is clearly less severe than the mouse global knockout, and the clinical features of the patient from which this line was derived are milder than the phenotype of Med24 knockout embryos68. Mouse embryos lacking one of Med24's interaction partners in the mediator complex, Med1, also have thinner neuroepithelium and diminished neuronal differentiation but successfully close their neural tube85. As general regulators of polymerase activity, MED proteins have the potential to alter the timing or level of expression of many other genes, including those already known to influence pseudostratification or apicobasal elongation. MED depletion also causes redistribution of cohesin complexes86 which may impact chromatin compaction, reducing nuclear volume during differentiation.

      3) Is there any indication that Vangl2 is weakly or locally planar polarized in this system? Figure 2F seems to suggest not, but Supplementary Figure 5 does show at least more supracellular cable like structures that may have some polarity. I ask because polarization seems to be one of the properties that differs along the anteroposterior axis of the neural plate, and I wonder if this offers some insight into the position along the axis that this system most closely models?

      VANGL2 does not appear to be planar polarised in this system. This is similar to the mouse spinal neuroepithelium, in which apical VANGL2 is homogenous but F-actin is planar polarised (Galea et al Disease Models and Mechanisms 2018). We do observe local supracellular cable-like enrichments of F-actin in the apical surface of iPSC-derived neuroepithelial cells. _We propose to compare the length of F-actin cables and coherency of their orientation at the start and end of neuroepithelial differentiation, and in wild-type versus VANGL2-mutant epithelia._

      4) I think some of the commentary on the strengths and limitations of the model found in the Results section should be collated and moved to the discussion in a single paragraph. For example ' This could also briefly touch on/compare to some of the other models utilizing hiPSCs (These are mentioned briefly in the intro, but this comparison could be elaborated on a bit after seeing all the great data in this work).

      These changes have now been made:

      __Discussion:__ Some of these limitations, potentially including inclusion of environmental risk factors, can be addressed by using alternative iPSC-derived models93,94. For example, if patients have suspected causative mutations in genes specific to the surface (non-neural) ectoderm, such as GRHL2/3, 3D models described by Karzbrun et al49 or Huang et al95 may be informative. Characterisation of surface ectoderm behaviours in those models is currently lacking. These models are particularly useful for high-throughput screens of induced mutations95, but their reproducibility between cell lines, necessary to compare patient samples to non-congenic controls, remains to be validated. Spinal cell identities can be generated in human spinal cord organoids, although these have highly variable morphologies96,97. As such, each iPSC model presents limitations and opportunities, to which this study contributes a reductionist and highly reproducible system in which to quantitatively compare multiple neuroepithelial functions.

      5) While the authors are generally good about labeling figures by the day post smad inhibition, in some figures it is not clear either from the images or the legend text. I believe this includes supplemental figures 2,5,6,8, and 10 (apologies if I simply missed it in one or more of them)

      These have now been added.

      6) The legend for Figure 2 refers to a panel that is not present and the remaining panel descriptions are off by a letter. I'm guessing this is a versioning error as the text itself seems largely correct, but it may be good to check for any other similar errors that snuck in

      This has now been corrected.

      7) The cell outlines in Figure 3d are a bit hard to see both in print and on the screen, perhaps increase the displayed intensity?

      This has now been corrected.

      8) The authors show a fascinating piece of data in Supplementary Figure 1, demonstrating that nuclear volume is halved by day 8. Do they have any indication if the DNA content remains constant (e.g., integrated DAPI density)? I suppose it must, and this is a minor point in the grand scheme, but this represents a significant nuclear remodeling and may impact the overall DNA accessibility.

      We agree with the reviewer that the reduction in nuclear volume is important data both because it informs understanding of the reduction in total cell volume, and because it suggests active chromatin compaction during differentiation. Unfortunately, the thicker epithelium and superimposition of nuclei in the differentiated condition means the laser light path is substantially different, making direct comparisons of intensity uninterpretable. Additionally, the apical-most nuclei will mostly be in G2/M phase due to interkinetic nuclear migration. As such, the comparison of DAPI integrated density between epithelial morphologies would not be informative.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #3

      Evidence, reproducibility and clarity

      This manuscript by Ampartzidis et al., significantly extends the human induced pluripotent stem cell system originally characterized by the same group as a tool for examining cellular remodeling during differentiation stages consistent with those of human neural tube closure (Ampartzidis et al., 2023). Given that there are no direct ways to analyze cellular activity in human neural tube closure in vivo, this model represents an important platform for investigating neural tube defects which are a common and deleterious human developmental disease. Here, the authors carefully test whether this system is robust and reproducible when using hiPSC cells from different donors and pluripotency induction methods and find that despite all these variables the cellular remodeling programs that occur during early neural differentiation are statistically equivalent, suggesting that this system is a useful experimental substrate. Additionally, the carefully selected donor populations suggest these aspects of human neural tube closure are likely to be robust to sexual dimorphism and to reasonable levels of human genetic background variation, though more fully testing that proposition would require significant effort and be beyond the scope of the current work. Subsequent to this careful characterization, the authors next tested whether this system could be used to derive specific insights into cell remodeling during early neural differentiation. First, they used a reverse genetics approach to knock in a human point mutation in the critical regulator of planar cell polarity and apical constriction, Vangl2. Despite being identified in a patient, this R353C variant has not been directly functionally tested in a human system. The authors find that this variant, despite showing normal expression and phospho-regulation, leads to defects consistent with a failure in apical constriction, a key cell behavior required to drive curvature change during cranial closure. 
      Finally, the authors test the utility of their hiPSC platform to understand human patient-specific defects by differentiating cells derived from two clinical spina bifida patients. The authors identify that one of these patients is likely to have a significant defect in fully establishing early proneural identity as well as defects in apicobasal thickening. While early remodeling occurs normally in the other patient, the authors observe significant defects in later neuronal induction and maturation. In addition, using whole exome sequencing the authors identify candidate variant loci that could underlie these defects.

      Major comments

      1) One of my few concerns with this work is that the relative constriction of the apical surface with respect to the basal surface is not directly quantified for any of the experiments. This worry is slightly compounded by the 3D reconstructions in Figure 1h, and the observation that overall cell volume is reduced and cell height increased simultaneously to area loss. Additionally, the net impact of apical constriction in tissues in vivo is to create local or global curvature change, but all the images in the paper suggest that the differentiated neural tissues are an uncurved monolayer even missing local buckles. I understand that these cells are grown on flat adherent surfaces limiting global curvature change, but is there evidence of localized buckling in the monolayer? While I believe, along with the authors, that their phenotypes are likely failures in apical constriction, I think they should work to strengthen this conclusion. I think the easiest way (and hopefully using data they already have) would be to directly compare apical area to basal area on a cell-wise basis for some number of cells. Given the heterogeneity of cells, perhaps 30-50 cells per condition/line/mutant would be good? I am open to other approaches; this just seems like it may not require additional experiments.

      2) Another slight experimental concern I have regards the difference in laser ablation experiments detailed in Figure 3h-i from those of Figure 2d-e. It seems like WT recoil values in 3h-i are more variable and of a lower average than the earlier experiments and given that it appears significance is reached mainly by impact of the lower values, can the authors explain if this variability is expected to be due to heterogeneity in the tissue, i.e. some areas have higher local tension? If so, would that correspond with more local apical constriction?
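
      The cell-wise comparison of apical and basal area suggested in major comment 1 could be summarised along the following lines. This is only a minimal sketch: the function name `constriction_summary`, the choice of the apical/basal area ratio as the summary statistic, and the input format (one paired measurement per segmented cell) are illustrative assumptions, not the authors' analysis.

```python
import numpy as np


def constriction_summary(apical_areas, basal_areas):
    """Summarise paired apical/basal areas measured on the same cells.

    Hypothetical helper: both inputs are 1-D sequences of areas (same
    units, same cell order). A ratio below 1 means the apical surface
    is smaller than the basal surface of that cell.
    """
    apical = np.asarray(apical_areas, dtype=float)
    basal = np.asarray(basal_areas, dtype=float)
    if apical.shape != basal.shape or apical.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")
    if np.any(basal <= 0):
        raise ValueError("basal areas must be positive")

    ratio = apical / basal
    return {
        "n_cells": int(ratio.size),
        "median_ratio": float(np.median(ratio)),
        # Fraction of cells whose apical surface is smaller than the basal one.
        "fraction_constricted": float(np.mean(ratio < 1.0)),
    }
```

      Per-cell ratios from each condition, line, or mutant could then be compared between groups with whatever test suits the experimental design; that choice is left open here.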

      Minor comments

      1) There seems to be a critical window at day 5 of the differentiation protocol, both in terms of cell morphology and the marker panel presented in Figure 1i. Do the authors have any data spanning the hours from day 5 to 6? If not, I don't think they need to generate any, but I do think this is a very interesting window worthy of further discussion for a couple of reasons. First, several studies of mouse neural tube closure have shown that various aspects of cell remodeling are temporally separable. For example, between Grego-Bessa et al 2016 and Brooks et al 2020 we can infer that apicobasal elongation rapidly increases starting at E8.5, whereas apical surface area reduction and constriction are apparent somewhat earlier at E8.0. I think it would be interesting to see if this separability is conserved in humans. Second, is there a sense of how the temporal correlation between the pluripotent and early neural fate marker data presented here corroborate or contradict the emerging set of temporally resolved RNA seq data sets of mouse development at equivalent early neural stages?

      2) Can the authors elaborate a bit more on what is known regarding apicobasal thickening and pseudo-stratification and how their work fits into the current understanding in the discussion? This is a very interesting and less well studied mechanism critical to closure, which their model is well suited to directly address. I am thinking mainly of the Grego-Bessa et al., 2016 work on PTEN, though interestingly the work of Ohmura et al., 2012 on the NUAK kinases also shows reduced tissue thickening (and apical constriction) and I am sure I have missed others. Given that the authors identify MED24 as a likely candidate for the lack of apicobasal thickening in one of their patient derived lines, is there any evidence that it interacts with any of the known players?

      3) Is there any indication that Vangl2 is weakly or locally planar polarized in this system? Figure 2F seems to suggest not, but Supplementary Figure 5 does show at least more supracellular cable like structures that may have some polarity. I ask because polarization seems to be one of the properties that differs along the anteroposterior axis of the neural plate, and I wonder if this offers some insight into the position along the axis that this system most closely models?

      4) I think some of the commentary on the strengths and limitations of the model found in the Results section should be collated and moved to the discussion in a single paragraph. For example ' This could also briefly touch on/compare to some of the other models utilizing hiPSCs (These are mentioned briefly in the intro, but this comparison could be elaborated on a bit after seeing all the great data in this work).

      5) While the authors are generally good about labeling figures by the day post smad inhibition, in some figures it is not clear either from the images or the legend text. I believe this includes supplemental figures 2,5,6,8, and 10 (apologies if I simply missed it in one or more of them)

      6) The legend for Figure 2 refers to a panel that is not present and the remaining panel descriptions are off by a letter. I'm guessing this is a versioning error as the text itself seems largely correct, but it may be good to check for any other similar errors that snuck in

      7) The cell outlines in Figure 3d are a bit hard to see both in print and on the screen, perhaps increase the displayed intensity?

      8) The authors show a fascinating piece of data in Supplementary Figure 1, demonstrating that nuclear volume is halved by day 8. Do they have any indication if the DNA content remains constant (e.g., integrated DAPI density)? I suppose it must, and this is a minor point in the grand scheme, but this represents a significant nuclear remodeling and may impact the overall DNA accessibility.

      Significance

      Overall, I am enthusiastic about this work and believe it represents a significant step forward in the effort to establish precision medicine approaches for diagnoses of the patient-specific causative cellular defects underlying human neural tube closure defects. This work systematizes an important and novel tool to examine the cellular basis of neural tube defects. While other hiPSC models of neural tube closure capture some tissue level dynamics, which this model does not, they require complex microfluidic approaches and have limited accessibility to direct imaging of cell remodeling. Comparatively, the relative simplicity of the reported model and the work demonstrating its tractability as a patient-specific and reverse genetic platform make it unique and attractive. This work will be of interest to a broad cross section of basic scientists interested in the cellular basis of tissue remodeling and/or the early events of nervous system development as well as clinical scientists interested in modeling the consequences of patient specific human genetic deficits identified in neural tube defect pregnancies.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #2

      Evidence, reproducibility and clarity

      The authors' work focuses on studying cell morphological changes during differentiation of hPSCs into neural progenitors in a 2D monolayer setting. The authors use genetic mutations in VANGL2 and patient-derived iPSCs to show that (1) human phenotypes can be captured in the 2D differentiation assay, and (2) VANGL2 in humans is required for neural contraction, which is consistent with previous studies in animal models. The results are solid and convincing, the data are quantitative, and the manuscript is well written. The 2D model they present successfully addresses the questions posed in the manuscript. However, the broad impact of the model may be limited, as it does not contain NNE cells and does not exhibit tissue folding or tube closure, as seen in neural tube formation. Patient-derived lines are derived from amniotic fluid cells, and the experiments are performed before birth, which I find to be a remarkable achievement, showing the future of precision medicine.

      Major comments:

      1. Figure 1. The authors use F-actin to segment cell areas. Perhaps this could be done more accurately with ZO-1, as F-actin cables can cross the surface of a single cell. In any case, the authors need to show a measure of segmentation precision: segmented image vs. raw image plus a nuclear marker (DAPI, H2B-GFP), so we can check that the number of segmented cells matches the number of nuclei.

      2. Lines 156-166. The authors claim that changes in gene expression precede morphological changes. I am not convinced this is supported by their data. Fig. 1g (epithelial thickness) and Fig. 1k (PAX6 expression) seem to have similar dynamics. The authors can perform a cross-correlation between the two plots to see which Δt gives maximum correlation. If Δt < 0, then it would suggest that gene expression precedes morphology, as they claim. Fig. 1j shows that NANOG drops before the morphological changes, but loss of NANOG is not specific to neural differentiation and therefore should not be related to the observed morphological changes.

      3. Figure 2d. The laser ablation experiment in the presence of ROCK inhibitor is clear, as I can easily see the cell outlines before and after the experiment. In the absence of ROCK inhibitor, the cell edges are blurry, and I am not convinced the outline that the authors drew is really the cell boundary. Perhaps the authors can try to ablate a larger cell patch so that the change in area is more defined.

      4. Figure 2d. Do the cells become thicker after recoil?

      5. Figure 3. The authors mention their previous study in which they show that Vangl2 is not cell-autonomously required for neural closure. It will be interesting to study whether this is also the case in the present human model by using mosaic cultures.

      6. Lines 403-415. The authors report poor neural induction and neuronal differentiation in GOSB2. As far as I understand, this phenotype does not represent the in vivo situation. Thus, it is not clear to what extent the in vitro 2D model describes the human patient.

      7. The experimental feat to derive cell lines from amniotic fluid and to perform experiments before birth is, in my view, heroic. However, I do not feel I learned much from the in vitro assays. There are many genetic changes that may cause the in vivo phenotype in the patient. The authors focus on MED24, but there is not enough convincing evidence that this is the key gene. I would like to suggest overexpression of MED24 as a rescue experiment, but I am not sure this is a single-gene phenotype. In addition, the fact that one patient line does not differentiate properly leads me to think that the patient lines do not strengthen the manuscript, and that perhaps additional clean mutations might contribute more.
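
      The cross-correlation proposed in major comment 2 can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' analysis: the function name `lag_of_max_correlation` is hypothetical, both traces are assumed to be sampled at the same evenly spaced time points, and the sign convention is chosen so that a negative lag means the expression trace leads the thickness trace, matching the Δt < 0 reading in the comment.

```python
import numpy as np


def lag_of_max_correlation(expression, thickness, dt=1.0):
    """Return (lag, peak) of the normalised cross-correlation.

    Hypothetical helper. `expression` and `thickness` are 1-D traces
    sampled at the same evenly spaced time points, `dt` apart.
    A negative lag means `expression` changes before `thickness`.
    Constant traces (zero variance) are not supported.
    """
    e = np.asarray(expression, dtype=float)
    h = np.asarray(thickness, dtype=float)
    if e.shape != h.shape or e.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")

    # Z-score both traces so the result does not depend on units.
    e = (e - e.mean()) / e.std()
    h = (h - h.mean()) / h.std()

    # np.correlate(e, h, "full")[i] = sum_n e[n + k] * h[n], k = i - (len(h) - 1)
    xcorr = np.correlate(e, h, mode="full") / e.size
    lags = np.arange(-(h.size - 1), e.size)

    best = int(np.argmax(xcorr))
    return lags[best] * dt, float(xcorr[best])
```

      With only a handful of time points, as in a day-by-day differentiation series, the lag can be resolved no more finely than the sampling interval, so the result would at best indicate which trace tends to lead.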

      Minor comments:

      1. Figure 1c. Text is cropped at the edge of the image.

      Significance

      This study establishes a quantitative, reproducible 2D human iPSC-to-neural-progenitor platform for analyzing cell-shape dynamics during differentiation. Using VANGL2 mutations and patient-derived iPSCs, the work shows that (1) human phenotypes can be captured in a 2D differentiation assay and (2) VANGL2 is required for neural contraction (apical constriction), consistent with animal studies. The results are solid, the data are quantitative, and the manuscript is well written. Although the planar system lacks non-neural ectoderm and does not exhibit tissue folding or tube closure, it provides a tractable baseline for mechanistic dissection and genotype-phenotype mapping. The derivation of patient lines from amniotic fluid and execution of experiments before birth is a remarkable demonstration that points toward precision-medicine applications, while motivating rescue strategies and additional clean genetic models. However, overall I did not learn anything substantively new from this manuscript; the conclusions largely corroborate prior observations rather than extend them. In addition, the model was unsuccessful in one of the two patient-derived lines, which limits generalizability and weakens claims of patient-specific predictive value.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Recommendations for the authors):

      I think the authors did a fantastic job investigating the annotation issues I brought up in the first round. I am somewhat assured that the size of the dataset has prevented any real systematic issues from impacting their results. However, there are many clear underlying biases in the data, as the authors show, which could have a number of unexpected impacts on the results. For example, the consistently lower gene numbers could be biased towards certain types of genes or in certain lineages, making the CAZyme analysis unreliable. I do not agree with the authors' choice to put these results in as a supplement with little or no other references to it in the main manuscript. Many of the conclusions that are drawn should be hedged by these findings. There should at least be a rationale given for why the authors took the approach they did, such as mentioning the points they brought up in the response.

      We thank the reviewer for the positive assessment of our revision. We added text in the Discussion acknowledging limitations of the gene annotation approach. 

      “Because of the uniform yet simplified gene annotation approach, the total number of genes may be underestimated in some assemblies in our dataset, as observed when comparing the same species in JGI Mycocosm. Although this pattern is not biased toward any particular group of species, access to high-quality, well-annotated genomes could provide a clearer picture of the relative contributions of specific gene families.”

      We also added more text in the Methods (section "Sordariomycetes genomes") mentioning in more detail the investigation of potential biases related to assembly quality and annotation (with reference to Supplementary Results).

      A couple minor corrections:

      Figure 1C, both axes say PC1?

      Fixed.

      Figure S12, scales don't match so it's hard to compare, axis labels are inconsistent.

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      I congratulate the authors on the revision work. Their manuscript is very interesting and reads very well.

      I found several occurrences of « saprophyte ». Note that « saprotroph » is much better since fungi are not « phytes ».

      We thank the reviewer for the positive feedback. The occurrences of “saprophytes” were corrected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      van der Linden et al. report on the development of a new green-fluorescent sensor for calcium, following a novel rational design strategy based on the modification of the cyan-emissive sensor mTq2-CaFLITS. Through a mutational strategy similar to the one used to convert EGFP into EYFP, coupled with optimization of strategic amino acids located in proximity of the chromophore, they identify a novel sensor, G-CaFLITS. Through a careful characterization of the photophysical properties in vitro and the expression level in cell cultures, the authors demonstrate that G-CaFLITS combines a large lifetime response with a good brightness in both the bound and unbound states. This relative independence of the brightness on calcium binding, compared with existing sensors that often feature at least one very dim form, is an interesting feature of this new type of sensors, which allows for a more robust usage in fluorescence lifetime imaging. Furthermore, the authors evaluate the performance of G-CaFLITS in different subcellular compartments and under two-photon excitation in Drosophila. While the data appears robust and the characterization thorough, the interpretation of the results in some cases appears less solid, and alternative explanations cannot be excluded.

      Strengths:

      The approach is innovative and extends the excellent photophysical properties of the mTq2-based sensors to more red-shifted variants. While the spectral shift might appear relatively minor, as the authors correctly point out, it has interesting practical implications, such as the possibility to perform FLIM imaging of calcium using widely available laser wavelengths, or to reduce background autofluorescence, which can be a significant problem in FLIM.

      The screening was simple and rationally guided, demonstrating that, at least for this class of sensors, a careful choice of screening positions is an excellent strategy to obtain variants with large FLIM responses without the need of high-throughput screening.

      The description of the methodologies is very complete and accurate, greatly facilitating the reproduction of the results by others, or the adoption of similar methods. This is particularly true for the description of the experimental conditions for optimal screening of sensor variants in lysed bacterial cultures.

      The photophysical characterization is very thorough and complete, and the vast amount of data reported in the supporting information is a valuable reference for other researchers willing to attempt a similar sensor development strategy. Particularly well done is the characterization of the brightness in cells, and the comparison on multiple parameters with existing sensors.

      Overall, G-CaFLITS displays excellent properties for a FLIM sensor: very large lifetime change, bright emission in both forms and independence from pH in the physiological range.

      Weaknesses:

      The paper demonstrates the application of G-CaFLITS in various cellular subcompartments without providing direct evidence that the sensor's response is not affected by the targeting. Showing at least that the lifetime values in the saturated state are similar in all compartments would improve the robustness of the claims.

      In some cases, the interpretation of the results is not fully convincing, leaving alternative hypotheses as a possibility. This is particularly the case for the claim of the origin of the strongly reduced brightness of G-CaFLITS in Drosophila. The explanation of the intensity changes of G-CaFLITS also shows some inconsistency with the basic photophysical characterization.

      While the claims generally appear robust, in some cases they are conveyed with a lack of precision. Several sentences in the introduction and discussion could be improved in this regard. Furthermore, the use of the signal-to-noise ratio as a means of comparison between sensors appears to be imprecise, since it is dependent on experimental conditions.

      We thank the reviewer for a thorough evaluation and for suggestions to improve our manuscript. We are happy with the recognition of the strengths of this work. The list with weaknesses has several valid points which will be addressed in a point-by-point reply and a revision.

      Reviewer #2 (Public review):

      Summary:

      Van der Linden et al. describe the addition of the T203Y mutation to their previously described fluorescence lifetime calcium sensor Tq-Ca-FLITS to shift the fluorescence to green emission. This mutation was previously described to similarly red-shift the emission of green and cyan FPs. Tq-Ca-FLITS_T203Y behaves as a green calcium sensor with opposite polarity compared with the original (lifetime goes down upon calcium binding instead of up). They then screen a library of variants at two linker positions and identify a variant with slightly improved lifetime contrast (Tq-Ca-FLITS_T203Y_V27A_N271D, named G-Ca-FLITS). The authors then characterize the performance of G-Ca-FLITS relative to Tq-Ca-FLITS in purified protein samples, in cultured cells, and in the brains of fruit flies.

      Strengths:

      This work is interesting as it extends their prior work generating a calcium indicator scaffold for fluorescent protein-based lifetime sensors with large contrast at a single wavelength, which is already being adopted by the community for production of other FLIM biosensors. This work effectively extends that from cyan to green fluorescence. While the cyan and green sensors are not spectrally distinct enough (~20-30nm shift) to easily multiplex together, it at least shifts the spectra to wavelengths that are more commonly available on commercial microscopes.

      The observations of organellar calcium concentrations were interesting and could potentially lead to new biological insight if followed up.

      Weaknesses:

      (1) The new G-Ca-FLITS sensor doesn't appear to be significantly improved in performance over the original Tq-Ca-FLITS, no specific benefits are demonstrated.

      (2) Although it was admirable to attempt in vivo demonstration in Drosophila with these sensors, depolarizing the whole brain with high potassium is not a terribly interesting or physiological stimulus and doesn't really highlight any advantages of their sensors; G-Ca-FLITS appears to be quite dim in the flies.

      We thank the reviewer for a thorough evaluation and for suggestions to improve our manuscript. Although the spectral shift of the green variant is modest, we have added new data (figure 7) to the manuscript that demonstrates multiplex imaging of G-Ca-FLITS and Tq-Ca-FLITS.

      As for the listed weaknesses we respond here:

      (1) Although we agree that the performance in terms of dynamic range is not improved, the advantage of the green sensor over the cyan version is that the brightness is high in both states.

      (2) We agree that the performance of G-Ca-FLITS is disappointing in Drosophila. We feel that this is important data to report, and it makes it clear that Tq-Ca-FLITS is a better choice for this system. Depolarization of the entire brain was done to measure the maximal lifetime contrast.

      Reviewer #3 (Public review):

      Summary:

      The authors present a variant of a previously described fluorescence lifetime sensor for calcium. Much of the manuscript describes the process of developing appropriate assays for screening sensor variants, and thorough characterization of those variants (inherent fluorescence characteristics, response to calcium and pH, comparisons to other calcium sensors). The final two figures show how the sensor performs in cultured cells and in vivo Drosophila brains.

      Strengths:

      The work is presented clearly and the conclusion (this is a new calcium sensor that could be useful in some circumstances) is supported by the data.

      Weaknesses:

      There are probably few circumstances where this sensor would facilitate experiments (calcium measurements) for which other sensors would prove insufficient.

      We thank the reviewer for the evaluation of our manuscript. As for the indicated weakness, we agree that the main application of genetically encoded calcium biosensors is to measure qualitative changes in calcium. However, it can be argued that due to a lack of tools the absolute quantification has been very challenging. Now, thanks to large contrast lifetime biosensors the quantitative measurements are simplified, there are new opportunities, and the probe reported here is an improvement over existing probes as it remains bright in both states, further improving quantitative calcium measurements.

      Reviewer #1 (Recommendations for the authors):

      While the science in the paper appears solid, the methods well grounded and excellently documented, the manuscript would benefit from a revision to improve the clarity of the exposition. In particular:

      Part of the introduction appears like a patchwork of information with poor logical consequentiality. The authors rapidly pass from the impact of brightness on FLIM accuracy, to mitochondrial calcium in pathology, to the importance of the sensor's affinity, to a sentence on sensor's kinetics, to fluorescent dyes and bioluminescence, to conclude that sensors should be stable at mitochondrial pH. I highly recommend rewriting this part.

      We thank the referee for the comment and we have adjusted the introduction to better connect the parts and improve the logical flow. The updated introduction addresses all the feedback by the reviewers on different aspects of the introductory text, and we have removed the section on dyes and bioluminescence. We feel that the introduction is better structured now.

      The reference to particular amino acid positions would greatly benefit from including images of the protein structure in which the positions are highlighted, similar to what the same authors do in their fluorescent protein development papers. While in the case of sensors a crystal structure might be lacking, highlighting the positions with respect to an AlphaFold-generated structure or the structure of mTq2 might still be helpful.

      We appreciate this remark and we have added a sequence alignment of the FLITS probes to supplemental Figure S4. This shows the residues with their numbers, and we have also highlighted the different domains, linkers and mutations. We think that this linear representation works better than a 3D structure (one issue is that AlphaFold fails to display the chromophore and it usually has poor confidence for linker residues).

      The use of SNR, as defined by the authors (mean of the lifetime divided by standard deviation) appears a poorly suited parameter to compare sensors, as it depends on the total number of collected photons and on the strength of the algorithms used to retrieve the lifetime value. In an extreme example, if one would collect uniform images with millions of photons per pixel, most likely SNR would be extremely good for all sensors in all states, irrespective of the fact that some states are dimmer (within reasonable limits). On the other hand, if the same comparison would be performed at a level of thousands or hundreds of photons per pixel, the effect of different brightness on the SNR would be much more dramatic. While in general I fully agree with the core concept of the paper, i.e. that avoiding low-brightness forms leads more easily to experiments with higher SNR, I would suggest to stick to comparing the sensors in terms of brightness and refer to SNR (if needed) only when describing the consequences on measurements.

      The reviewer is right that in absolute terms the SNR is not meaningful. In addition to acquisition time, it depends on expression levels. Yet, it is possible to compare the change in SNR between the apo- and saturated states, and that is what is shown in figure 5. We have added text to better explain that the change in SNR is relevant here:

      “The absolute SNR is not relevant here, as it will depend on the expression level and acquisition time. But since we have measured the two extremes in the same cells, we can evaluate how the SNR changes between these states for each separate probe”
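
      The dependence of SNR on photon count discussed above can be illustrated with a minimal sketch. It assumes an ideal, background-free mono-exponential decay in which the lifetime is estimated as the mean photon arrival time; under that assumption the standard deviation of the estimate is tau / sqrt(N), so the SNR (mean divided by standard deviation) equals sqrt(N). The function names and the 10% relative brightness are hypothetical and are not taken from the manuscript.

```python
import math

def ideal_lifetime_snr(photons: float) -> float:
    """SNR (mean / SD) of the mean photon arrival time for an ideal
    mono-exponential decay. The SD of the mean of N exponential samples
    is tau / sqrt(N), so mean / SD = sqrt(N), independent of tau."""
    if photons <= 0:
        raise ValueError("photon count must be positive")
    return math.sqrt(photons)

def snr_ratio(relative_brightness: float) -> float:
    """SNR of a dim state relative to a bright state when both are
    measured at the same expression level and acquisition time."""
    if relative_brightness <= 0:
        raise ValueError("relative brightness must be positive")
    return math.sqrt(relative_brightness)

# Hypothetical example: the dim state emits 10% of the photons of the bright state.
for bright_photons in (1e2, 1e4, 1e6):
    dim_photons = 0.1 * bright_photons
    print(
        f"N_bright={bright_photons:.0e}  "
        f"SNR_bright={ideal_lifetime_snr(bright_photons):.1f}  "
        f"SNR_dim={ideal_lifetime_snr(dim_photons):.1f}  "
        f"ratio={snr_ratio(0.1):.3f}"
    )
```

      In this idealized picture the absolute SNR grows with the number of collected photons, whereas the ratio between the two states depends only on their relative brightness. Real measurements also involve background, the instrument response and multi-exponential decays, which this sketch ignores.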

      Some statements from the authors or aspects of the paper appear problematic:

      (1) "Additionally, the fluorescence of most sensors is a non-linear function of calcium concentration, usually with Hill coefficients between 2 and 3. This is ideal when the probe is used as a binary detector for increases in Ca2+ concentrations, but it makes robust quantification of low, or even intermediate, calcium concentrations extremely challenging."

      To the best of my knowledge, for all sensors the fluorescence response is a nonlinear function of calcium concentrations. If the authors have specific examples in mind in which this is not true, they should cite them specifically. Furthermore, the Hill coefficient defines the range of concentrations in which the sensor operates, while the fact that "low concentrations" might be hard to detect depends only on the dim fluorescence of some sensors in the unbound form.

      We agree with the reviewer that this part was unclear and confusing. The sentence “Additionally, the fluorescence of most sensors is a non-linear function of calcium concentration, usually with Hill coefficients between 2 and 3” was not relevant in this section, so we removed it. The passage now reads:

      “Many GECIs harboring a single fluorescent protein (FP), like GCaMPs, are optimized for a large intensity change, and have a (very) dim state when calcium levels are below the KD of the probe (Akerboom et al., 2013; Dana et al., 2019; Shen et al., 2018; Zhang et al., 2023; Zhao et al., 2011). This is ideal when the probe is used as a binary detector for increases in Ca2+ concentrations, but it makes robust quantification of low, or even intermediate, calcium concentrations extremely challenging”

      (2) "The affinity of a sensor is of major importance: a low KD can underestimate high concentrations and vice versa."

      It is not clear to me why the concentrations would be underestimated, rather than just being less precise. Also, if a calibration curve is plotted in linear scale rather than logarithmic scale, it appears that the precision problem is much more severe near saturation (where low lifetime changes result in large concentration changes) than near zero (where low concentration changes produce large lifetime changes).

      We agree that this could be better explained. What we meant to say is that concentrations that are ~10x lower or higher than the KD cannot be precisely measured. See also our reply to the next comment.

      (3) "Differences can also arise due to the method of calibration, i.e. when the absolute minimum and maximum signal are not reached in the calibration procedure (Fernandez-Sanz et al., 2019)."

      Unless better explained, this appears obvious and not worth mentioning.

      What may be obvious to the reviewer (and to us) may not be obvious to the reader, and that’s why this is included. To make it clearer we rephrased this part as a list of four items:

      “Accurate determination of the affinity of a sensor is important and there are several issues that need to be considered during the calibration and the measurements: (i) the concentration can only be measured with sufficient precision when it is in the range between 10x K<sub>D</sub> and 1/10x K<sub>D</sub>, (ii) the calibration is only valid when the two extremes are reached during the calibration procedure (Fernandez-Sanz et al., 2019), (iii) the sensor’s kinetics should be sufficiently fast to track the calcium changes, and (iv) the biosensor should be compatible with the high mitochondrial pH of 8 (Cano Abad et al., 2004; Llopis et al., 1998).”
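
      Item (i) of the list above can be made concrete with a small sketch based on a Hill-type binding curve. The Hill coefficient of 1 is an assumption made purely for illustration, the function names are hypothetical, and the K<sub>D</sub> of 339 nM is the calibration value quoted elsewhere in this response.

```python
def fraction_bound(ca_nM: float, kd_nM: float, hill: float = 1.0) -> float:
    """Fraction of sensor in the calcium-bound state for a Hill-type curve."""
    return ca_nM**hill / (kd_nM**hill + ca_nM**hill)

def calcium_from_fraction(fraction: float, kd_nM: float, hill: float = 1.0) -> float:
    """Invert the Hill curve; only defined strictly between 0 and 1."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return kd_nM * (fraction / (1.0 - fraction)) ** (1.0 / hill)

KD_NM = 339.0  # value quoted in this response; Hill coefficient of 1 is assumed

for multiple in (0.1, 1.0, 10.0):
    f = fraction_bound(multiple * KD_NM, KD_NM)
    print(f"[Ca2+] = {multiple:>4} x KD -> fraction bound = {f:.3f}")

# Effect of the same small read-out error (0.01 in fraction bound)
# at mid-range and close to saturation.
for f in (0.50, 0.95):
    shift = calcium_from_fraction(f + 0.01, KD_NM) - calcium_from_fraction(f, KD_NM)
    print(f"fraction {f:.2f}: +0.01 error shifts the estimate by {shift:.0f} nM")
```

      With a Hill coefficient of 1, the window from 1/10x to 10x K<sub>D</sub> spans roughly 9% to 91% of the full response, so only about 9% of the response range is left on either side of it. The second loop shows that a fixed error in the measured fraction translates into a much larger concentration error near saturation than at mid-range.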

      (4) In the experiments depicted in Figure 6C the underlying assumption is that the sensor behaves in the same way independently of the compartment to which it is targeted. This is not necessarily the case. It would be valuable to see the plots of Figure 6C and D discussed in terms of lifetime. Is the saturating lifetime value the same in all compartments?

      This is a valid point and we have now included a plot with the actual lifetime data for each of the organelles (figure S15). 

      We have also added text to discuss this point: “We note that the underlying assumption of the quantification of organellar calcium concentrations is that the lifetime contrast is the same. This is broadly true for most of the measurements (Figure S15). Yet, there are also differences. It is currently unclear whether the discrepancies are due to differences in the physicochemical properties of the compartments, or whether there is a technical reason (the efficiency of ionomycin for saturating the biosensor in the different compartments is unknown, as far as we know). This is something that is worth revisiting. A related issue that deserves attention is the level of agreement between in vitro and in vivo calibrations.”

      (5) A similar problem arises for the observation of different calcium levels in peripheral mitochondria. In figure S11b, the values of the two lifetime components of a biexponential fit are displayed. Both the long and short components seem to be different. This is an interesting observation, as in an ideal sensor (in which the "long lifetime conformation" is the same whether the sensor is bound to the analyte or not, and similarly for the short lifetime one) those values should be identical. While it is entirely possible that this is not the case for G-CaFLITS, since the authors have conducted a calibration experiment using time-domain FLIM, could they show the behavior of the lifetimes and preamplitudes? Are the trends consistent with their interpretation of a different calcium level in the two mitochondrial populations?

      We have analyzed the calibration data from TCSPC experiments done with the Leica Stellaris. From these data (acquired at high photon counts, as this is purified protein in solution), we infer that both the short and long lifetime do change as a function of calcium concentration. In particular, the long lifetime shows a substantial change, which we cannot explain at this moment. We agree that this is interesting and may potentially give insight into the conformational changes that give rise to the lifetime change.

      The lifetime data of the mitochondria were acquired with a different FLIM setup, but the trend is consistent: both the long and short lifetime decrease in the peripheral mitochondria that have a higher calcium concentration.

      Author response image 1

      (6) "The lifetime response of Tq-Ca-FLITS and the ΔF/F response of jGCaMP7f resembled each other, with both signals gradually increasing over the span of 3-4 minutes after we increased external [K+]; the two signals then hit a plateau for ~1 min, followed by a return to baseline and often additional plateaus (Figure 8B-C). By comparison, G-Ca-FLITS responses were more variable, typically exhibiting a smaller ramping phase and seconds-long spikes of activity rather than minutes-long plateaus (Figure 8C)."

      This statement does not appear fully consistent with the data in Figure 8. While in figure 8B it looks like GCaMP and mTq-CaFLITS have very similar profiles, these curves come from one single experiment out of a very variable dataset (see Figure 8C). If one would for example choose the second curve of GCaMP in Figure 8C, it would look very similar to the response of G-CaFLITS in figure 8B, and the argument would be reversed. What do the averages look like?

      Indeed, the dynamics of the responses are very variable and we do not want to draw attention to these differences in the dynamics, so we have removed the comparison. Instead, the differences in intensity change and lifetime contrast are of importance here. To answer the question of the reviewer, we have added a new panel (D) which shows the average responses for each of the GECIs.

      (7) "Although the calibration is equipment independent under ideal conditions, and only needs to be performed once, we prefer to repeat the calibration for different setups to account for differences in temperature or pulse frequency."

      While I generally agree with the statement, it is imprecise. A change in temperature is generally expected to affect the Kd, so rather than "preferring to repeat", it is a requirement for accurate quantification at different concentrations. I am not sure I understand what the pulse frequency is in this context, and how it affects the Kd.

      We thank the referee for pointing out that our text is imprecise and confusing. What we meant to say is that we see differences between different set-ups and we have clarified this by changing the text. We have also added that it is “necessary” to repeat the calibration:

      “Although the calibration is equipment independent under ideal conditions, and only needs to be performed once, we do see differences between different set-ups. Therefore, it is necessary to repeat the calibration for different set-ups.”

      (8) "A recent effort to generate a green emitting lifetime biosensor used a GFP variant as a template (Koveal et al., 2022), and the resulting biosensor was pH sensitive in the physiological range. On the other hand, biosensors with a CFP-like chromophore are largely pH insensitive (van der Linden et al., 2021; Zhong et al., 2024)."

      The dismissal of the use of T-Sapphire as a pH independent template is inaccurate. The same group has previously reported other sensors (SweetieTS for glucose and Peredox for redox ratio) that are not pH sensitive. Furthermore, in Koveal et al. also many of the mTq2-based variants showed a pH response, suggesting that the pH dependence for the Lilac sensor might be more complex. Still, G-CaFLITS presents advantages in terms of the possibility to excite at longer wavelengths, which could be mentioned instead.

      We only want to make the point that adding the T203Y mutation to Turquoise-based lifetime biosensors may be a good approach for generating pH insensitive green biosensors. There is no point in dismissing other green biosensors and we have changed the text to: “Since biosensors with a CFP-like chromophore are largely pH insensitive (van der Linden et al., 2021; Zhong et al., 2024), and we show here that the pH independence is retained for the Green Ca-FLITS, we expect that adding the T203Y mutation to a cyan sensor is a good approach for generating pH-insensitive green lifetime-based sensors.”

      (9) "Usually, a higher QY results in a higher intensity; however, in G-Ca-FLITS the open state has a differential shaped excitation spectrum which leads to a decreased intensity. These effects combined have resulted in a sensor where the two different states have a similar intensity despite displaying a large QY and lifetime contrast."

      This statement does not seem to reflect the excitation spectra of Figure 1. If this explanation would be true, wouldn't there be an isoemissive point in the excitation spectrum (i.e. an excitation wavelength at which emission intensity would not change)?

      The excitation spectra in figure 1 are not ideal for the interpretation as these are not normalized. The normalized spectra are shown in figure S10, but for clarity we show the normalized spectra here below as well. For the FD-FLIM experiments we used a 446 nm LED that excites the calcium bound state more efficiently. Therefore, the lower brightness due to a lower QY of the calcium bound state is compensated by increased excitation. So the limited change in intensity is excitation wavelength dependent. We have added a sentence to the discussion to stress this:

      “The smallest intensity change is obtained when the calcium-bound state is preferably excited (i.e. near 450 nm) and the effect is less pronounced when the probe is excited near its peak at 474 nm”   

      (10) "We evaluated the use of Tq-Ca-FLITS and G-Ca-FLITS for 2P-FLIM and observed a surprisingly low brightness of the green variant in an intact fly brain. This result is consistent with a study finding that red-shifted fluorescent-protein variants that are much brighter under one-photon excitation are, surprisingly, dimmer than their blue cousins in multi-photon microscopy (Molina et al., 2017). The responses of both probes were in line with their properties in single photon FLIM, but given the low brightness of G-Ca-FLITS under 2-photon excitation, the Tq-Ca-FLITS may be a better choice for 2P-FLIM experiments."

      The differences appear strikingly high, and it seems improbable that a reduction in two-photon absorption coefficient might be the sole cause. How can the authors rule out a problem in expression (possibly organism-specific)?

      The reviewers are correct that the changes in brightness between G-Ca-FLITS and Tq-Ca-FLITS may arise from changes in expression levels. It is difficult to calibrate for these changes explicitly without a stable reference fluorophore. However, both the G-Ca-FLITS and Tq-Ca-FLITS transgenic flies were produced using the same plasmid backbone (the Janelia 20x-UAS-IVS plasmid), landed in the same insertion site (VK00005) of the same genetic background and were crossed to the same Janelia driver line (R60D05-Gal4), so at the level of the transcriptional machinery or genetic regulatory landscape the two lines are probably identical except for the few base pair differences between the G-Ca-FLITS and Tq-Ca-FLITS sequence. But the same level of transcription may not correspond to the same amount of stable protein in the ellipsoid body. So, we cannot rule out any organism-specific problems in expression. To examine the 2P excitation efficiency relative to 1P excitation efficiency, we have measured the fluorescence intensity of purified G-Ca-FLITS and Tq-Ca-FLITS on beads. See also the response to reviewer 3 and supplemental figure S14.

      Suggestions

      (1) The underlying assumption of any experiment using a biosensor is that the concentration of the biosensor should be roughly 2 orders of magnitude lower than the concentration of the analyte, otherwise the calibration equations do not hold. When measuring nM concentrations of calcium, this problem can be in principle very significant, as the concentration of the sensor in cells is likely in the low micromolar range. Calcium regulation by the cell should compensate for the problem, and the equations should hold. However, this might not hold true during experimental conditions that would disrupt this tight regulation. It might be a good thing to add a sentence to inform users about the limitations in interpreting calcium concentration data under such conditions.

      Good point. We have added this to the discussion: “All calcium indicators also act as buffers, and this limits the accuracy of the absolute measurements, especially for the lower calcium concentrations (Rose et al., 2014), as the expression of the biosensor is usually in the low micromolar range.”
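
      The buffering effect mentioned here can be sketched with a simple mass-balance calculation. This is an illustration under strong assumptions that are not taken from the manuscript: a closed volume without cellular calcium regulation, 1:1 binding between calcium and sensor, and hypothetical concentrations. The function name is likewise hypothetical.

```python
import math

def free_calcium_nM(total_ca_nM: float, sensor_nM: float, kd_nM: float) -> float:
    """Free Ca2+ at equilibrium in a closed volume with 1:1 binding.

    Mass balance: total = free + sensor * free / (kd + free), which gives
    free**2 + (kd + sensor - total) * free - kd * total = 0.
    The positive root is returned.
    """
    b = kd_nM + sensor_nM - total_ca_nM
    return (-b + math.sqrt(b * b + 4.0 * kd_nM * total_ca_nM)) / 2.0

KD_NM = 339.0        # value quoted in this response
TOTAL_CA_NM = 100.0  # hypothetical total calcium in the closed volume

for sensor_nM in (0.0, 10.0, 100.0, 1000.0):
    free = free_calcium_nM(TOTAL_CA_NM, sensor_nM, KD_NM)
    print(f"sensor = {sensor_nM:>6.0f} nM -> free Ca2+ = {free:.1f} nM")
```

      In this closed-volume picture the free calcium concentration drops as the sensor concentration approaches and exceeds the calcium concentration. In a living cell, calcium homeostasis is expected to compensate for much of this, which is why the caveat matters most when that regulation is disrupted.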

      (2) Different methods of lifetime "averaging", such as intensity or amplitude-weighted lifetime in time domain FLIM or phase and modulation in frequency domain might lead to different Kd in the same calibration experiment. This is an underappreciated factor that might lead to errors by users. Since the authors conducted calibrations using both frequency and time-domain, it would be useful to mention this fact and maybe add a table in the Supporting Information with the minima, maxima and Kds calculated using different lifetime averaging methods.

      To avoid biases due to fitting we prefer to use the phasor plot, this can be used for both frequency and time-domain methods and we added a sentence to the discussion to highlight this: “We prefer to use the phasor analysis (which can be used for both frequency- and time-domain FLIM), as it makes no assumptions about the underlying decay kinetics.”
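
      As a hedged illustration of how a phasor read-out can be converted into a bound fraction for a two-state sensor: the phasor of a mixture is the intensity-weighted average of the phasors of the two states, so the position along the line between the two extremes has to be corrected for their relative brightness. The sketch below is not the equation used in the manuscript; the lifetimes, modulation frequency, brightness values and function names are hypothetical.

```python
import math

def phasor_monoexp(tau_ns: float, freq_mhz: float) -> tuple[float, float]:
    """Phasor (g, s) of a mono-exponential decay at a given frequency."""
    omega_tau = 2.0 * math.pi * freq_mhz * 1e6 * tau_ns * 1e-9
    denom = 1.0 + omega_tau**2
    return 1.0 / denom, omega_tau / denom

def position_on_line(p, p_free, p_bound) -> float:
    """Project phasor p onto the line p_free -> p_bound (0 = free, 1 = bound)."""
    dx, dy = p_bound[0] - p_free[0], p_bound[1] - p_free[1]
    return ((p[0] - p_free[0]) * dx + (p[1] - p_free[1]) * dy) / (dx * dx + dy * dy)

def bound_fraction(position: float, brightness_free: float, brightness_bound: float) -> float:
    """Convert the intensity-weighted position into the molecular fraction bound."""
    weighted = position * brightness_free
    return weighted / (weighted + (1.0 - position) * brightness_bound)

# Hypothetical reference states: 4 ns (free) and 2 ns (bound) at 40 MHz,
# with the bound state emitting 80% of the photons of the free state.
p_free = phasor_monoexp(4.0, 40.0)
p_bound = phasor_monoexp(2.0, 40.0)
b_free, b_bound = 1.0, 0.8

# Simulate a mixture in which 30% of the molecules are calcium-bound.
alpha = 0.3
w_bound, w_free = alpha * b_bound, (1.0 - alpha) * b_free
mixed = tuple((w_free * f + w_bound * b) / (w_free + w_bound)
              for f, b in zip(p_free, p_bound))

position = position_on_line(mixed, p_free, p_bound)
print(f"position on line = {position:.3f}")
print(f"recovered bound fraction = {bound_fraction(position, b_free, b_bound):.3f}")
```

      No decay model is fitted to the mixture itself; only the two reference phasors and the relative brightness of the two states are needed. When both states are equally bright, the position on the line equals the bound fraction directly.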

      (3) The origin of the redshift observed in G-CaFLITS is likely pi-stacking, similar to the EGFP-to-EYFP case. Previous studies suggest that for mTq2-based sensors a change in rigidity would lead to a change in the non-radiative rate, which would result in similar changes in quantum yield and (amplitude-weighted average) lifetime. If pi-stacking plays a role, there could be an additional change in the radiative rate (as suggested also by the change in absorption spectra). Could this play a role in the relation between brightness and lifetime in G-CaFLITS? Given the extensive data collected by the authors, it should be possible to comment on these mechanistic aspects, which would be useful to guide future design.

      We do appreciate this suggestion, but we currently do not have the data to answer this question. The inverted response that we observe, solely due to the introduction of the tyrosine, is puzzling. Perhaps introduction of the mutation that causes the redshift in other cyan probes will provide more insight.

      Reviewer #2 (Recommendations for the authors):

      Specific points:

      The first section of Results is basically a description of how they chose the lysis conditions for screening in bacteria. I didn't see anything particularly novel or interesting about this, anyone working with protein expression in bacteria likely needs to optimize growth, lysis, purification, etc. This section should be moved to the Methods.

      As reviewer 1 lists the thorough documentation of this approach as one of the strengths, we prefer to keep it like this. We see this section as method development, rather than purely a method. If this section were moved to the Methods, it would remain largely invisible, and we think that would be a shame. Readers who are not interested can easily skip this section.

      In the Results section Characterization of G-Ca-FLITS, the authors state "Here, the calcium affinity was KD = 339 nM, higher compared to the calibration at 37 °C. This is in line with the notion that binding strength generally increases with decreasing temperature." However, the opposite appears to be true - at 37 °C they measured a KD of 209 nM which would represent higher binding strength at higher temperature.

      Thanks for catching this; we made a mistake. We have rephrased this to “higher compared to the calibration at 37 °C. This is unexpected as it is not in line with the notion that binding strength generally increases with decreasing temperature.”

      In Figure 8c, there should be a visual indicator showing the onset of application of high potassium, as there is in 8b.

      This is a good suggestion; a grey box has been added to indicate the time when high K+ saline was perfused.

      Reviewer #3 (Recommendations for the authors):

      I think the science of the manuscript is sound and the presentation is logical and clear. I have some stylistic recommendations.

      Supp Fig 1: The figure requires a bit of "eyeballing" to decide which conditions are best, and figuring out which spectra matched the final conditions took a little effort. Is there a way to quantify the fluorescence yield to better show why the one set of conditions was chosen? If it was subjective, then at least highlight the final conditions with a box around the spectra, making it a different colour, or something to make it stand out.

      Thanks for the comment; we added a green box.

      Supp Fig 3: Similar suggestion. Highlight the final variant that was carried forward (T203Y). The subtle differences in spectra are hard to discern when they are presented separately. How would it look if they were plotted all on one graph? Or if each mutant were presented as a point on a graph of Peak Em vs Peak Ex? Would T203Y be in the top right?

      We have added a light blue box for reference to make the differences clearer.

      Supp Fig 4 & Fig 1: Too much of the graph shows the uninteresting tails of the spectra and condenses the interesting part. Plotting from 400 nm to 600 nm would be more informative.

      We appreciate the suggestion but disagree. We prefer to show the spectra in their entirety, including the tails. The data will be available so other plots can be made by anyone.

      Fig 3a: People who are not experts in lifetime analysis are probably not very familiar with the phase/modulation polar plot. There should be an additional sentence or two in the main text that _briefly_ describes the basis for making the polar plot and the transformation to the fractional saturation plot in 3B. I can't think of a good way to transform Eq 3 from Supp Info into a sentence, but that's what I think is needed to make this transformation clearer.

      We appreciate the suggestion and feel that it is well explained here:

      "The two extreme values (zero calcium and 39 μM free calcium) are located on different coordinates in the polar plot and all intermediate concentrations are located on a straight line between these two extremes. Based on the position in the polar plot, we determined the fraction of sensor in the calcium-bound state, while considering the intensity contribution of both states"  

      Fig 4: The figure is great, and I love the comparison of different calcium sensors. But where is Tq-Ca-FLITS? I get that this is a figure of green calcium sensors, but it would be nice to see Tq-Ca-FLITS in there as well. The G-Ca-FLITS is compared to Tq-Ca-FLITS in Fig 5. Maybe I'm just missing why the bottom panel of Fig 5 cannot be replotted and included in Fig 4.

      The point is that we compare all the data with identical filter sets, i.e. for green FPs. Using these ex/em settings, the Tq probe would seriously underperform. Note that the data in fig. 5 are not normalized to a reference RFP and can therefore not be compared with data presented in figure 4.

      Fig 6: The BOEC data could easily be moved to Supp Figs. It doesn't contribute much relevant info.

      We are not keen on moving data to the supplement, as supplemental data are too often ignored. Moreover, we think that the BOEC data is valuable (as BOEC are primary cells and therefore a good model of a healthy human cell) and deserves a place in the main manuscript.

      2P FLIM / Fig 8 / Fig S4: The lack of brightness of G-Ca-FLITS in the 2P FLIM of fruit fly brain could have been predicted with a 2P cross section of the purified protein. If the equipment to perform such measurements is available, it could be incorporated into Fig S4.

      Unfortunately, we do not have access to equipment that measures the 2P cross section. As an alternative, we compared the 2P excitation efficiency with 1P excitation efficiency. To this end, we have used beads that were loaded with purified G-Ca-FLITS or Tq-Ca-FLITS. We have evaluated the fluorescence intensity of the beads using 1P (460 nm) and 2P (920 nm) excitation. Although the absolute intensity cannot be compared (the G-Ca-FLITS beads have a lower protein concentration), we can compare the relative intensities when changing from 1P to 2P. The 2P excitation efficiency of G-Ca-FLITS is comparable (if not better) to that of Tq-Ca-FLITS. This excludes the option that the G-Ca-FLITS has poor 2P excitability. We will include this data as figure S12.

      We have also added text to the results: “We evaluated the relative brightness of purified Tq-Ca-FLITS and G-Ca-FLITS on beads by either 1-Photon Excitation (1PE) (at 460 nm) or 2-Photon Excitation (2PE) (at 920 nm) and observed a similar brightness between the two modes of excitation (figure S14). This shows that the two probes have similar efficiencies in 2PE and suggests that the low brightness of G-Ca-FLITS in Drosophila is due to lower expression or poor folding.” and discussion: “The responses of both probes were in line with their properties in single photon FLIM, but given the low brightness of G-Ca-FLITS under 2-photon excitation in Drosophila, the Tq-Ca-FLITS is a better choice in this system. Yet, the brightness of G-Ca-FLITS with 2PE at 920 nm is comparable to Tq-Ca-FLITS, so we expect that 2P-FLIM with G-Ca-FLITS is possible in tissues that express it well.”

    1. Author Response:

      Reviewer #1 (Public Review):

      The work by Wang et al. examined how task-irrelevant, high-order rhythmic context could rescue the attentional blink effect via reorganizing items into different temporal chunks, as well as the neural correlates. In a series of behavioral experiments with several controls, they demonstrated that the detection performance of T2 was higher when occurring in different chunks from T1, compared to when T1 and T2 were in the same chunk. In EEG recordings, they further revealed that the chunk-related entrainment was significantly correlated with the behavioral effect, and the alpha-band power for T2 and its coupling to the low-frequency oscillation were also related to behavioral effect. They propose that the rhythmic context implements a second-order temporal structure to the first-order regularities posited in dynamic attention theory.

      Overall, I find the results interesting and convincing, particularly the behavioral part. The manuscript is clearly written and the methods are sound. My major concerns are about the neural part, i.e., whether the work provides new scientific insights to our understanding of dynamic attention and its neural underpinnings.

      1) A general concern is whether the observed behavioral related neural index, e.g., alpha-band power, cross-frequency coupling, could be simply explained in terms of ERP response for T2. For example, when the ERP response for T2 is larger for between-chunk condition compared to within-chunk condition, the alpha-power for T2 would be also larger for between-chunk condition. Likewise, this might also explain the cross-frequency coupling results. The authors should do more control analyses to address the possibility, e.g., plotting the ERP response for the two conditions and regressing them out from the oscillatory index.

      Many thanks for the comment. In short, the enhancement in alpha power and cross-frequency coupling results in the between-cycle condition compared with those in the within-cycle condition cannot be accounted for by the ERP responses for T2.

      In general, the rhythmic stimulation in the AB paradigm prevents EEG signals from returning to the baseline. Therefore, we cannot observe typical ERP components purely related to individual items, except for the P1 and N1 components related to the stream onset, which reveal no difference between the two conditions and are trailed by steady-state responses (SSRs) resonating at the stimulus rate (Fig. R1).

      Fig. R1. ERPs aligned to stream onset. EEG signals were filtered between 1–30 Hz, baseline-corrected (-200 to 0 ms before stream onset) and averaged across the electrodes in left parieto-occipital area where 10-Hz alpha power showed attentional modulation effect.

      To further inspect the potential differences in the target-related ERP signals between the within- and between-cycle conditions, we plotted the target-aligned waveforms for these experimental conditions. As shown in Fig. R2, a drop of ERP amplitude occurred for both conditions around T2 onset, and the difference between these two conditions was not significant (paired t-test estimated on mean amplitude every 20 ms from 0 to 700 ms relative to T1 onset, p > .05, FDR-corrected).

      Fig. R2. ERPs aligned to T1 onset. EEG signals were filtered between 1–30 Hz, and baseline-corrected using signals -100 to 0 ms before T1 onset. The two dashed lines indicate the onset of T1 and T2, respectively.

      Since there is a trend of enhanced ERP response for the between-cycle relative to the within-cycle condition during the period of 0 to 100 ms after T2 onset (paired t-test on mean amplitude, p =.065, uncorrected), we then directly examined whether such post-T2 responses contribute to the behavioral attentional modulation effect and behavior-related neural indices. Crucially, we did not find any significant correlation of such T2-related ERP enhancement with the behavioral modulation index (BMI), or with the reported effects of alpha power and cross-frequency coupling (PAC). Furthermore, after controlling for the T2-related ERP responses, there still remains a significant correlation between the delta-alpha PAC and the BMI (rpartial = .596, p = .019), which is not surprising given that the PAC is calculated based on an 800-ms time window covering more pre-T2 than post-T2 periods (see the response to point #4 for details) rather than around the T2 onset. Taken together, these results clearly suggest that the T2-related ERP responses cannot explain the attentional modulation effect and the observed behavior-related neural indices.
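
      For readers unfamiliar with the partial correlation reported above, the following sketch shows the standard first-order formula, which removes the linear contribution of a third variable from the correlation between two others. The data values and variable names are hypothetical and serve only as an illustration; they are not the study's data, and the sketch is not necessarily the exact procedure used in the analysis.

```python
import math

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y after controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz**2) * (1.0 - r_yz**2))

# Hypothetical per-participant values (not the study's data).
pac = [0.12, 0.30, 0.25, 0.41, 0.18, 0.36, 0.29, 0.45]  # delta-alpha coupling
bmi = [0.05, 0.20, 0.11, 0.28, 0.09, 0.22, 0.19, 0.30]  # behavioral modulation index
erp = [1.1, 0.7, 1.4, 0.9, 1.2, 1.0, 0.8, 1.3]          # T2-related ERP difference

r_partial = partial_r(pearson_r(pac, bmi), pearson_r(pac, erp), pearson_r(bmi, erp))
print(f"partial correlation (PAC vs BMI, controlling for ERP) = {r_partial:.3f}")
```

      If the control variable is uncorrelated with both measures, the partial correlation reduces to the ordinary correlation between them.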

      2) The alpha-band increase for T2 is indeed contradictory to the well known inhibitory function of alpha-band in attention. How could a target that is better discriminated elicit stronger inhibitory response? Related to the above point, the observed enhancement in alpha-band power and its coupling to low-frequency oscillation might derive from an enhanced ERP response for T2 target.

      Many thanks for the comment. We have briefly discussed this point in the revised manuscript (page 18, line 477).

      A widely accepted function of alpha activity in attention is that alpha oscillations suppress irrelevant visual information during spatial selection (Kelly et al., 2006; Thut et al., 2006; Worden et al., 2000). However, it becomes a controversial issue when there exists rhythmic sensory stimulation at alpha-band, just like the situation in the current study where both the visual stream and the contextual auditory rhythm were emitted at 10 Hz. In such a case, alpha-band neural responses at the stimulation frequency can be interpreted as either passively evoked steady-state responses (SSR) or actively synchronized intrinsic brain rhythms. From the former perspective (i.e., the SSR view), an increase in the amplitude or power at the stimulus frequency may indicate an enhanced attentional allocation to the stimulus stream that may result in better target detection (Janson et al., 2014; Keil et al., 2006; Müller & Hübner, 2002). Conversely, the latter view of the inhibitory function of intrinsic alpha oscillations would produce the opposite prediction. In a previous AB study, Janson and colleagues (2014) investigated this issue by separating the stimulus-evoked activity at 12 Hz (using the same power analysis method as ours) from the endogenous alpha oscillations ranging from 10.35 to 11.25 Hz (as indexed by individual alpha frequency, IAF). Interestingly, they found a dissociation between these two alpha-band neural responses, showing that the RSVP frequency power was higher in non-AB trials (T2 detected) than in AB trials (T2 undetected) while the IAF power exhibited the opposite pattern. According to these findings, the currently observed increase in alpha power for the between-cycle condition may reflect more of the stimulus-driven processes related to attentional enhancement. However, we don’t negate the effect of intrinsic alpha oscillations in our study, as the current design is not sufficient to distinguish between these two processes. 
      We also acknowledge that “alpha power” may not be the most precise term for our stimulus-related findings. We have therefore specified it as “neural responses to first-order rhythms at 10 Hz” and “10-Hz alpha power” in the revised manuscript (see page 12 in the Results section and page 18 in the Discussion section).

      As for the contribution of T2-related ERP response to the observed effect of 10 Hz power and cross-frequency coupling, please refer to our response to point #1.

      References:

      Janson, J., De Vos, M., Thorne, J. D., & Kranczioch, C. (2014). Endogenous and Rapid Serial Visual Presentation-induced Alpha Band Oscillations in the Attentional Blink. Journal of Cognitive Neuroscience, 26(7), 1454–1468. https://doi.org/10.1162/jocn_a_00551

      Keil, A., Ihssen, N., & Heim, S. (2006). Early cortical facilitation for emotionally arousing targets during the attentional blink. BMC Biology, 4(1), 23. https://doi.org/10.1186/1741-7007-4-23

      Kelly, S. P., Lalor, E. C., Reilly, R. B., & Foxe, J. J. (2006). Increases in Alpha Oscillatory Power Reflect an Active Retinotopic Mechanism for Distracter Suppression During Sustained Visuospatial Attention. Journal of Neurophysiology, 95(6), 3844–3851. https://doi.org/10.1152/jn.01234.2005

      Müller, M. M., & Hübner, R. (2002). Can the Spotlight of Attention Be Shaped Like a Doughnut? Evidence From Steady-State Visual Evoked Potentials. Psychological Science, 13(2), 119–124. https://doi.org/10.1111/1467-9280.00422

      Thut, G., Nietzel, A., Brandt, S., & Pascual-Leone, A. (2006). Alpha-band electroencephalographic activity over occipital cortex indexes visuospatial attention bias and predicts visual target detection. The Journal of Neuroscience : The Official Journal of the Society for Neuroscience, 26(37), 9494–9502. https://doi.org/10.1523/JNEUROSCI.0875-06.2006

      Worden, M. S., Foxe, J. J., Wang, N., & Simpson, G. V. (2000). Anticipatory Biasing of Visuospatial Attention Indexed by Retinotopically Specific α-Band Electroencephalography Increases over Occipital Cortex. Journal of Neuroscience, 20(6), RC63–RC63. https://doi.org/10.1523/JNEUROSCI.20-06-j0002.2000

      3) To support that it is the context-induced entrainment that leads to the modulation in AB effect, the authors could examine pre-T2 response, e.g., alpha-power, and cross-frequency coupling, as well as its relationship to behavioral performance. I think the pre-stimulus response might be more convincing to support the authors' claim.

      Many thanks for the insightful suggestion. We have conducted additional analyses.

      Following this suggestion, we examined the 10-Hz alpha power within the time window of -100 to 0 ms before T2 onset and found stronger activity for the between-cycle condition than for the within-cycle condition. This pre-T2 response is similar to the post-T2 response except that it is more restricted to the left parieto-occipital cluster (CP3, CP5, P3, P5, PO3, PO5, POZ, O1, OZ, t(15) = 2.774, p = .007), which partially overlaps with the cluster that exhibits a delta-alpha coupling effect significantly correlated with the BMI. We have incorporated these findings into the main text (page 12, line 315) and Fig. 5A of the revised manuscript.

      As for the coupling results reported in our manuscript, the coupling index (PAC) was calculated based on the activity during the second and third cycles (i.e., 400 to 1200 ms from stream onset) of the contextual rhythm, most of which covers the pre-T2 period as T2 always appeared in the third cycle for both conditions. Together, these results on pre-T2 10-Hz alpha power and cross-frequency coupling, as well as its relationship to behavioral performance, jointly suggest that the observed modulation effect is caused by the context-induced entrainment rather than being a by-product of post-T2 processing.

      4) About the entrainment to rhythmic context and its relation to behavioral modulation index. Previous studies (e.g., Ding et al) have demonstrated the hierarchical temporal structure in speech signals, e.g., emergence of word-level entrainment introduced by language experience. Therefore, it is well expected that imposing a second-order structure on a visual stream would elicit the corresponding steady-state response. I understand that the new part and main focus here are the AB effects. The authors should add more texts explaining how their findings contribute new understandings to the neural mechanism for the intriguing phenomena.

      Many thanks for the suggestion. We have provided more discussion in the revised manuscript (page 17, line 447).

      We have provided more discussion on this important issue in the revised manuscript (page 17, line 447). In brief, our study demonstrates how cortical tracking of feature-based hierarchical structure reframes the deployment of attentional resources over visual streams. This effect, distinct from the hierarchical entrainment to speech signals (Ding et al., 2016; Gross et al., 2013), does not rely on previously acquired knowledge about the structured information and can be established automatically even when the higher-order structure comes from a task-irrelevant and cross-modal contextual rhythm. On the other hand, our finding sheds fresh light on the adaptive value of the structure-based entrainment effect by expanding its role from rhythmic information (e.g., speech) perception to temporal attention deployment. To our knowledge, few studies have tackled this issue in visual or speech processing.

      References:

      Ding, N., Melloni, L., Zhang, H., Tian, X., & Poeppel, D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19(1), 158–164. https://doi.org/10.1038/nn.4186

      Gross, J., Hoogenboom, N., Thut, G., Schyns, P., Panzeri, S., Belin, P., & Garrod, S. (2013). Speech Rhythms and Multiplexed Oscillatory Sensory Coding in the Human Brain. PLoS Biol, 11(12). https://doi.org/10.1371/journal.pbio.1001752

      Reviewer #2 (Public Review):

      In cognitive neuroscience, a large number of studies proposed that neural entrainment, i.e., synchronization of neural activity and low-frequency external rhythms, is a key mechanism for temporal attention. In psychology and especially in vision, attentional blink is the most established paradigm to study temporal attention. Nevertheless, as far as I know, few studies try to link neural entrainment in the cognitive neuroscience literature with attentional blink in the psychology literature. The current study, however, bridges this gap.

      The study provides new evidence for the dynamic attending theory using the attentional blink paradigm. Furthermore, it is shown that neural entrainment to the sensory rhythm, measured by EEG, is related to the attentional blink effect. The authors also show that event/chunk boundaries are not enough to modulate the attentional blink effect, and suggest that strict rhythmicity is required to modulate attention in time.

      In general, I enjoyed reading the manuscript and only have a few relatively minor concerns.

      1) Details about EEG analysis.

      First, each epoch is from -600 ms before the stimulus onset to 1600 ms after the stimulus onset. Therefore, the epoch is 2200 ms in duration. However, zero-padding is needed to make the epoch duration 2000 ms (for 0.5-Hz resolution). This is confusing. Furthermore, for a more conservative analysis, I recommend also analyzing the response between 400 ms and 1600 ms, to avoid the onset response, and showing the results in a supplementary figure. The shorter duration reduces the frequency resolution but still allows seeing a 2.5-Hz response.

      Thanks for the comments. Each epoch was indeed segmented from -600 to 1600 ms relative to the stimulus onset, but in the spectrum analysis, we only used EEG signals from stream onset (i.e., time point 0) to 1600 ms (see the Materials and Methods section) to investigate the oscillatory characteristics of the neural responses purely elicited by rhythmic stimuli. The 1.6-s signals were zero-padded into a 2-s duration to achieve a frequency resolution of 0.5 Hz.
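
      The relation between the zero-padded length and the frequency resolution can be illustrated with a minimal sketch. The sampling rate (500 Hz) and the synthetic 2.5-Hz signal are assumptions for illustration; they are not values taken from the manuscript.

```python
import numpy as np

fs = 500                                # assumed sampling rate in Hz (hypothetical)
n_samples = int(round(1.6 * fs))        # 1.6-s window from stream onset
t = np.arange(n_samples) / fs
signal = np.sin(2 * np.pi * 2.5 * t)    # synthetic response at the 2.5-Hz context rhythm
signal = signal - signal.mean()         # demean over the entire stream

n_padded = int(round(2.0 * fs))         # zero-pad the 1.6-s signal to 2 s
spectrum = np.fft.rfft(signal, n=n_padded)
freqs = np.fft.rfftfreq(n_padded, d=1 / fs)

resolution = freqs[1] - freqs[0]        # bin spacing = 1 / (2 s) = 0.5 Hz
peak_freq = freqs[np.argmax(np.abs(spectrum))]
```

      With a 2-s padded length, the bins fall at multiples of 0.5 Hz, so 2.5, 5, and 10 Hz each coincide with a bin centre. Zero-padding refines the frequency grid without adding information beyond the 1.6 s of recorded signal.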

      According to the reviewer’s suggestion, we analyzed the EEG signals from 400 ms to 1600 ms relative to stream onset to avoid potential influence of the onset response, and showed the results in Figure 4. Basically, we can still observe spectral peaks at the stimulus frequencies of 2.5, 5 (the harmonic of 2.5 Hz), and 10 Hz for both power and ITPC spectrum. However, the peak magnitudes were much weaker than those of 1.6-s signals especially for 2.5 Hz, and the 2.5-Hz power did not survive the multiple comparisons correction across frequencies (FDR threshold of p < .05), which might be due to the relatively low signal-to-noise ratio for the analysis based on the 1.2-s epochs (only three cycles to estimate the activity at 2.5 Hz). Importantly, we did identify a significant cluster for 2.5 Hz ITPC in the left parieto-occipital region showing a positive correlation with the individuals’ BMI (Fig. R3; CP5, TP7, P5, P7, PO5, PO7, O1; r = .538, p = .016), which is consistent with the findings based on the longer epochs.

      Fig. R3. Neural entrainment to contextual rhythms during the period of 400–1600 ms from stream onset. (A) The spectrum for inter-trial phase coherence (ITPC) of EEG signals from 400 to 1600 ms after the stimulus onset. Shaded areas indicate standard errors of the mean. (B) The 2.5-Hz ITPC was significantly correlated with the behavioral modulation index (BMI) in a parieto-occipital cluster, as indicated by orange stars in the scalp topographic map.

      Second, "The preprocessed EEG signals were first corrected by subtracting the average activity of the entire stream for each epoch, and then averaged across trials for each condition, each participant, and each electrode." I have several concerns about this procedure.

      (A) What is the entire stream? It's the average over time?

      Yes, as for the power spectrum analysis, EEG signals were first demeaned by subtracting the average signals of the entire stream over time from onset to offset (i.e., from 0 to 1600 ms) before further analysis. We performed this procedure following previous studies on the entrainment to visual rhythms (Spaak et al., 2014). We have clarified this point in the “Power analysis” part of the Materials and Methods section (page 25, line 677).

      References:

      Spaak, E., Lange, F. P. de, & Jensen, O. (2014). Local Entrainment of Alpha Oscillations by Visual Stimuli Causes Cyclic Modulation of Perception. The Journal of Neuroscience, 34(10), 3536–3544. https://doi.org/10.1523/JNEUROSCI.4385-13.2014

      (B) I suggest to do the Fourier transform first and average the spectrum over participants and electrodes. Averaging the EEG waveforms require the assumption that all electrodes/participants have the same response phase, which is not necessarily true.

      Thanks for the suggestion. In an AB paradigm, the evoked neural responses are sufficiently time-locked to the periodic stimulation, so it is reasonable to quantify power estimate with spectral decomposition performed on trial-averaged EEG signals (i.e., evoked power). Moreover, our results of inter-trial phase coherence (ITPC), which estimated the phase-locking value across trials based on single-trial decomposed phase values, also provided supporting evidence that the EEG waveforms were temporally locked across trials to the 2.5-Hz temporal structure in the context session.

      Nevertheless, we also took the reviewer’s suggestion seriously and analyzed the power spectrum on the average of single-trial spectral transforms, i.e., the induced power, which puts emphasis on the intrinsic non-phase-locked activities. In line with the results of evoked power and ITPC, the induced power spectrum in the context session also peaked at 2.5 Hz and was significantly stronger than that in the baseline session at 2.5 Hz (t(15) = 4.186, p < .001, FDR-corrected with a p value threshold < .001). Importantly, Pearson correlation analysis also revealed a positive cluster in the left parieto-occipital region, indicating that the induced power at 2.5 Hz was also strongly related to the attentional modulation effect (P7, PO7, PO5, PO3; r = .606, p = .006). We have added these additional findings to the revised manuscript (page 11, line 288; see also Figure 4—figure supplement 1).
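
      The three spectral measures discussed in this response (power of the trial-averaged signal, the average of single-trial power spectra, and ITPC) can be illustrated with a minimal sketch on synthetic data. The sampling rate, trial count, noise level, and variable names are assumptions for illustration; this is not the authors' analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n_trials = 250, 40                  # assumed sampling rate and trial count
n = int(round(1.6 * fs))                # 1.6-s window from stream onset
t = np.arange(n) / fs

# Synthetic trials: a phase-locked 2.5-Hz component plus independent noise
trials = np.sin(2 * np.pi * 2.5 * t) + rng.normal(0.0, 1.0, (n_trials, n))

n_fft = int(round(2.0 * fs))            # zero-pad to 2 s for 0.5-Hz bins
freqs = np.fft.rfftfreq(n_fft, d=1 / fs)
spectra = np.fft.rfft(trials, n=n_fft, axis=1)   # one spectrum per trial

# Evoked power: spectrum of the trial-averaged waveform
evoked_power = np.abs(np.fft.rfft(trials.mean(axis=0), n=n_fft)) ** 2

# Average of single-trial power spectra (the quantity the response labels induced power)
single_trial_power = (np.abs(spectra) ** 2).mean(axis=0)

# ITPC: length of the mean unit phase vector across trials (0 = random, 1 = identical phase)
itpc = np.abs((spectra / np.abs(spectra)).mean(axis=0))

idx = np.argmin(np.abs(freqs - 2.5))    # bin closest to the 2.5-Hz context rhythm
```

      Averaging waveforms before the transform keeps only activity with a consistent phase across trials, whereas averaging single-trial power does not depend on phase alignment, which is the point raised by the reviewer.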

      2) The sequences are short, only containing 16 items and 4 cycles. Furthermore, the targets are presented in the 2nd or 3rd cycle. I suspect that a stronger effect may be observed if the sequences are longer, since attention may not entrain well to the external stimulus until a few cycles have passed. In the first trial of the experiment, the participants may not have a chance to realize that the task-irrelevant auditory/visual stimulus has a cyclic nature, and it is not likely that their attention will entrain to such cycles. As the experiment proceeds, they learn that the stimulus is cyclic and may allocate their attention rhythmically. Therefore, I feel that the participants do not just rely on the rhythmic information within a trial but also rely on the stimulus history. Please discuss why short sequences are used and whether it is possible to see buildup of the effect over trials or over cycles within a trial.

      Thanks for the comments. Typically, to induce a classic pattern of AB effect, the RSVP stream should contain 3–7 distractors before the first target (T1), with varying numbers of distractors (0–7) between the two targets and at least 2 items after the second target (T2). In our study, we created the RSVP streams following these rules, which allowed us to observe the typical AB effect, namely that T2 performance deteriorated at Lag 2 relative to that at Lag 8. Nevertheless, we agree with the reviewer that longer streams would be better for building up the attentional entrainment effect, as we did observe that the attentional modulation effect ramped up as the stream proceeded over cycles, consistent with the reviewer’s speculation. In Experiments 1a (using auditory context) and 2a (using color-defined visual context), we adopted two sets of target positions—an early one where T2 appeared at the 6th or 8th position (in the 2nd cycle) of the visual stream, and a late one where T2 appeared at the 10th or 12th position (in the 3rd cycle) of the visual stream. In the manuscript, we reported T2 performance with all the target positions combined, as no significant interaction was found between the target positions and the experimental conditions (ps > .1). However, additional analysis demonstrated a trend toward an increase of the attentional modulation effect over cycles, from the early to the late positions. As shown in Fig. R4, the modulation effect grew stronger and reached significance for the late positions (for Experiment 1a, t(15) = 2.83, p = .013, Cohen’s d = 0.707; for Experiment 2a, t(15) = 3.656, p = .002, Cohen’s d = 0.914) but showed only a weak, non-significant trend for the early positions (for Experiment 1a, t(15) = 1.049, p = .311, Cohen’s d = 0.262; for Experiment 2a, t(15) = 0.606, p = .553, Cohen’s d = 0.152).

      Fig. R4. Attentional modulation effect built up over cycles in Experiments 1a & 2a. Error bars represent 1 SEM; * p<0.05, ** p<0.01.
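
      The effect sizes reported for the early and late positions are numerically consistent with Cohen's d for a paired design computed as d = t / √n with n = 16 (implied by the 15 degrees of freedom). The response does not state which formula was used, so the following sketch is only a consistency check under that assumption; the function name is hypothetical.

```python
import math

def cohens_d_paired(t_value, n):
    """Cohen's d for a paired t-test, assuming d = t / sqrt(n)."""
    return t_value / math.sqrt(n)

n = 16  # participants, implied by df = 15

# (t value, reported d) pairs copied from the response text
reported = {
    "Exp 1a, late":  (2.83, 0.707),
    "Exp 2a, late":  (3.656, 0.914),
    "Exp 1a, early": (1.049, 0.262),
    "Exp 2a, early": (0.606, 0.152),
}

recomputed = {name: cohens_d_paired(t, n) for name, (t, _) in reported.items()}
```

      Each recomputed value agrees with the reported d to within rounding at the third decimal place.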

      However, we did not observe an obvious buildup effect across trials in our study. The modulation by contextual rhythms appears to emerge quickly: the effect was already evident in the first quarter of trials in Experiment 1a (t(15) = 2.703, p = .016, Cohen’s d = 0.676) and in the second quarter of trials in Experiment 2a (t(15) = 2.478, p = .026, Cohen’s d = 0.620).

      3) The term "cycle" is used without definition in Results. Please define and mention that it's an abstract term and does not require the stimulus to have "cycles".

      Thanks for the suggestion. By its definition, the term “cycle” refers to “an interval of time during which a sequence of a recurring succession of events or phenomena is completed” or “a course or series of events or operations that recur regularly and usually lead back to the starting point” (Merriam-Webster dictionary). In the current study, we retained the recurrent and regular nature of “cycle” in general while defining its specific meaning by the feature-based periodic changes of the contextual stimuli in each experiment (page 5, line 101; also refer to Procedures in the Materials and Methods section for details). For example, in Experiment 1a, the background tone sequence changed its pitch value from high to low or vice versa isochronously at a rate of 2.5 Hz, thus forming a rhythmic context with structure-based cycles of 400 ms. Note that we did not use the more general term “chunk”, because arbitrary chunks without the regularity of cycles are insufficient to trigger the attentional modulation effect in the current study. Indeed, the effect was eliminated when we replaced the rhythmic cycles with irregular chunks (Experiments 1d & 1e).

      4) Entrainment of attention is not necessarily related to neural entrainment to sensory stimulus, and there is considerable debate about whether neural entrainment to sensory stimulus should be called entrainment. Too much emphasis on terminology is of course counterproductive but a short discussion on these issues is probably necessary.

      Thanks for the comments. As commonly accepted, entrainment is defined as the alignment of intrinsic neuronal activity to the temporal structure of external rhythmic inputs (Lakatos et al., 2019; Obleser & Kayser, 2019). Here, we are interested in the functional roles of cortical entrainment to the higher-order temporal structure imposed on first-order sensory stimulation, and used the term entrainment to describe the phase-locked neural responses to such hierarchical structure, following the literature on auditory and visual perception (Brookshire et al., 2017; Doelling & Poeppel, 2015). In our study, the consistent results of power and ITPC have provided strong evidence that neural entrainment at the structure level (2.5 Hz) is significantly correlated with the observed attentional modulation effect. However, this does not mean that the entrainment of attention is necessarily associated with neural entrainment to sensory stimulus in a broader context, as attention may also be guided by predictions based on non-isochronous temporal regularity without requiring stimulus-based oscillatory entrainment (Breska & Deouell, 2017; Morillon et al., 2016).

      On the other hand, there has been a debate about whether the neural alignment to rhythmic stimulation reflects active entrainment of endogenous oscillatory processes (i.e., induced activity) or a series of passively evoked steady-state responses (Keitel et al., 2019; Notbohm et al., 2016; Zoefel et al., 2018). The latter process is also referred to as “entrainment in a broad sense” by Obleser & Kayser (2019). Given that a presented rhythm always evokes event-related potentials, a better question might be whether the observed alignment reflects the entrainment of endogenous oscillations in addition to evoked steady-state responses. Here we attempted to tackle this issue by measuring the induced power, which emphasizes the intrinsic non-phase-locked activity, in addition to the phase-locked evoked power. Specifically, we quantified these two kinds of activities with the average of single-trial EEG power spectra and the power spectra of trial-averaged EEG signals, respectively, according to Keitel et al. (2019). In addition to the observation of evoked responses to the contextual structure, we also demonstrated an attention-related neural tracking of the higher-order temporal structure based on the induced power at 2.5 Hz (see Figure 4—figure supplement 1), suggesting that the observed attentional modulation effect is at least partially derived from the entrainment of intrinsic oscillatory brain activity. We have briefly discussed this point in the revised manuscript (page 17, line 460).

      References:

      Breska, A., & Deouell, L. Y. (2017). Neural mechanisms of rhythm-based temporal prediction: Delta phase-locking reflects temporal predictability but not rhythmic entrainment. PLOS Biology, 15(2), e2001665. https://doi.org/10.1371/journal.pbio.2001665

      Brookshire, G., Lu, J., Nusbaum, H. C., Goldin-Meadow, S., & Casasanto, D. (2017). Visual cortex entrains to sign language. Proceedings of the National Academy of Sciences, 114(24), 6352–6357. https://doi.org/10.1073/pnas.1620350114

      Doelling, K. B., & Poeppel, D. (2015). Cortical entrainment to music and its modulation by expertise. Proceedings of the National Academy of Sciences, 112(45), E6233–E6242. https://doi.org/10.1073/pnas.1508431112

      Henry, M. J., Herrmann, B., & Obleser, J. (2014). Entrained neural oscillations in multiple frequency bands comodulate behavior. Proceedings of the National Academy of Sciences, 111(41), 14935–14940. https://doi.org/10.1073/pnas.1408741111

      Keitel, C., Keitel, A., Benwell, C. S. Y., Daube, C., Thut, G., & Gross, J. (2019). Stimulus-Driven Brain Rhythms within the Alpha Band: The Attentional-Modulation Conundrum. The Journal of Neuroscience, 39(16), 3119–3129. https://doi.org/10.1523/JNEUROSCI.1633-18.2019

      Lakatos, P., Gross, J., & Thut, G. (2019). A New Unifying Account of the Roles of Neuronal Entrainment. Current Biology, 29(18), R890–R905. https://doi.org/10.1016/j.cub.2019.07.075

      Morillon, B., Schroeder, C. E., Wyart, V., & Arnal, L. H. (2016). Temporal Prediction in lieu of Periodic Stimulation. Journal of Neuroscience, 36(8), 2342–2347. https://doi.org/10.1523/JNEUROSCI.0836-15.2016

      Notbohm, A., Kurths, J., & Herrmann, C. S. (2016). Modification of Brain Oscillations via Rhythmic Light Stimulation Provides Evidence for Entrainment but Not for Superposition of Event-Related Responses. Frontiers in Human Neuroscience, 10. https://doi.org/10.3389/fnhum.2016.00010

      Obleser, J., & Kayser, C. (2019). Neural Entrainment and Attentional Selection in the Listening Brain. Trends in Cognitive Sciences, 23(11), 913–926. https://doi.org/10.1016/j.tics.2019.08.004

      Zoefel, B., ten Oever, S., & Sack, A. T. (2018). The Involvement of Endogenous Neural Oscillations in the Processing of Rhythmic Input: More Than a Regular Repetition of Evoked Neural Responses. Frontiers in Neuroscience, 12. https://doi.org/10.3389/fnins.2018.00095

      Reviewer #3 (Public Review):

      The current experiment tests whether the attentional blink is affected by higher-order regularity based on rhythmic organization of contextual features (pitch, color, or motion). The results show that this is indeed the case: the AB effect is smaller when two targets appeared in two adjacent cycles (between-cycle condition) than within the same cycle defined by the background sounds. Experiment 2 shows that this also holds for temporal regularities in the visual domain and Experiment 3 for motion. Additional EEG analysis indicated that the findings obtained can be explained by cortical entrainment to the higher-order contextual structure. Critically feature-based structure of contextual rhythms at 2.5 Hz was correlated with the strength of the attentional modulation effect.

      This is an intriguing and exciting finding. It is a clever and innovative approach to reduce the attention blink by presenting a rhythmic higher-order regularity. It is convincing that this pulling out of the AB is driven by cortical entrainment. Overall, the paper is clear, well written and provides adequate control conditions. There is a lot to like about this paper. Yet, there are particular concerns that need to be addressed. Below I outline these concerns:

      1) The most pressing concern is the behavioral data. We have to ensure that we are dealing here with an attentional blink. The way the data is presented is not the typical way this is done. Typically in AB designs one sees the T2 performance when T1 is ignored relative to when T1 has to be detected. This data is not provided. I am not sure whether this data was collected, but if so the reader should see it.

      Many thanks for the suggestion. We appreciate the reviewer for his/her thoughtful comments. To demonstrate the AB effect, we did include two T2 lag conditions in our study (Experiments 1a, 1b, 2a, and 2b)—a short-SOA condition where T2 was located at the second lag of T1 (i.e., SOA = 200 ms), and a long-SOA condition where T2 appeared at the 8th lag of T1 (i.e., SOA = 800 ms). In a typical AB effect, T2 performance at short lags is remarkably impaired compared with that at long lags. In our study, we consistently replicated this effect across the experiments, as reported in the Results section of Experiment 1 (page 5, line 106). Overall, the T2 detection accuracy conditioned on correct T1 response was significantly impaired in the short-SOA condition relative to that in the long-SOA condition (mean accuracy > 0.9 for all experiments), during both the context session and the baseline session. More crucially, when looking into the magnitude of the AB effect as measured by (ACClong-SOA - ACCshort-SOA)/ACClong-SOA, we still obtained a significant attentional modulation effect (for Experiment 1a, t(15) = -2.729, p = .016, Cohen’s d = 0.682; for Experiment 2a, t(15) = -4.143, p <.001, Cohen’s d = 1.036) similar to that reflected by the short-SOA condition alone, further confirming that cortical entrainment effectively influences the AB effect.
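
      The normalized AB magnitude defined above, (ACC_long-SOA − ACC_short-SOA) / ACC_long-SOA, can be written as a short function. This is a minimal sketch; the function name and the example accuracies are hypothetical and are not values from the study.

```python
def ab_magnitude(acc_long_soa, acc_short_soa):
    """Relative drop in T2 accuracy at the short SOA, normalized by long-SOA accuracy."""
    if acc_long_soa <= 0:
        raise ValueError("long-SOA accuracy must be positive")
    return (acc_long_soa - acc_short_soa) / acc_long_soa

# Hypothetical participant: 95% correct at the long SOA, 70% at the short SOA
example = ab_magnitude(0.95, 0.70)   # about 0.26, i.e. a 26% relative drop
```

      Normalizing by the long-SOA accuracy expresses the blink as a proportion of each participant's own ceiling performance, so a smaller value in the between-cycle condition indicates a reduced AB effect.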

      Although we included both the long- and short-SOA conditions in the current study, we focused on T2 performance in the short-SOA condition rather than along the whole AB curve for the following reasons. Firstly, for the long-SOA conditions, the T2 performance is at ceiling level, making it an inappropriate baseline to probe the attentional modulation effect. We focused on Lag 2 because previous research has identified a robust AB effect around the second lag (Raymond et al., 1992), which provides a reasonable and sensitive baseline to probe the potential modulation effect of the contextual auditory and visual rhythms. Note that instead of using multiple lags, we varied the length of the rhythmic cycles (i.e., a cycle of 300 ms, 400 ms, and 500 ms corresponding to a rhythm frequency of 3.3 Hz, 2.5 Hz, and 2 Hz, respectively, all within the delta band), and showed that the attentional modulation effect could be generalized to these different delta-band rhythmic contexts, regardless of the absolute positions of the targets within the rhythmic cycles.

      As to the T1 performance, the overall accuracy was very high, ranging from 0.907 to 0.972, in all of our experiments. The corresponding results have been added to the Results section of the revised manuscript (page 5, line 103). Notably, we did not find T1-T2 trade-offs in most of our experiments, except in Experiment 2a where T1 performance showed a moderate decrease in the between-cycle condition relative to that in the within-cycle condition (mean ± SE: 0.888 ± 0.026 vs. 0.933 ± 0.016, respectively; t(15) = -2.217, p = .043). However, by examining the relationship between the modulation effects (i.e., the difference between the two experimental conditions) on T1 and T2, we did not find any significant correlation (p = .403), suggesting that the better performance for T2 was not simply due to the worse performance in detecting T1.

      Finally, previous studies have shown that ignoring T1 would lead to ceiling-level T2 performance (Raymond et al., 1992). Therefore, we did not include such manipulation in the current study, as in that case, it would be almost impossible for us to detect any contextual modulation effect.

      References:

      Raymond, J. E., Shapiro, K. L., & Arnell, K. M. (1992). Temporary suppression of visual processing in an RSVP task: An attentional blink? Journal of Experimental Psychology: Human Perception and Performance, 18(3), 849–860. https://doi.org/10.1037/0096-1523.18.3.849

      2) Also, there is only one lag tested. To ensure that we are dealing here with a true AB, I would like to see that more than one lag is tested. In the ideal situation a full AB curve should be presented that includes several lags. This should be done for at least one of the experiments. It would be informative as we can see how cortical entrainment affects the whole AB curve.

      Many thanks for the suggestion. Please refer to our response to the point #1 for “Reviewer #3 (Public Review)”. In short, we did include two T2 lag conditions in our study (Experiments 1a, 1b, 2a and 2b), and the results replicated the typical AB effect. We have clarified this point in the revised manuscript (page 5, line 106).

      3) Also, there is no data regarding T1 performance. It is important to show that the better performance for T2 is not due to worse performance in detecting T1. So please also provide this data.

      Many thanks for the suggestion. Please refer to our response to point #1 for “Reviewer #3 (Public Review)”. We have reported the T1 performance in the revised manuscript (page 5, line 103), and the results did not show obvious T1-T2 trade-offs.

      4) The authors identify the oscillatory characteristics of EEG signals in response to stimulus rhythms by examining the FFT spectral peaks after subtracting the mean power of the two nearest neighboring frequencies from the power at the stimulus frequency. I am not familiar with this procedure and would like to see some justification for using this technique.

      According to previous studies (Nozaradan et al., 2011; Lenc et al., 2018), subtracting the average amplitude of neighboring frequency bins can remove unrelated background noise, such as muscle activity or eye movements. If there were no EEG oscillatory responses characteristic of stimulus rhythms, the amplitude at a given frequency bin should be similar to the average of its neighbors, and thus no significant peaks could be observed in the subtracted spectrum.
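      The neighbor-subtraction step described above can be illustrated with a minimal sketch. It assumes an amplitude spectrum has already been computed (for example by an FFT); the function name, the `n_each_side` parameter, and the toy spectrum values are hypothetical and are not taken from the manuscript.

```python
from statistics import mean

def subtract_neighbors(spectrum, n_each_side=1):
    """Return spectrum[i] minus the mean of its nearest neighbors on each side.

    Edge bins that lack a full set of neighbors are returned as None.
    """
    spectrum = list(spectrum)
    corrected = [None] * len(spectrum)
    for i in range(n_each_side, len(spectrum) - n_each_side):
        neighbors = spectrum[i - n_each_side:i] + spectrum[i + 1:i + 1 + n_each_side]
        corrected[i] = spectrum[i] - mean(neighbors)
    return corrected

# Hypothetical amplitude spectrum: broadband background with a peak at bin 3.
toy_spectrum = [5.0, 4.0, 4.0, 9.0, 4.0, 3.0, 3.0]
corrected = subtract_neighbors(toy_spectrum)

# A flat spectrum (no stimulus-locked response) yields zeros after subtraction.
flat = subtract_neighbors([2.0, 2.0, 2.0, 2.0])
```

      In this toy case only the bin carrying the peak stays clearly positive after subtraction, whereas a flat background is reduced to zero, which is the logic behind the absence of peaks when no stimulus-related response is present.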

      References:

      Lenc, T., Keller, P. E., Varlet, M., & Nozaradan, S. (2018). Neural tracking of the musical beat is enhanced by low-frequency sounds. Proceedings of the National Academy of Sciences, 115(32), 8221–8226. https://doi.org/10.1073/pnas.1801421115

      Nozaradan, S., Peretz, I., Missal, M., & Mouraux, A. (2011). Tagging the Neuronal Entrainment to Beat and Meter. The Journal of Neuroscience, 31(28), 10234–10240. https://doi.org/10.1523/JNEUROSCI.0411-11.2011

    1. Author Response

      Summary:

      This work is of interest because it increases our understanding of the molecular mechanisms that distinguish subtypes of VIP interneurons in the cerebral cortex and because of the multiple ways in which the authors address the role of Prox1 in regulating synaptic function in these cells.

      The authors would like to thank the reviewers for their constructive comments. In response, we would like to clarify a number of issues, as well as outline how we plan to resolve major concerns.

      Reviewer #1:

      Stachniak and colleagues examine the physiological effects of removing the homeobox TF Prox1 from two subtypes of VIP neurons, defined on the basis of their bipolar vs. multipolar morphology.

      The results will be of interest to those in the field, since it is known from prior work that VIP interneurons are not a uniform class and that Prox1 is important for their development.

      The authors first show that selective removal of a conditional Prox1 allele using a VIP cre driver line results in a change in paired pulse ratio of presumptive excitatory synaptic responses in multipolar but not bipolar VIP interneurons. The authors then use RNA-seq to identify differentially expressed genes that might contribute and highlight a roughly two-fold reduction in the expression of a transcript encoding a trans-synaptic protein Elfn1 known to contribute to reduced glutamate release in Sst+ interneurons. They then test the potential contribution of Elfn1 to the phenotype by examining whether loss of one allele of Elfn1 globally alters facilitation. They find that facilitation is reduced both by this genetic manipulation and by a pharmacological blockade of presynaptic mGluRs known to interact with Elfn1.

      Although the results are interesting, and the authors have worked hard to make their case, the results are not definitive for several reasons:

      1) The global reduction of Elfn1 may act cell autonomously, or may have other actions in other cell types. The pharmacological manipulation is less subject to this interpretation, but these results are not as convincing as they could be because the multipolar Prox1 KO cells (Fig. 3 J) still show substantial facilitation comparable, for example to the multipolar control cells in the Elfn1 Het experiment (controls in Fig. 3E). This raises a concern about control for multiple comparisons. Instead of comparing the 6 conditions in Fig 3 with individual t-tests, it may be more appropriate to use ANOVA with posthoc tests controlled for multiple comparisons.

      The reviewer’s concerns regarding non-cell-autonomous actions of global Elfn1 KO are well founded. Significant phenotypic alterations have previously been reported, both in the physiology of SST neurons as well as in the animals’ behavior (Stachniak, Sylwestrak, Scheiffele, Hall, & Ghosh, 2019; Tomioka et al., 2014). The homozygous Elfn1 KO mouse displays a hyperactive phenotype and epileptic activity after 3 months of age, suggesting general cortical activity differences exist (Dolan & Mitchell, 2013; Tomioka et al., 2014). Nevertheless, we have not observed such changes in P17-21 Elfn1 heterozygous (Het) animals.

      Comparing across different experimental animal lines, for example the multipolar Prox1 KO cells (Fig. 3 J) to the multipolar control cells in the Elfn1 Het experiment (controls in Fig. 3E), is in our view not advisable. There is a plethora of examples in the literature on the effect of mouse strain on even the most basic cellular functions and hence it is always expected that researchers use the correct control animals for their experiments, which in the best case scenario are littermate controls. For these reasons, we would argue that statistical comparisons across mouse lines are not ideal for our study. Elfn1 Het and MSOP data are presented side by side to illustrate that Elfn1 Hets (3C,E) phenocopy the effects of Prox1 deletion (3G,H,I,J) (see also point 3). MSOP effect sizes, however, do show significant differences by ANOVA with Bonferroni post-hoc (normalized change in EPSC amplitude; multipolar Prox1 control: +12.1 ± 3.8%, KO: -8.4 ± 4.3%, bipolar Prox1 control: -5.2 ± 4.3%, KO: -3.4 ± 4.7%, cell type x genotype interaction, p = 0.02, two way ANOVA).

      2) The isolation of glutamatergic currents is not described. Were GABA antagonists present to block GABAergic currents? Especially with the Cs-based internal solutions used, chloride reversal potentials can be somewhat depolarized relative to the -65 mV holding potential. If IPSCs were included it would complicate the analysis.

      No, in fact GABA antagonists were not present in these experiments. The holding voltage in our evoked synaptic experiments is -70 mV, which combined with low internal [Cl-] makes it highly unlikely that the excitatory synaptic responses we study are contaminated by GABA-mediated ones, even with a Cs MeSO4-based solution. Nevertheless, we have now performed additional experiments where glutamate receptor blockers were applied in bath and we observe a complete blockade of the synaptic events at -70mV proving that they are AMPA/NMDA receptor mediated. When holding the cell at 0mV with these blockers present, outward currents were clearly visible, suggesting intact GABA-mediated events.

      3) The assumption that protein levels of Elfn1 are reduced to half in the het is untested. Synaptic proteins can be controlled at the level of translation and trafficking and WT may not have twice the level of this protein.

      We thank the reviewer for pointing this out. Our rationale for using the Elfn1 heterozygous animals is rather that transcript levels are reduced by half in heterozygous animals, to match the reduction we found in the mRNA levels of VIP Prox1 KO cells (Fig 2). The principal purpose of the Elfn1 KO experiment was to determine whether the change in Elfn1 transcript levels could be sufficient to explain the synaptic deficit observed in VIP Prox1 KO cells. As the reviewer notes, translational regulation and protein trafficking could ultimately result in even larger changes than 0.5x protein levels at the synapse. This may ultimately explain the observed multipolar/bipolar disparity, which cannot be explained by transcriptional regulation alone (Fig 4).

      4) The authors are to be commended for checking whether Elfn1 is regulated by Prox1 only in the multipolar neurons, but unfortunately it is not. The authors speculate that the selective effects reflect a selective distribution of mGluR7, but without additional evidence it is hard to know how likely this explanation is.

      Additional experiments are underway to better understand this mechanism.

      Reviewer #2:

      Stachniak et al., provide an interesting manuscript on the postnatal role of the critical transcription factor, Prox1, which has been shown to be important for many developmental aspects of CGE-derived interneurons. Using a combination of genetic mouse lines, electrophysiology, FACS + RNAseq and molecular imaging, the authors provide evidence that Prox1 is genetically upstream of Elfn1. Moreover, they go on to show that loss of Prox1 in VIP+ cells preferentially impacts those that are multipolar but not the bipolar subgroup characterized by the expression of calretinin. This latter finding is very interesting, as the field is still uncovering how these distinct subgroups emerge but are at a loss of good molecular tools to fully uncover these questions. Overall, this is a great combination of data that uses several different approaches to come to the conclusions presented. I have suggestions that I think would strengthen the manuscript:

      1) Can the authors add a supplemental table showing the top 20-30 genes up and down regulated in their Prox1 KOs? This would make these, and additional, data more tenable to readers.

      We would be happy to provide supplementary tables with candidate genes at both P8 and P12.

      2) It is interesting that loss of Prox1 or Elfn1 leads to phenotypes in multipolar but are not present or mild in bipolar VIP+ cells. The authors test different hypotheses, which they are able to refute and discuss some ideas for how multipolar cells may be more affected by loss of Elfn1, even when the transcript is lost in both multipolar and bipolar after Prox1 deletion. If there is any way to expand upon these ideas experimentally, I believe it would greatly strengthen the manuscript. I understand there is no perfect experiment due to a lack of tools and reagents but if there is a way to develop one of the following ideas or something similar, it would be beneficial:

      We thank the reviewer for the note.

      a) Would it be possible to co-fill VIPCre labeled cells with biocytin and a retroviral tracer? Then, after the retroviral tracer had time to label a presynaptic cell, assess whether these were preferentially different between bipolar and multipolar cell types, the latter morphology determined by the biocytin fill? This would test whether each VIP+ subtype is differentially targeted.

      Although this is a very elegant experiment and we would be excited to do it, we do feel that single-cell rabies virus tracing is technically very challenging and will take many months to troubleshoot before being able to acquire good data. Hence, we think it is beyond the scope of this study.

      b) Another biocytin possibility would be to trace filled VIP+ cells and assess whether the dendrites of multipolar and bipolar cells differentially targeted distinct cortical lamina and whether these lamina, in the same section or parallel, were enriched for mGluR7+ afferents.

      We thank the reviewer for their suggestion and we are planning on doing these kinds of experiments.

      Reviewer #3:

      In this work Stachniak and colleagues investigate the role of Prox1 on the development of VIP cells. Prox1 is expressed by the majority of GABAergic cells derived from the caudal ganglionic eminence (CGE), and as mentioned by the authors, Prox1 has been shown to be necessary for the differentiation, circuit integration, and maintenance of CGE-derived GABAergic cells. Here, Stachniak and colleagues show that removal of Prox1 in VIP cells leads to suppression of synaptic release probability onto cortical multipolar VIP cells in a mechanism dependent on Elfn1. This work is of interest for the field because it increases our understanding of differential synaptic maturation of VIP cells. The results are noteworthy, however the relevance of this manuscript would potentially be increased by addressing the following suggestions:

      1) Include histology to show when exactly Prox1 is removed from multipolar and bipolar VIP-expressing cells by using the VIP-Cre mouse driver.

      We can address this by performing an in-situ hybridization against Prox1 from P3 onwards (when Cre becomes active).

      2) Clarify if the statistical analysis is done using n (number of cells) or N (number of animals). The analysis between control and mutants (both Prox1 and Elfn1) need to be done across animals and not cells.

      Statistics for physiology were done across n (number of cells) while statistics for ISH are done across number of slices. We will clarify this point in the text and update the methods.

      Regarding the statistics for the ISH, these have been done across n (number of slices) for control versus KO tissue (N = 3 and N = 2 animals, respectively). We will add more animals to this analysis to compare by animal instead, although we do not expect any change in the results.

      Regarding the physiology, we would provide a two-pronged answer. First, we feel that averaging synaptic responses for each animal would hide a good deal of the biological variability in PPR present in different cells (response Fig 1), the characterization of which is integral to the central findings of the paper. Second, to perform the analysis requested by the reviewer, one would need to obtain recordings from ~10 animals per condition, which, to our knowledge, is not standard for in vitro electrophysiological recordings from single cells. For example, in these very recent studies that have performed in vitro electrophysiological recordings, all the statistics are performed using “n” number of cells and not the average of all the cells recorded per animal collapsed into a single data point. (Udakis, Pedrosa, Chamberlain, Clopath, & Mellor, 2020) https://www.nature.com/articles/s41467-020-18074-8

      (Horvath, Piazza, Monteggia, & Kavalali, 2020) https://elifesciences.org/articles/52852

      (Haas et al., 2018) https://elifesciences.org/articles/31755

      Nevertheless, we have now re-run the analysis grouping the cells and averaging the values we get per animal, since we have obtained our data from many animals. The results are more or less indistinguishable from the ones presented in the original submission, except for one p value that rose to 0.07 from 0.03 due to the lack of the required number of animals. We hope that the new plots and statistics presented herein address the concern put forward by the reviewer.

      Response Fig 1: A comparison of cell-wise versus animal-wise analysis of synaptic physiology. Some cell-to-cell variability is hidden, and the reduction in numbers impacts the p values.

      (A) PPR of multipolar Prox1 Control for 14 cells from 9 animals (n/N=14/9) under baseline conditions and with MSOP, cell-wise comparison p = 0.02, t = 2.74 and (B) animal-wise comparisons (p = 0.04, t stat = 2.45). Statistics: paired t-test.

      (C) PPR of multipolar Prox1 KO cells (n/N=9/8) under baseline conditions and with MSOP, cell-wise comparison p = 0.2, t = 1.33 and (D) animal-wise comparisons (p = 0.2, t stat = 1.56). Statistics: paired t-test. Comparisons for PPR of bipolar Prox1 Control (n/N=8/8) and KO cells (n/N=9/9) did not change.

      (E) PPR for Prox1 control (n/N=18/11) and KO (n/N=13/11) bipolar VIP cells, cell-wise comparison p = 0.3, t = 1.1 and (F) animal-wise comparisons (p = 0.4, t stat = 0.93). Statistics: t-test.

      (G) PPR of Elfn1 Control (n/N=12/4) and Het (n/N=12/4) bipolar VIP cells, cell-wise comparison p = 0.3, t = 1.06 and (H) animal-wise comparisons (p = 0.4, t stat = 0.93)

      (I) PPR of Prox1 control (n/N=33/18) and KO (n/N=19/14) multipolar VIP cells, cell-wise comparison p = 0.03, t = 2.17 and (J) animal-wise comparisons (p = 0.07, t stat = 1.99).

      (K) PPR of Elfn1 Control (n/N=14/6) and Het (n/N=20/8) multipolar VIP cells, cell-wise comparison p = 0.008, t = 2.84 and (L) animal-wise comparisons (p = 0.007, t stat = 3.23).
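      The difference between the cell-wise and animal-wise comparisons summarized in Response Fig 1 can be sketched as follows. This is only an illustration under stated assumptions: the animal identifiers and PPR values are invented, the helper names are hypothetical, and a pooled-variance Student's t statistic stands in for whichever test implementation was actually used.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance

def animal_means(records):
    """Collapse (animal_id, value) pairs into one mean value per animal."""
    by_animal = defaultdict(list)
    for animal_id, value in records:
        by_animal[animal_id].append(value)
    return [mean(values) for values in by_animal.values()]

def student_t(a, b):
    """Two-sample Student's t with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Invented PPR values, tagged with a hypothetical animal identifier.
control = [("c1", 1.8), ("c1", 2.0), ("c2", 1.6), ("c3", 1.9), ("c3", 2.1)]
ko = [("k1", 1.4), ("k1", 1.5), ("k2", 1.7), ("k3", 1.3)]

# Cell-wise: every recorded cell is one data point.
t_cell, df_cell = student_t([v for _, v in control], [v for _, v in ko])

# Animal-wise: cells are averaged within each animal first.
t_animal, df_animal = student_t(animal_means(control), animal_means(ko))
```

      Collapsing cells to per-animal means lowers the degrees of freedom (here from 7 to 4), so the same group difference is tested with fewer data points, and within-animal variability no longer appears in the comparison.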

      3) Clarify what are the parameters used to identify bipolar vs multipolar VIP cells. VIP cells comprise a wide variety of transcriptomic subtypes, and in the absence of using specific genetic markers for the different VIP subtypes, the authors should either include the reconstructions of all recorded cells or clarify if other methods were used.

      We thank the reviewer for this comment. The cell parameter criteria will be amended in the methods: “Cell type was classified as bipolar vs. multipolar based on cell body morphology (ovoid vs. round) and number and orientation of dendritic processes emanating from it (2 or 3 dendrites perpendicular to pia (for bipolar) vs. 3 or more processes in diverse orientations (for multipolar). In addition, the laminar localization of the two populations differs, with multipolar cells found primarily in the upper layer 2, while bipolar cells are found throughout layers 2 and 3. Initial determination of cell classification was made prior to patching fluorescent-labelled cells, but whenever possible this initial assessment was confirmed with post-hoc verification of biocytin filled cells.”

      References:

      Dolan, J., & Mitchell, K. J. (2013). Mutation of Elfn1 in Mice Causes Seizures and Hyperactivity. PLOS ONE, 8(11), e80491. https://doi.org/10.1371/journal.pone.0080491

      Haas, K. T., Compans, B., Letellier, M., Bartol, T. M., Grillo-Bosch, D., Sejnowski, T. J., … Hosy, E. (2018). Pre-post synaptic alignment through neuroligin-1 tunes synaptic transmission efficiency. ELife, 7, e31755. https://doi.org/10.7554/eLife.31755

      Horvath, P. M., Piazza, M. K., Monteggia, L. M., & Kavalali, E. T. (2020). Spontaneous and evoked neurotransmission are partially segregated at inhibitory synapses. ELife, 9, e52852. https://doi.org/10.7554/eLife.52852

      Stachniak, T. J., Sylwestrak, E. L., Scheiffele, P., Hall, B. J., & Ghosh, A. (2019). Elfn1-Induced Constitutive Activation of mGluR7 Determines Frequency-Dependent Recruitment of Somatostatin Interneurons. The Journal of Neuroscience, 39(23), 4461–4474. https://doi.org/10.1523/JNEUROSCI.2276-18.2019

      Tomioka, N. H., Yasuda, H., Miyamoto, H., Hatayama, M., Morimura, N., Matsumoto, Y., … Aruga, J. (2014). Elfn1 recruits presynaptic mGluR7 in trans and its loss results in seizures. Nature Communications. https://doi.org/10.1038/ncomms5501

      Udakis, M., Pedrosa, V., Chamberlain, S. E. L., Clopath, C., & Mellor, J. R. (2020). Interneuron-specific plasticity at parvalbumin and somatostatin inhibitory synapses onto CA1 pyramidal neurons shapes hippocampal output. Nature Communications, 11(1), 4395. https://doi.org/10.1038/s41467-020-18074-8

    1. Author Response

      Reviewer #1:

      The Lambowitz group has developed thermostable group II intron reverse transcriptases (TGIRTs) that strand switch and also have trans-lesion activity to provide a much wider view of RNA species analyzed by massively parallel RNA sequencing. In this manuscript they use several improvements to their methodology to identify RNA biotypes in human plasma pooled from several healthy individuals. Additionally, they implicate binding by proteins (RBPs) and nuclease-resistant structures to explain a fraction of the RNAs observed in plasma. Generally I find the study fascinating and argue that the collection of plasma RNAs described is an important tool for those interested in extracellular RNAs. I think the possibility that RNPs are protecting RNA fragments in circulation is exciting and fits with elegant studies of insects and plants where RNAs are protected by this mechanism and are transmitted between species.

      I have one major comment for the authors to consider. In my view the use of pooled plasma samples prevented the important opportunity to provide a glimpse on human variation in plasma RNA biotypes. This significantly limits the use of this information to begin addressing RNA biotypes as biomarkers. While I realize that data from multiple individuals represents a significant undertaking and may be beyond the scope of this manuscript, I urge the authors to do two things: (1) downplay the significance of the current study on the development of biomarkers in the current manuscript (e.g., in the abstract and discussion - e.g., "The ability of TGIRT-seq to simultaneously profile a wide variety of RNA biotypes in human plasma, including structured RNAs that are intractable to retroviral RTs, may be advantageous for identifying optimal combinations of coding and non-coding RNA biomarkers for human diseases."). (2) Carry out an analysis in multiple individuals - including racially diverse individuals - very important information will come of this - similar to C. Burge's important study in Nature ~2008 where it was clear that there is important individual variation in alternative splicing decisions - very likely genetically determined. This second suggestion could be added here or constitute a future manuscript.

      The identification of biomarkers in human plasma is an important application of this study, as was noted by reviewer 3 -- "Overall, this study provided a robust dataset and expanded picture of RNA biotypes one can detect in human plasma. This is valuable because the findings may have implications in biomarker identification in disease contexts." The present manuscript lays the foundation for such applications, which we have been carrying out in parallel. In one such study in collaboration with Dr. Naoto Ueno (MD Anderson), we used TGIRT-seq to identify combinations of mRNA and non-coding RNA biomarkers in FFPE-tumor slices, PBMCs and plasma from inflammatory breast cancer patients compared to non-IBC breast cancer patients and healthy controls (manuscript in preparation; data presented publicly in seminars), and in another, we explored the potential of using full-length excised intron (FLEXI) RNAs as biomarkers. In the latter study, we identified >8,000 FLEXI RNAs in different human cell lines and tissues and found that they are expressed in a cell-type specific manner, including hundreds of differences between matched tumor and healthy tissues from breast cancer patients and cell lines. A manuscript describing the latter findings was submitted for publication after this one and has been uploaded as a pertinent related manuscript. This new manuscript follows directly from the last sentence of the present manuscript and fully references the BioRxiv preprint currently under review for eLife.

      Reviewer #2:

      Yao et al used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) to study apheresis plasma samples. The first interesting discovery is that they had identified a number of mRNA reads with putative binding sites of RNA-binding proteins. A second interesting discovery from this work is the detection of full-length excised intron RNAs.

      I have the following comments:

      1) One doubt that I have is how representative is apheresis plasma when compared with plasma that one obtains through routine centrifugation of blood. The authors have reported the comparison of apheresis plasma versus a single male plasma in a previous publication. I think that to address this important question, a much increased number of samples would be necessary.

      Detailed comparison of plasma prepared by apheresis to that prepared by centrifugation would require a separate large-scale study, preferably by multiple laboratories using different methods to prepare plasma. However, our impression both from our findings and from the literature (Valbonesi et al. 2001, cited in the manuscript) is that apheresis-prepared plasma has very low levels of cellular contamination (required to meet clinical standards) compared to plasma prepared by centrifugation, even with protocols designed to minimize contamination from intact or broken cells (e.g., preparing plasma from freshly drawn blood, centrifugation into a Ficoll cushion to minimize cell breakage, and carefully avoiding contamination from sedimented cells).

      We do have additional information about the degree of variation in protein-coding gene transcripts detected by TGIRT-seq in plasma samples prepared by centrifugation from five healthy female controls in our collaborative study with Dr. Naoto Ueno (M.D. Anderson; see above), and we have added it to the manuscript citing a manuscript in preparation with permission from Dr. Ueno (p. 10, beginning line 6 from bottom) as follows:

      “The identities and relative abundances of different protein-coding gene transcripts in the apheresis-prepared plasma were broadly similar to those in the previous TGIRT analysis of plasma prepared by Ficoll-cushion sedimentation of blood from a healthy male individual (Qin et al., 2016) (r = 0.62-0.80; Figure 3C) and between high quality plasma samples similarly prepared from five healthy females in a collaborative study with Dr. Naoto Ueno, M.D. Anderson (r = 0.53-0.67; manuscript in preparation).” See Author Response Image below.

      2) For the important conclusion of the presence of binding sites of RNA-binding proteins in a proportion of apheresis plasma mRNA molecules, the authors need to explore whether there is any systemic difference in terms of mapping quality (i.e. mapping quality scores in alignment results) between RBP binding sites and non-RBP binding sites, so that any artifacts of peaks caused by the alignment issues occurring in RNA-seq analysis could be revealed and solved subsequently. Furthermore, it would be prudent to perform immunoprecipitation experiments to confirm this conclusion in at least a proportion of the mRNA.

      We have added a figure panel comparing MAPQ scores for reads from peaks containing RBP-binding sites to other long RNA reads (Figure 4–figure supplement 2A) and have added further details about the methods used to obtain peaks with high quality reads, including the following (p. 13, beginning line 3 from the bottom).

      “After further filtering to remove read alignments with MAPQ <30 (a cutoff that eliminates reads mapping equally well at more than one locus) or ≥5 mismatches from the mapped locus, we were left with 950 high confidence peaks ranging in size from 59 to 1,207 nt with ≥5 high quality read alignments at the peak maximum (Supplementary File).”
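      The read-level filter quoted above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Alignment` record, its field names, and the example reads are hypothetical, and in practice the MAPQ and mismatch counts would be read from the alignment files.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    read_id: str      # hypothetical read identifier
    mapq: int         # mapping quality reported by the aligner
    mismatches: int   # mismatches relative to the mapped locus

def is_high_quality(aln, min_mapq=30, max_mismatches=4):
    """Keep alignments with MAPQ >= 30 and fewer than 5 mismatches."""
    return aln.mapq >= min_mapq and aln.mismatches <= max_mismatches

# Invented example alignments covering each side of the two cutoffs.
alignments = [
    Alignment("read_a", mapq=42, mismatches=0),   # kept
    Alignment("read_b", mapq=29, mismatches=0),   # removed: MAPQ < 30
    Alignment("read_c", mapq=60, mismatches=5),   # removed: >= 5 mismatches
    Alignment("read_d", mapq=30, mismatches=4),   # kept: exactly at both cutoffs
]

kept = [aln.read_id for aln in alignments if is_high_quality(aln)]
```

      Both conditions must hold for a read to be retained, which mirrors the quoted rule that an alignment is removed if it has MAPQ <30 or ≥5 mismatches.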

      3) In Fig. 2D, one can observe that there are clearly more RNA reads in TGIRT-seq located in the 1st exon of ACTB, compared with SMART-seq. Is there any explanation? Will this signal be called as a peak (a potential RBP binding site) in the peak calling analysis (MACS2)? Is ACTB supposed to be bound by a certain RBP?

      The higher coverage of the ACTB 5'-exon in the TGIRT-seq datasets reflects in part the more uniform 5' to 3' coverage of mRNA sequences by TGIRT-seq compared to SMART-seq, which is biased for 3'-mRNA sequences that have poly(A) tails (current Figure 3F). The signal in the first exon of ACTB was in fact called as a peak by MACS2 (peak ID#893, Supplementary file), which overlapped an annotated binding site for SERBP1 (see Supplementary File).

      4) For Fig 2A, it would be informative for the comparison of RNA yield and RNA size profile among different protocols if the author also added the results of TGIRT-seq.

      Figure 3D (previously Figure 2A) shows a bioanalyzer trace of PCR amplified cDNAs obtained by SMART-Seq. These cDNAs correspond to 3' mRNA sequences that have poly(A) tails and are not comparable to the bioanalyzer profiles of plasma RNA (Figure 1–figure supplement 1) or read span distributions in the TGIRT-seq datasets (Figure 1B), which are dominated by sncRNAs. The coverage plots for protein-coding gene transcripts show that TGIRT-seq captures mRNA fragments irrespective of length that span the entire mRNA sequence, whereas SMART-seq is biased for 3' sequences linked to poly(A) (Figure 3F). We also note that coverage plots and mRNAs detected by TGIRT-seq remain similar, even if the plasma RNA is chemically fragmented prior to TGIRT-seq library construction (Figure 3F and Figure 3–figure supplement 2).

      5) As shown in Figure 4 C (the track of RBP binding sites), it seems quite pervasive in some gene regions. How many RBP binding sites from public eCLIP-seq results are used for overlapping peaks present in TGIRT-seq of plasma RNA? What percentage of plasma RNA reads have fallen within RBP binding sites? Are those peaks present in TGIRT-seq significantly enriched in RBP binding regions?

      Some of these points are addressed under Reviewer 1-comment #4. Additionally, we noted that 109 RBP-binding sites were searched in the original analysis, and we have now added further analyses for 150 RBPs currently available in ENCODE eCLIP datasets with and without irreproducible discovery rate (IDR) analysis (Figure 6 and Figure 6–figure supplement 1). We have also added a tab to the Supplementary File identifying the 109 and 150 RBPs whose binding sites were searched. The requested statistical analysis has been added in Figure 4–figure supplement 2C. The analysis shows that enrichment of RBP-binding site sequences in the 467 called peaks was statistically significant (p<0.001) (p. 14, para. 3, last sentence).

      6) Since there is a considerable portion of TGIRT-seq reads related to simple repeat, one possible reason is likely the high abundance of endogenous repeat-related RNA species in plasma. Nonetheless, have authors studied whether the ligation steps in TGIRT-seq have any biases (e.g. GC content) when analyzing human reference RNAs and spike ins (page 4, paragraph 2)?

      We have added a note to the manuscript indicating that although repeat RNAs constitute a high proportion of the called peaks, they do not constitute a similarly high proportion of the total RNA reads (Figure 1C; p. 18, para. 2, first sentence). The TGIRT-seq analysis of human reference RNAs and spike-ins showed that TGIRT-seq recapitulates the relative abundance of human transcripts and spike-ins comparably to non-strand-specific TruSeq v2 and better than strand-specific TruSeq v3 (Nottingham et al. RNA 2016). Subsequently, we used miRNA reference sets for detailed analysis of TGIRT-seq biases, including developing a computer algorithm for bias correction based on a random forest regression model that provides insight into different factors that contribute to these biases (Xu et al. Sci. Report. 2019). Overall GC content does not make a significant contribution to TGIRT-seq biases (Figure 9 of Xu et al. Sci. Report. 2019). Instead, biases in TGIRT-seq are largely confined to the first three nucleotides at the 5'-end (due to bias of the thermostable 5' App DNA ligase used for 5' RNA-seq adapter addition) and the 3' nucleotide (due to TGIRT-template switching). These end biases are not expected to significantly impact the quantitation of repeat RNAs.

      7) As described in Figure 2 legend, there are 0.25 million deduplicated reads for TGIRT-seq reads assigned to protein-coding genes transcripts which are far less than 2.18 million reads for SMART-seq. The authors need to discuss whether the current protocol of TGIRT-seq would cause potential dropouts in mRNA analysis, compared with SMART-seq?

      We have added the following to the manuscript (p. 11, para. 1, line 15).

      “The larger number of mRNA reads compared to TGIRT-seq (0.28 million) largely reflects that SMART-seq selectively profiles polyadenylated mRNAs, while TGIRT-seq profiles mRNAs together with other more abundant RNA biotypes. In addition, ultra low input SMART-Seq is not strand-specific, resulting in redundant sense and antisense strand reads (Figure 3–figure supplement 1).”

      The manuscript contains the following statement regarding potential drop outs (p. 11, para. 2, line 1).

      “A scatter plot comparing the relative abundance of transcripts originating from different genes showed that most of the polyadenylated mRNAs detected in DNase I-treated plasma RNA by ultra low input SMART-Seq were also detected by TGIRT-seq at similar TPM values when normalized for protein-coding gene reads (r=0.61), but with some, mostly lower abundance mRNAs undetected either by TGIRT-seq or SMART-Seq, and with SMART-seq unable to detect non-polyadenylated histone mRNAs, which are relatively abundant in plasma (Figure 3E and Figure 3–figure supplement 1).”

      8) While scientific thought-provoking, the practical implication of the current work is still unclear. The authors have suggested that their work might have applications for biomarker development. Is it possible to provide one experimental example in the manuscript?

      We addressed the relevance of the manuscript to biomarker identification and noted parallel studies that support this application in the response to Reviewer 1, comment 1. We have also modified the final paragraph of the Discussion (p. 30, para. 2).

      “The ability of TGIRT-seq to simultaneously profile a wide variety of RNA biotypes in human plasma, including structured RNAs that are intractable to retroviral RTs, may be advantageous for identifying optimal combinations of coding and non-coding RNA biomarkers that could then be incorporated in target RNA panels for diagnosis and routine monitoring of disease progression and response to treatment. The finding that some mRNA fragments persist in discrete called peaks suggests a strategy for identifying relatively stable mRNA regions that may be more reliably detected than other more labile regions in targeted liquid biopsies. Finally, we note that in addition to their biological and evolutionary interest, short full-length excised intron RNAs and intron RNA fragments, such as those identified here, may be uniquely well suited to serve as stable RNA biomarkers, whose expression is linked to that of numerous protein-coding genes.”

      Reviewer #3:

      In this work, Yao and colleagues described transcriptome profiling of human plasma from healthy individuals by TGIRT-seq. TGIRT is a thermostable group II intron reverse transcriptase that offers improved fidelity, processivity and strand-displacement activity, as compared to standard retroviral RT, so that it can read through highly structured regions. Similar analysis was performed previously (ref. 20), but this study incorporated several improvements in library preparation including optimization of template switching condition and modified adapters to reduce primer dimer and introduce UMI. In their analysis, the authors detected a variety of structural RNA biotypes, as well as reads from protein-coding mRNAs, although the latter is in low abundance. Compared to SMART-Seq, TGIRT-seq also achieved more uniform read coverage across gene bodies. One novel aspect of this study is the peak analysis of TGIRT-seq reads, which revealed ~900 peaks over background. The authors found that these peaks frequently overlap with RBP binding sites, while others tend to have stable predicted secondary structures, which explains why these regions are protected from degradation in plasma. Overall, this study provided a robust dataset and expanded picture of RNA biotypes one can detect in human plasma. This is valuable because the findings may have implications in biomarker identification in disease contexts. On the other hand, the manuscript, in the current form, is relatively descriptive, and can be improved with a clearer message of specific knowledge that can be extracted from the data.

      Specific points:

      1) Several aspects of bioinformatics analysis can be clarified in more detail. For example, it is unclear how sequencing errors in UMI affect their de-duplication procedure. This is important for their peak analysis, so it should be explained clearly.

      We have added details of the procedure used for de-duplication to the following paragraph in Materials and methods (p. 35, para. 2).

      “Deduplication of mapped reads was done by UMI, CIGAR string, and genome coordinates (Quinlan, 2014). To accommodate base-calling and PCR errors and non-templated nucleotides that may have been added to the 3' ends of cDNAs during TGIRT-seq library preparation, one mismatch in the UMI was allowed during deduplication, and fragments with the same CIGAR string, genomic coordinates (chromosome start and end positions), and UMI or UMIs that differed by one nucleotide were collapsed into a single fragment. The counts for each read were readjusted to overcome potential UMI saturation for highly-expressed genes by implementing the algorithm described in (Fu et al., 2011), using sequencing tools (https://github.com/wckdouglas/sequencing_tools).”
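
The collapsing rule described in this paragraph can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the sequencing tools repository cited above); the function names, the tuple layout, and the greedy "absorb into the most abundant UMI" heuristic are assumptions made for illustration, and the Fu et al. count readjustment is not included.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def deduplicate(fragments, max_mismatch=1):
    """Collapse fragments that share coordinates and CIGAR string and whose
    UMIs differ by at most `max_mismatch` nucleotides.

    fragments: iterable of (chrom, start, end, cigar, umi) tuples (hypothetical layout).
    Returns a list of (chrom, start, end, cigar, representative_umi, n_collapsed).
    """
    # Group UMIs by everything except the UMI itself.
    groups = defaultdict(list)
    for chrom, start, end, cigar, umi in fragments:
        groups[(chrom, start, end, cigar)].append(umi)

    collapsed = []
    for key, umis in groups.items():
        counts = defaultdict(int)
        for umi in umis:
            counts[umi] += 1
        # Visit UMIs from most to least abundant; each one either joins an
        # existing representative within the mismatch limit or starts a new one.
        representatives = []  # list of [umi, count]
        for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            for rep in representatives:
                if len(rep[0]) == len(umi) and hamming(rep[0], umi) <= max_mismatch:
                    rep[1] += n
                    break
            else:
                representatives.append([umi, n])
        for umi, n in representatives:
            collapsed.append((*key, umi, n))
    return collapsed


# Toy input: three reads at one locus with near-identical UMIs, one read with a
# distinct UMI, and one read at different coordinates.
example = [
    ("chr1", 100, 150, "50M", "ACGTAC"),
    ("chr1", 100, 150, "50M", "ACGTAC"),
    ("chr1", 100, 150, "50M", "ACGTAA"),
    ("chr1", 100, 150, "50M", "TTTTTT"),
    ("chr1", 100, 151, "51M", "ACGTAC"),
]
result = deduplicate(example)
```

In this toy input the first three reads collapse into one fragment, while the distinct UMI and the read with different coordinates each remain separate.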

      Also, it is not described how exon junction reads (when mapped to the genome) are handled in peak calling, although the authors did perform complementary analysis by mapping reads to the reference transcriptome.

      We have added this to the first sentence of the paragraph describing peak calling against the transcriptome reference (p. 16, line 4), which now reads as follows:

      "Peak calling against the human genome reference sequence might miss RBP-binding sites that are close to or overlap exon junctions, as such reads were treated by MACS2 as long reads that span the intervening intron."

      2) Overall, the authors provided convincing data that TGIRT-seq has advantages in detecting a wide range of RNA biotypes, especially structured RNAs, compared to other protocols, but these data are more confirmatory, rather than completely new findings (e.g., compared to ref. 20).

      As indicated in the response to Reviewer 1, comment 2, we modified the first paragraph of the Discussion to explicitly describe what is added by the present manuscript compared to Qin et al. RNA 2016 (p. 24, para. 2). Additionally, further analysis in response to the reviewers' comments resulted in the interesting finding that stress granule proteins comprised a high proportion of the RBPs whose binding sites were enriched in plasma RNAs (to our knowledge a completely new finding), consistent with a previously suggested link between RNP granules, EV packing, and RNA export (p. 16, last sentence; data shown in Figure 6 and Figure 6–figure supplement 1). This finding is also highlighted in the Discussion (p. 26, last sentence, continuing on p. 27).

      3) The peak analysis is more novel. The authors observed that 50% of peaks in long RNAs overlap with eCLIP peaks. However, there is no statistical analysis to show whether this overlap is significant or simply due to the pervasive distribution of eCLIP peaks. In fact, it was reported by the original authors that eCLIP peaks cover 20% of the transcriptome.

      We have added statistical analysis, which shows that the enrichment of RBP-binding sites in the 467 called peaks is statistically significant at p<0.001 (p. 14, para. 3, last sentence; Figure 4–Figure supplement 2C), as well as scatter plots identifying proteins whose binding sites were more highly represented in plasma than cellular RNAs or vice versa (p. 16, last two sentences; Figure 6 and Figure 6-figure supplement 1).
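
To make the logic of a simulation-based enrichment test concrete, the following sketch computes a one-sided Monte Carlo p-value. It is not the analysis reported in Figure 4–Figure supplement 2C: the null model (each of 467 peaks independently lands in an annotated site with probability 0.20, the transcriptome coverage quoted by the reviewer) and the observed overlap of 0.50 (also taken from the reviewer's comment) are simplifying assumptions, and all function names are hypothetical.

```python
import random


def empirical_p_value(observed, null_values):
    """One-sided Monte Carlo p-value, P(null >= observed), with the +1 correction."""
    exceed = sum(v >= observed for v in null_values)
    return (exceed + 1) / (len(null_values) + 1)


def simulate_null_overlap(n_peaks, covered_fraction, n_sim, rng):
    """For each simulation, the fraction of randomly placed peaks that fall in
    annotated binding sites, assuming independent placement."""
    return [
        sum(rng.random() < covered_fraction for _ in range(n_peaks)) / n_peaks
        for _ in range(n_sim)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
null = simulate_null_overlap(n_peaks=467, covered_fraction=0.20, n_sim=999, rng=rng)
p = empirical_p_value(observed=0.50, null_values=null)
```

With 999 simulations the smallest attainable p-value is 1/1000, so a result at that floor indicates only that no simulated peak set reached the observed overlap. A realistic null would need to respect peak lengths and gene structure rather than assume independent placement.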

      Similarly, the authors found that a high proportion of remaining peaks can fold into stable secondary structures, but this claim is not backed up by statistics either.

      First, near the beginning of the paragraph describing these findings, we added the following to provide a guide as to what can and can't be concluded by RNAfold (p. 17, line 6 from the bottom).

      "To evaluate whether these peaks contained RNAs that could potentially fold into stable secondary structures, we used RNAfold, a tool that is widely used for this purpose with the understanding that the predicted structures remain to be validated and could differ under physiological conditions or due to interactions with proteins."

      Second, at the end of the same paragraph, we have added the requested statistics (p. 18, para. 1, last sentence).

      "Subject to the caveats above regarding conclusions drawn from RNAfold, simulations using peaks randomly generated from long RNA gene sequences indicated that enrichment of RNAs with more stable secondary structures (lower MFEs) in the called RNA peaks was statistically significant (p≤0.019; Figure 4–figure supplement 2D)."

      4) Ranking of RBPs depends on the total number of RBP binding sites detected by eCLIP, which is determined by CLIP library complexity and sequencing depth. This issue should be at least discussed.

      We have added scatter plots in Figure 6 and Figure 6–figure supplement 1, which show that the relative abundance of different RBP-binding sites detected in plasma differs markedly from that for cellular RNAs in the eCLIP datasets (both for the 109 RBPs searched initially and for 150 RBPs with or without irreproducible discovery rate (IDR) analysis from the ENCODE web site). As mentioned in comments above, this analysis identified a number of RBP-binding sites that were substantially enriched in plasma RNAs compared to cellular RNAs or vice versa and led to what we think is the important new finding that plasma RNAs are enriched in binding sites for a number of stress granule proteins (Figure 6 and Figure 6–figure supplement 1). We thank the reviewers for this and related comments that led to this additional analysis.

      5) Enrichment of RBP binding sites and structured RNA in TGIRT-seq data is certainly consistent with one's expectation. However, the paper can be greatly improved if the authors can make a clearer case of what is new that can be learned, as compared to eCLIP data or other related techniques that purify and sequence RNA fragments crosslinked to proteins. What is the additional, independent evidence to show the predicted secondary structures are real?

      Compared to CLIP and related methods, peak calling enables more facile identification of candidate RBPs and putatively structured RNAs for further analysis and may be particularly useful for the vanishingly small amounts of RNA present in plasma and other bodily fluids. New findings resulting from peak calling in the present manuscript include that plasma RNAs are enriched in binding sites for stress granule proteins (see above) and the discovery of a variety of novel RNAs, including the full-length excised intron RNAs first identified here and subsequently studied in cellular RNAs in the Yao et al. pertinent submitted manuscript. We also note that peak calling enables the identification of protein-protected and structured mRNA regions that are relatively stable in plasma and may be more reliably detected in targeted liquid biopsy assays than are more labile mRNA regions (p. 17, para. 1, last sentence; and p. 30, para. 2, beginning on line 5).

      6) The authors should probably discuss how alignment errors can potentially affect detection of repetitive regions.

      In the Empirical Bayes method that we used for the analysis of repeats, repeat sequences were quantified by aggregate counts irrespective of the genomic locus to which they mapped (Materials and methods, p. 38, para. 2, line 5), which should not be affected by alignment errors.

      7) Many figures are IGV screenshots, which can be difficult to follow. Some of them can probably be summarized to deliver the message better.

      Some IGV-based figures are crucial for showing key features of the RNAs that are called as peaks (e.g., the predicted secondary structures of the full-length excised intron RNAs and intron RNA fragments). However, in the process of reformatting, we have switched in and added non-IGV main text figures including Figure 2 (microbiome analysis), Figure 3 (TGIRT-seq versus SMART-Seq), Figure 4 (repeats), and Figure 6 (new figure comparing relative abundance of RBP-binding sites in plasma versus cells).

    1. Author Response:

      Reviewer #1 (Public Review):

      Strengths:

      1) The model structure is appropriate for the scientific question.

      2) The paper addresses a critical feature of SARS-CoV-2 epidemiology which is its much higher prevalence in Hispanic or Latino and Black populations. In this sense, the paper has the potential to serve as a tool to enhance social justice.

      3) Generally speaking, the analysis supports the conclusions.

      Other considerations:

      1) The clean distinction between susceptibility and exposure models described in the paper is conceptually useful but is unlikely to capture reality. Rather, susceptibility to infection is likely to vary more by age whereas exposure is more likely to vary by ethnic group / race. While age cohorts are not explicitly distinguished in the model, the authors would do well to at least vary susceptibility across ethnic groups according to the different age cohort structure within these groups. This would allow a more precise estimate of the true effect of variability in exposures. Alternatively, this could be mentioned as a limitation of the current model.

      We agree that this would be an important extension for future work and have indicated this in the Discussion, along with the types of data necessary to fit such models:

      “Fourth, due to data availability, we have only considered variability in exposure due to one demographic characteristic; models should ideally strive to also account for the effects of age on susceptibility and exposure within strata of race and ethnicity and other relevant demographics, such as socioeconomic status and occupation \cite{Mulberry2021-tc}. These models could be fit using representative serological studies with detailed cross-tabulated seropositivity estimates.”

      2) I appreciated that the authors maintained an agnostic stance on the actual value of HIT (across the population & within ethnic groups) based on the results of their model. If there was available data, then it might be possible to arrive at a slightly more precise estimate by fitting the model to serial incidence data (particularly sorted by ethnic group) over time in NYC & Long Island. First, this would give some sense of R_effective. Second, if successive waves were modeled, then the shift in relative incidence & CI among these groups that is predicted in Figure 3 & Sup fig 8 may be observed in the actual data (this fits anecdotally with what I have seen in several states). Third, it may (or may not) be possible to estimate values of critical model parameters such as epsilon. It would be helpful to mention this as possible future work with the model.

      Caveats about the impossibility of truly measuring HIT would still apply (due to new variants, shifting use & effectiveness of NPIs, etc.). However, as is, the estimates of possible values for HIT are so wide as to make the underlying data used to train the model almost irrelevant. This makes the potential to leverage the model for policy decisions more limited.

      We have highlighted this important limitation in the Discussion:

      “Finally, we have estimated model parameters using a single cross-sectional serosurvey. To improve estimates and the ability to distinguish between model structures, future studies should use longitudinal serosurveys or case data stratified by race and ethnicity and corrected for underreporting; the challenge will be ensuring that such data are systematically collected and made publicly available, which has been a persistent barrier to research efforts \cite{Krieger2020-ss}. Addressing these data barriers will also be key for translating these and similar models into actionable policy proposals on vaccine distribution and non-pharmaceutical interventions.”

      3) I think the range of R0 in the figures should be extended to go as low as 1. Much of the pandemic in the US has been defined by local Re that varies between 0.8 & 1.2 (likely based on shifts in the degree of social distancing). I therefore think lower HIT thresholds should be considered and it would be nice to know how the extent of assortative mixing affects estimates at these lower R_e values.

      We agree this would be of interest and have extended the range of R0 values. Figure 1 has been updated accordingly (see below); we also updated the text with new findings: “After fitting the models across a range of $\epsilon$ values, we observed that as $\epsilon$ increases, HITs and epidemic final sizes shifted higher back towards the homogeneous case (Figure \ref{fig:model2}, Figure 1-figure supplement 4); this effect was less pronounced for $R_0$ values close to 1.”

      Figure 1: Incorporating assortativity in variable exposure models results in increased HITs across a range of $R_0$ values. Variable exposure models were fitted to NYC and Long Island serosurvey data.

      4) line 274: I feel like this point needs to be considered in much more detail, either with a thoughtful discussion or even with some simple additions to the model. How should these results make policy makers consider race and ethnicity when thinking about the key issues in the field right now such as vaccine allocation, masking, and new variants? I think to achieve the maximal impact, the authors should be very specific about how model results could impact policy making, and how we might lower the tragic discrepancies associated with COVID. If the model / data is insufficient for this purpose at this stage, then what type of data could be gathered that would allow more precise and targeted policy interventions?

      We have conducted additional analyses exploring the important suggestion by the reviewers that social distancing could affect these conclusions. The text and figures have been updated accordingly:

      “Finally, we assessed how robust these findings were to the impact of social distancing and other non-pharmaceutical interventions (NPIs). We modeled these mitigation measures by scaling the transmission rate by a factor $\alpha$ beginning when 5\% cumulative incidence in the population was reached. Setting the duration of distancing to be 50 days and allowing $\alpha$ to be either 0.3 or 0.6 (i.e. a 70\% or 40\% reduction in transmission rates, respectively), we assessed how the $R_0$ versus HIT and final epidemic size relationships changed. We found that the $R_0$ versus HIT relationship was similar to that in the unmitigated epidemic (Figure 1-figure supplement 5). In contrast, final epidemic sizes depended on the intensity of mitigation measures, though qualitative trends across models (e.g. increased assortativity leads to greater final sizes) remained true (Figure 1-figure supplement 6). To explore this further, we systematically varied $\alpha$ and the duration of NPIs while holding $R_0$ constant at 3. We found again that the HIT was consistent, whereas final epidemic sizes were substantially affected by the choice of mitigation parameters (Figure 1-figure supplement 7); the distribution of cumulative incidence at the point of HIT was also comparable with and without mitigation measures (Figure 2-figure supplement 8). The most stringent NPI intensities did not necessarily lead to the smallest epidemic final sizes, an idea which has been explored in studies analyzing optimal control measures \cite{Neuwirth2020-nb,Handel2007-ee}. Longitudinal changes in incidence rate ratios also were affected by NPIs, but qualitative trends in the ordering of racial and ethnic groups over time remained consistent (Figure 3-figure supplement 3).”
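
The mitigation scheme in this paragraph can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' model: it uses a single homogeneous group and SIR dynamics rather than the race- and ethnicity-structured SEIR model, and the recovery rate `gamma`, the initial infected fraction `i0`, the Euler step `dt`, and the function names are all assumptions chosen for the example.

```python
def homogeneous_hit(r0):
    """Classical herd immunity threshold for a homogeneously mixing population."""
    return 1.0 - 1.0 / r0


def final_size_with_npi(r0=3.0, alpha=0.3, npi_days=50.0, trigger=0.05,
                        gamma=0.2, dt=0.05, t_max=1000.0, i0=1e-4):
    """Euler-integrated single-group SIR model.

    The transmission rate is multiplied by `alpha` for `npi_days` days, starting
    when cumulative incidence first reaches `trigger`. Returns the cumulative
    incidence (final epidemic size) at `t_max`.
    """
    beta = r0 * gamma
    s, i = 1.0 - i0, i0
    npi_start = None
    t = 0.0
    while t < t_max:
        cumulative = 1.0 - s
        if npi_start is None and cumulative >= trigger:
            npi_start = t  # NPIs begin once the incidence trigger is crossed
        active = npi_start is not None and t < npi_start + npi_days
        b = beta * alpha if active else beta
        new_infections = b * s * i * dt
        s -= new_infections
        i += new_infections - gamma * i * dt
        t += dt
    return 1.0 - s


unmitigated = final_size_with_npi(alpha=1.0)  # alpha = 1 means no reduction
mitigated_03 = final_size_with_npi(alpha=0.3)
mitigated_06 = final_size_with_npi(alpha=0.6)
```

In this toy setting the threshold at $R_0 = 3$ is 2/3 and does not depend on $\alpha$, while the final size does, which mirrors the qualitative point made in the quoted text. The numerical values are not comparable to the fitted heterogeneous models.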

      Figure 1-figure supplement 6: Final epidemic sizes versus $R_0$ in variable exposure models with mitigation measures for $\alpha = 0.3$ (top) and $\alpha = 0.6$ (bottom). NPIs were initiated when cumulative incidence reached 5\% in all models and continued for 50 days. Models were fitted to NYC and Long Island serosurvey data.

      Figure 1-figure supplement 7: Sensitivity analysis on the impact of intensity and duration of NPIs on final epidemic sizes. HIT values for the same mitigation parameters were 46.4 $\pm$ 0.5\% (range). The smallest final size, corresponding to $\alpha = 0.6$ and duration = 100, was 51\%. Census-informed assortativity models were fit to Long Island seroprevalence data. NPIs were initiated when cumulative incidence reached 5\% in all models.

      See points 1 and 2 above for examples of additional data required.

      Minor issues:

      -This is subjective but I found the words "active" and "high activity" to describe increases in contacts per day to be confusing. I would just say more contacts per day. It might help to change "contacts" to "exposure contacts" to emphasize that not all contacts are high risk.

      To clarify this, we have replaced instances of “activity level” (and similar) with “total contact rate”, indicating the total number of contacts per unit time per individual; e.g. “The estimated total contact rate ratios indicate higher contacts for minority groups such as Hispanics or Latinos and non-Hispanic Black people, which is in line with studies using cell phone mobility data \cite{Chang2020-in}; however, the magnitudes of the ratios are substantially higher than we expected given the findings from those studies.”

      We have also clarified our definition of contacts: “We define contacts to be interactions between individuals that allow for transmission of SARS-CoV-2 with some non-zero probability.”

      -The abstract has too much jargon for a generalist journal. I would avoid words like "proportionate mixing" & "assortative" which are very unique to modeling of infectious diseases unless they are first defined in very basic language.

      We have revised the abstract to convey these same concepts in a more accessible manner: “A simple model where interactions occur proportionally to contact rates reduced the HIT, but more realistic models of preferential mixing within groups increased the threshold toward the value observed in homogeneous populations.”

      -I would cite some of the STD models which have used similar matrices to capture assortative mixing.

      We have added a reference in the assortative mixing section to a review of heterogeneous STD models: “Finally, under the \textit{assortative mixing} assumption, we extended this model by partitioning a fraction $\epsilon$ of contacts to be exclusively within-group and distributed the rest of the contacts according to proportionate mixing (with $\delta_{i,j}$ being an indicator variable that is 1 when $i=j$ and 0 otherwise) \cite{Hethcote1996-bf}:”
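
The equation that follows the quoted sentence is not reproduced in this response. As an illustration only, the sketch below builds a mixing matrix in a standard preferred-mixing form from the STD modeling literature, $c_{ij} = (1-\epsilon)\, a_j N_j / \sum_k a_k N_k + \epsilon\, \delta_{i,j}$, which may differ in notation or detail from the manuscript's equation. The symbols $a_j$ (total contact rate of group $j$) and $N_j$ (group size), the function name, and the example values are assumptions.

```python
def mixing_matrix(activity, population, epsilon):
    """Row i gives the fraction of group i's contacts made with each group j.

    A fraction `epsilon` of contacts is reserved for within-group mixing; the
    remainder is distributed by proportionate mixing, i.e. in proportion to
    each group's share of all contacts (activity * population).
    """
    total = sum(a * n for a, n in zip(activity, population))
    proportionate = [a * n / total for a, n in zip(activity, population)]
    k = len(activity)
    return [
        [(1 - epsilon) * proportionate[j] + (epsilon if i == j else 0.0)
         for j in range(k)]
        for i in range(k)
    ]


# Hypothetical three-group example: relative contact rates and group sizes.
activity = [1.0, 2.0, 1.5]
population = [600, 250, 150]
fully_proportionate = mixing_matrix(activity, population, epsilon=0.0)
partly_assortative = mixing_matrix(activity, population, epsilon=0.4)
fully_assortative = mixing_matrix(activity, population, epsilon=1.0)
```

Each row sums to 1 for any $\epsilon$ between 0 and 1; $\epsilon = 0$ recovers proportionate mixing and $\epsilon = 1$ gives fully within-group contacts.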

      -Lines 164-5: very good point but I would add that members of ethnic / racial groups are more likely to be essential workers and also to live in multigenerational houses

      We have added these helpful examples into the text: “Variable susceptibility to infection across racial and ethnic groups has been less well characterized, and observed disparities in infection rates can already be largely explained by differences in mobility and exposure \cite{Chang2020-in,Zelner2020-mb,Kissler2020-nh}, likely attributable to social factors such as structural racism that have put racial and ethnic minorities in disadvantaged positions (e.g., employment as frontline workers and residence in overcrowded, multigenerational homes) \cite{Henry_Akintobi2020-ld,Thakur2020-tw,Tai2020-ok,Khazanchi2020-xu}.”

      -Line 193: "Higher than expected" -> expected by who?

      We have clarified this phrase: “The estimated total contact rate ratios indicate higher exposure contacts for minority groups such as Hispanics or Latinos and non-Hispanic Black people, which is in line with studies using cell phone mobility data \cite{Chang2020-in}; however, the magnitudes of the ratios are substantially higher than we expected given the findings from those studies.”

      -A limitation that needs further mention is the fact that race & ethnic group, while important, could be subclassified into strata that inform risk even more (such as SES, job type, etc.)

      We agree and have added this to the Discussion: “Fourth, due to data availability, we have only considered variability in exposure due to one demographic characteristic; models should ideally strive to also account for the effects of age on susceptibility and exposure within strata of race and ethnicity and other relevant demographics, such as socioeconomic status and occupation \cite{Mulberry2021-tc}. These models could be fit using representative serological studies with detailed cross-tabulated seropositivity estimates.”

      Reviewer #2 (Public Review):

      Overall I think this is a solid and interesting piece that is an important contribution to the literature on COVID-19 disparities, even if it does have some limitations. To this point, most models of SARS-CoV-2 have not included the impact of residential and occupational segregation on differential group-specific COVID outcomes. So, the authors are to be commended on their rigorous and useful contribution on this valuable topic. I have a few specific questions and concerns, outlined below:

      We thank the reviewer for the supportive comments.

      1) Does the reliance on serosurvey data collected in public places imply a potential issue with left-censoring, i.e. by not capturing individuals who had died? Can the authors address how survival bias might impact their results? I imagine this could bring the seroprevalence among older people down in a way that could bias their transmission rate estimates.

      We have included this important point in the limitations section on potential serosurvey biases: “First, biases in the serosurvey sampling process can substantially affect downstream results; any conclusions drawn depend heavily on the degree to which serosurvey design and post-survey adjustments yield representative samples \cite{Clapham2020-rt}. For instance, because the serosurvey we relied on primarily sampled people at grocery stores, there is both survival bias (cumulative incidence estimates do not account for people who have died) and ascertainment bias (undersampling of at-risk populations that are more likely to self-isolate, such as the elderly) \cite{Rosenberg2020-qw,Accorsi2021-hx}. These biases could affect model estimates if, for instance, the capacity to self-isolate varies by race or ethnicity -- as suggested by associations of neighborhood-level mobility versus demographics \cite{Kishore2020-sy,Kissler2020-nh} -- leading to an overestimate of cumulative incidence and contact rates in whites.”

      2) It might be helpful to think in terms of disparities in HITs as well as disparities in contact rates, since the HIT of whites is necessarily dependent on that of Blacks. I'm not really disagreeing with the thrust of what their analysis suggests or even the factual interpretation of it. But I do think it is important to phrase some of the conclusions of the model in ways that are more directly relevant to health equity, i.e. how much infection/vaccination coverage does each group need for members of that group to benefit from indirect protection?

      We agree with this important point and indeed this was the goal, in part, of the analyses in Figure 2. We have added additional text to the Discussion highlighting this: “Projecting the epidemic forward indicated that the overall HIT was reached after cumulative incidence had increased disproportionately in minority groups, highlighting the fundamentally inequitable outcome of achieving herd immunity through infection. All of these factors underscore the fact that incorporating heterogeneity in models in a mechanism-free manner can conceal the disparities that underlie changes in epidemic final sizes and HITs. In particular, overall lower HIT and final sizes occur because certain groups suffer not only more infection than average, but more infection than under a homogeneous mixing model; incorporating heterogeneity lowers the HIT but increases it for the highest-risk groups (Figure \ref{fig:hitcomp}).”

      For vaccination, see our response to Reviewer #1 point 4.

      3) The authors rely on a modified interaction index parameterized directly from their data. It would be helpful if they could explain why they did not rely on any sources of mobility data. Are these just not broken down along the type of race/ethnicity categories that would be necessary to complete this analysis? Integrating some sort of external information on mobility would definitely strengthen the analysis.

      This is a great suggestion, but this type of data has generally not been available due to privacy concerns from disaggregating mobility data by race and ethnicity (Kishore et al., 2020). Instead, we modeled NPIs as mentioned in Reviewer #1 point 4, with the caveat that reduction in mobility was assumed to be identical across groups. We added this into the text explicitly as a limitation: “Third, we have assumed the impact of non-pharmaceutical interventions such as stay-at-home policies, closures, and the like to equally affect racial and ethnic groups. Empirical evidence suggests that during periods of lockdown, certain neighborhoods that are disproportionately wealthy and white tend to show greater declines in mobility than others \cite{Kishore2020-sy,Kissler2020-nh}. These simplifying assumptions were made to aid in illustrating the key findings of this model, but for more detailed predictive models, the extent to which activity level differences change could be evaluated using longitudinal contact survey data \cite{Feehan2020-ta}, since granular mobility data are typically not stratified by race and ethnicity due to privacy concerns \cite{Kishore2020-mg}.”

      Reviewer #3 (Public Review):

      Ma et al investigate the effect of racial and ethnic differences in SARS-CoV-2 infection risk on the herd immunity threshold of each group. Using New York City and Long Island as model settings, they construct a race/ethnicity-structured SEIR model. Differential risk between racial and ethnic groups was parameterized by fitting each model to local seroprevalence data stratified demographically. The authors find that when herd immunity is reached, cumulative incidence varies by more than two fold between ethnic groups, at approximately 75% of Hispanics or Latinos and only 30% of non-Hispanic Whites.

      This result was robust to changing assumptions about the source of racial and ethnic disparities. The authors considered differences in disease susceptibility, exposure levels, as well as a census-driven model of assortative mixing. These results show the fundamentally inequitable outcome of achieving herd immunity in an unmitigated epidemic.

      The authors have only considered an unmitigated epidemic, without any social distancing, quarantine, masking, or vaccination. If herd immunity is achieved via one of these methods, particularly vaccination, the disparities may be mitigated somewhat but still exist. This will be an important question for epidemiologists and public health officials to consider throughout the vaccine rollout.

      We thank the reviewer for the detailed and helpful summary and suggestions.

    1. Author Response

      Summary: A major tenet of plant pathogen effector biology has been that effectors from very different pathogens converge on a small number of host targets with central roles in plant immunity. The current work reports that effectors from two very different pathogens, an insect and an oomycete, interact with the same plant protein, SIZ1, previously shown to have a role in plant immunity. Unfortunately, apart from some technical concerns regarding the strength of the data that the effectors and SIZ1 interact in plants, a major limitation of the work is that it is not demonstrated that the effectors alter SIZ1 activity in a meaningful way, nor that SIZ1 is specifically required for action of the effects.

      We thank the editor and reviewers for their time to review our manuscript and their helpful and constructive comments. The reviews have helped us focus our attention on additional experiments to test the hypothesis that effectors Mp64 (from an aphid) and CRN83-152 (from an oomycete) indeed alter SIZ1 activity or function. We have revised our manuscript and added the following data:

      1) Mp64, but not CRN83-152, stabilizes SIZ1 in planta. (Figure 1 in the revised manuscript).

      2) AtSIZ1 ectopic expression in Nicotiana benthamiana triggers cell death from 3-4 days after agroinfiltration. Interestingly CRN83-152_6D10 (a mutant of CRN83-152 that has no cell death activity), but not Mp64, enhances the cell death triggered by AtSIZ1 (Figure 2 in the revised manuscript).

      For 1) we have added the following panel to Figure 1 as well as three biological replicates of the stabilisation assays in the Supplementary data (Fig S3):

      Figure 1 panel C. Stabilisation of SIZ1 by Mp64. Western blot analyses of protein extracts from agroinfiltrated leaves expressing combinations of GFP-GUS, GFP-Mp64 and GFP-CRN83_152_6D10 with AtSIZ1-myc or NbSIZ1-myc. Protein size markers are indicated in kD, and equal protein amounts upon transfer are shown by Ponceau staining (PS) of membranes. The blot is representative of three biological replicates, which are all shown in supplementary Fig. S3. The selected panels shown here are cropped from Rep 1 in supplementary Fig. S3.

      For 2) we have added the following new figure (Fig. 2 in the revised manuscript):

      Fig. 2. SIZ1-triggered cell death in N. benthamiana is enhanced by CRN83_152_6D10 but not Mp64. (A) Scoring overview of infiltration sites for SIZ1-triggered cell death. Infiltration sites were scored for no symptoms (score 0), chlorosis with localized cell death (score 1), less than 50% of the site showing visible cell death (score 2), more than 50% of the site showing cell death (score 3). (B) Bar graph showing the proportions of infiltration sites showing different levels of cell death upon expression of AtSIZ1, NbSIZ1 (both with a C-terminal RFP tag) and an RFP control. Graph represents data from a combination of 3 biological replicates of 11-12 infiltration sites per experiment (n=35). (C) Bar graph showing the proportions of infiltration sites showing different levels of cell death upon expression of SIZ1 (with C-terminal RFP tag) either alone or in combination with aphid effector Mp64 or Phytophthora capsici effector CRN83_152_6D10 (both effectors with GFP tag), or a GFP control. Graph represents data from a combination of 3 biological replicates of 11-12 infiltration sites per experiment (n=35).

      Our new data provide further evidence that SIZ1 function is affected by effectors Mp64 (aphid) and CRN83-152 (oomycete), and that SIZ1 likely is a vital virulence target. Our latest results also provide further support for distinct effector activities towards SIZ1 and its variants in other species. SIZ1 is a key immune regulator to biotic stresses (aphids, oomycetes, bacteria and nematodes), on which distinct virulence strategies seem to converge. The mechanism(s) underlying the stabilisation of SIZ1 by Mp64 is yet unclear. However, we hypothesize that increased stability of SIZ1, which functions as an E3 SUMO ligase, leads to increased SUMOylation activity towards its substrates. We surmise that SIZ1 complex formation with other key regulators of plant immunity may underpin these changes. Whether the cell death, triggered by AtSIZ1 upon transient expression in Nicotiana benthamiana, is linked to E3 SUMO ligase activity remains to be investigated. Expression of AtSIZ1 in a plant species other than Arabidopsis may lead to mistargeting of substrates, and subsequent activation of cell death. Dissecting the mechanistic basis of SIZ1 targeting by distinct pathogens and pests will be an important next step in addressing these hypotheses towards understanding plant immunity.

      Reviewer #1:

      In this manuscript, the authors suggest that SIZ1, an E3 SUMO ligase, is the target of both an aphid effector (Mp64 from M. persicae) and an oomycete effector (CRN83_152 from Phytophthora capsici), based on interaction between SIZ1 and the two effectors in yeast, co-IP from plant cells and colocalization in the nucleus of plant cells. To support their proposal, the authors investigate the effects of SIZ1 inactivation on resistance to aphids and oomycetes in Arabidopsis and N. benthamiana. Surprisingly, resistance is enhanced, which would suggest that the two effectors increase SIZ1 activity.

      Unfortunately, not only do we not learn how the effectors might alter SIZ1 activity, there is also no formal demonstration that the effects of the effectors are mediated by SIZ1, such as investigating the effects of Mp64 overexpression in a siz1 mutant. We note, however, that even this experiment might not be entirely conclusive, since SIZ1 is known to regulate many processes, including immunity. Specifically, siz1 mutants present an autoimmune phenotype, and general activation of immunity might be sufficient to attenuate the enhanced aphid susceptibility seen in Mp64 overexpressers.

      To demonstrate unambiguously that SIZ1 is a bona fide target of Mp64 and CRN83_152 would require assays that demonstrate either enhanced SIZ1 accumulation or altered SIZ1 activity in the presence of Mp64 and CRN83_152.

      The enhanced resistance upon knock-down/out of SIZ1 suggests pathogen and pest susceptibility requires SIZ1. We hypothesize that the effectors either enhance SIZ1 activity or that the effectors alter SIZ1 specificity towards substrates rather than enzyme activity itself. To investigate how effectors coopt SIZ1 function would require a comprehensive set of approaches and will be part of our future work. While we agree that this aspect requires further investigation, we think the proposed experiments go beyond the scope of this study.

      After receiving reviewer comments, including on the quality of Figure 1, which shows western blots of co-immunoprecipitation experiments, we re-analyzed independent replicates of effector-SIZ1 coexpression/ co-immunoprecipitation experiments. The reviewer rightly pointed out that in the presence of Mp64, SIZ1 protein levels increase when compared to samples in which either the vector control or CRN83-152_6D10 are co-infiltrated. Through carefully designed experiments, we can now affirm that Mp64 co-expression leads to increased SIZ1 protein levels (Figure 1C and Supplementary Figure S3, revised manuscript). Our results offer both an explanation of different SIZ1 levels in the input samples (original submission, Figure 1A/B) as well as tantalizing new clues to the nature of distinct effector activities.

      Besides, we were able to confirm a previous preliminary finding not included in the original submission that ectopic expression of AtSIZ1 in Nicotiana benthamiana triggers cell death (3/4 days after infiltration) and that CRN83-152_6D10 (which itself does not trigger cell death) enhances this phenotype.

      We have considered overexpression of Mp64 in the siz1 mutant, but share the view that the outcome of such experiments will be far from conclusive.

      In summary, we have added new data that further support that SIZ1 is a bona fide target of Mp64 and CRN83-152 (i.e. increased accumulation of SIZ1 in the presence of Mp64, and enhanced SIZ1 cell death activation in the presence of CRN83-152_6D10).

      Reviewer #2:

      The study provides evidence that an aphid effector Mp64 and a Phytophthora capsici effector CRN83_152 can both interact with the SIZ1 E3 SUMO-ligase. The authors further show that overexpression of Mp64 in Arabidopsis can enhance susceptibility to aphids and that a loss-of-function mutation in Arabidopsis SIZ1 or silencing of SIZ1 in N. benthamiana plants lead to increased resistance to aphids and P. capsici. On siz1 plants the aphids show altered feeding patterns on phloem, suggestive of increased phloem resistance. While the finding is potentially interesting, the experiments are preliminary and the main conclusions are not supported by the data.

      Specific comments:

      The suggestion that SIZ1 is a virulence target is an overstatement. Preferable would be knockouts of effector genes in the aphid or oomycete, but even with transgenic overexpression approaches, there are no direct data that the biological function of the effectors requires SIZ1. For example, is SIZ1 required for the enhanced susceptibility to aphid infestation seen when Mp64 is overexpressed? Or does overexpression of SIZ1 enhance Mp64-mediated susceptibility?

      What do the effectors do to SIZ1? Do they alter SUMO-ligase activity? Or are perhaps the effectors SUMOylated by SIZ1, changing effector activity?

      We agree that having effector gene knock-outs in aphids and oomycetes would be ideal for dissecting effector mediated targeting of SIZ1. Unfortunately, there is no gene knock-out system established in Myzus persicae (our aphid of interest), and CAS9 mediated knock-out of genes in Phytophthora capsici has not been successful in our lab as yet, despite published reports. Moreover, repeated attempts to silence Mp64, other effector and non-effector coding genes, in aphids (both in planta and in vitro) have not been successful thus far, in our hands. As detailed in our response to Reviewer 1, we considered the use of transgenic approaches not appropriate as data interpretation would become muddied by the strong immunity phenotype seen in the siz1-2 mutant.

      As stated before, we hypothesize that the effectors either enhance SIZ1 activity or alter SIZ1 substrate specificity. Mp64-induced accumulation of SIZ1 could form the basis of an increase in overall SIZ1 activity. This hypothesis, however, requires testing. The same applies to the enhanced SIZ1 cell death activation in the presence of CRN83-152_6D10.

      Whilst our new data support our hypothesis that effectors Mp64 and CRN83-152 affect SIZ1 function, determining exactly how these effectors trigger susceptibility requires significant work. Given the substantial effort needed and the research questions involved, we argue that findings emanating from such experiments warrant standalone publication.

      While stable transgenic Mp64 overexpressing lines in Arabidopsis showed increased susceptibility to aphids, transient overexpression of Mp64 in N. benthamiana plants did not affect P. capsici susceptibility. The authors conclude that while the aphid and P. capsici effectors both target SIZ1, their activities are distinct. However, not only is it difficult to compare transient expression experiments in N. benthamiana with stable transgenic Arabidopsis plants, but without knowing whether Mp64 has the same effects on SIZ1 in both systems, to claim a difference in activities remains speculative.

      We agree that we cannot compare effector activities between different plant species. We carefully considered every statement regarding results obtained on SIZ1 in Arabidopsis and Nicotiana benthamiana. We can, however, compare activities of the two effectors when expressed side by side in the same plant species. In our original submission, we show that expression of CRN83-152 but not Mp64 in Nicotiana benthamiana enhances susceptibility to Phytophthora capsici. In our revised manuscript, we present new data showing distinct effector activities towards SIZ1 with regards to 1) enhanced SIZ1 stability and 2) enhanced SIZ1-triggered cell death. These findings raise questions as to how enhanced SIZ1 stability and cell death activation are relevant to immunity. We aim to address these critical questions by investigating the mechanistic basis of effector-SIZ1 interactions.

      The authors emphasize that the increased resistance to aphids and P. capsici in siz1 mutants or SIZ1 silenced plants are independent of SA. This seems to contradict the evidence from the NahG experiments. In Fig. 5B, the effects of siz1 are suppressed by NahG, indicating that the resistance seen in siz1 plants is completely dependent on SA. In Fig 5A, the effects of siz1 are not completely suppressed by NahG, but greatly attenuated. It has been shown before that SIZ1 acts only partly through SNC1, and the results from the double mutant analyses might simply indicate redundancy, also for the combinations with eds1 and pad4 mutants.

      We emphasized that siz1-2 increased resistance to aphids is independent of SA, which is supported by our data (Figure 5A). Still, we did not conclude that the same applies to increased resistance to Phytophthora capsici (Figure 5B). In contrast, the siz1-2 enhanced resistance to P. capsici appears entirely dependent on SA levels, with the level of infection on the siz1-2/NahG mutants even slightly higher than on the NahG line and Col-0 plants. We exercise caution in the interpretation of this data given the significant impact SA signalling appears to have on Phytophthora capsici infection.

      The reviewer commented on the potential for functional redundancy in the siz1-2 double mutants. Unfortunately, we are unsure what redundancy s/he is referring to. SNC1, EDS1, and PAD4 all are components required for immunity, and their removal from the immune signalling network (using the mutations in the lines we used here) impairs immunity to various plant pathogens. The siz1-2 snc1-11, siz1-2 eds1-2, and siz1-2 pad4-1 double mutants have similar levels of susceptibility to the bacterial pathogen Pseudomonas syringae when compared to the corresponding snc1-11, eds1-2 and pad4-1 controls (at 22 °C). These previous observations indicate that siz1 enhanced resistance is dependent on these signalling components (Hammoudi et al., 2018, PLoS Genetics).

      In contrast to this, we observed a strong siz1 enhanced resistance phenotype in the absence of snc1-11, eds1-2 and pad4-1. Notably, the siz1-2 snc1-11 mutant does not appear immuno-compromised when compared to siz1-2 in fecundity assays, indicating that the siz1-2 phenotype is independent of SNC1. In our view, these data suggest that signalling components/pathways other than those mediated by SNC1, EDS1, and PAD4 are involved. We consider this to be an exciting finding as our data point to an as yet unknown SIZ1-dependent signalling pathway that governs immunity to aphids.

      How do NahG or Mp64 overexpression affect aphid phloem ingestion? Is it the opposite of the behavior on siz1 mutants?

      We have not performed further EPG experiments on additional transgenic lines used in the aphid assay. These experiments are quite challenging and time consuming. Moreover, accommodating an experimental set-up that allows us to compare multiple lines at the same time is not straightforward. Considering that NahG did not affect aphid performance (Figure 5A), we do not expect to see an effect on phloem ingestion.

    1. Author Response

      1) Please comment on why many of the June samples failed to provide sufficient sequence information, especially since not all of them had low yields (supp table 2 and supp figure 5).

      An extended paragraph about experimental intricacies of our study has been added to the Discussion. It has also been slightly restructured to give a better and wider overview of how future freshwater monitoring studies using nanopore sequencing can be improved (page 18, lines 343-359).

      We wish to highlight that all three MinION sequencing runs here analysed feature substantially higher data throughput than that of any other recent environmental 16S rRNA sequencing study with nanopore technology, as recently reviewed by Latorre-Pérez et al. (Biology Methods and Protocols 2020, doi:10.1093/biomethods/bpaa016). One of this work's sequencing runs has resulted in lower read numbers for water samples collected in June 2018 (~0.7 Million), in comparison to the ones collected in April and August 2018 (~2.1 and ~5.5 Million, respectively). While log-scale variabilities between MinION flow cell throughput have been widely reported for both 16S and shotgun metagenomics approaches (e.g. see Latorre-Pérez et al.), the count of barcode-specific 16S reads is nevertheless expected to be correlated with the barcode-specific amount of input DNA within a given sequencing run. As displayed in Supplementary Figure 7b, we see a positive, possibly logarithmic trend between the DNA concentration after 16S rDNA amplification and number of reads obtained. With few exceptions (April-6, April-9.1 and April-9.2), we find that sample pooling with original 16S rDNA concentrations of ≳4 ng/µl also results in the surpassing of the here-set (conservative) minimum read threshold of 37,000 for further analyses. Conversely, all June samples that failed to reach 37,000 reads did not pass the input concentration of 4 ng/µl, despite our attempt to balance their quantity during multiplexing.
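
      As a minimal sketch of the check described above, the following Python snippet flags samples against the two cut-offs named in the response (37,000 reads and roughly 4 ng/µl). The sample names and values are hypothetical placeholders, not data from the manuscript, and the function names are illustrative only.

```python
# Cut-offs quoted in the response; the concentration cut-off is approximate there.
MIN_READS = 37_000
MIN_CONC_NG_UL = 4.0


def flag_samples(samples):
    """Map each sample name to (passes_read_threshold, passes_concentration)."""
    return {
        name: (reads >= MIN_READS, conc >= MIN_CONC_NG_UL)
        for name, (reads, conc) in samples.items()
    }


def discordant(samples):
    """Samples that met the concentration cut-off but still fell short on reads."""
    flags = flag_samples(samples)
    return sorted(
        name for name, (ok_reads, ok_conc) in flags.items()
        if ok_conc and not ok_reads
    )


# Hypothetical (reads, 16S rDNA concentration in ng/µl) pairs for illustration.
example = {
    "sample_A": (52_000, 6.1),
    "sample_B": (12_500, 2.3),
    "sample_C": (30_000, 4.8),
}

print(discordant(example))
```

      Listing the discordant samples separately mirrors the "few exceptions" mentioned in the response, where adequate input concentration did not translate into sufficient reads.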

      We reason that such skews in the final barcode-specific read distribution would mainly arise from small concentration measurement errors, which undergo subsequent amplification during the upscaling with comparably large sample volume pipetting. While this can be compensated for by high overall flow cell throughput (e.g. see August-2, August-9.1, August-9.2), we think that future studies with much higher barcode numbers can circumvent this challenge by leveraging an exciting software solution: real-time selective sequencing via “Read Until”, as developed by Loose et al. (Nature Methods 2016, doi:10.1038/nmeth.3930). In the envisaged framework, incoming 16S read signals would be in situ screened for the sample-barcode which in our workflow is PCR-added to both the 5' and 3' end of each amplicon. Overrepresented barcodes would then be counterbalanced by targeted voltage inversion and pore "rejection" of such reads, until an even balance is reached. Lately, such methods have been computationally optimised, both through the usage of GPUs (Payne et al., bioRxiv 2020, https://doi.org/10.1101/2020.02.03.926956) and raw electrical signals (Kovaka et al., bioRxiv 2020, https://doi.org/10.1101/2020.02.03.931923).
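
      The balancing idea can be illustrated with a toy simulation. This is only a sketch of the decision rule under simplifying assumptions: it does not use the Read Until API, it assumes every incoming read's barcode is already known, and the function name and `max_lead` parameter are hypothetical.

```python
def balance_reads(barcode_stream, barcodes, max_lead=1):
    """Accept a read unless its barcode is already `max_lead` or more reads
    ahead of the least-covered barcode; otherwise count it as rejected."""
    accepted = {b: 0 for b in barcodes}
    rejected = 0
    for bc in barcode_stream:
        if accepted[bc] - min(accepted.values()) >= max_lead:
            rejected += 1          # overrepresented barcode: eject the read
        else:
            accepted[bc] += 1      # keep sequencing this read
    return accepted, rejected


# Hypothetical stream in which barcode "A" initially dominates.
stream = ["A", "A", "A", "B", "A", "B", "B"]
counts, n_rejected = balance_reads(stream, ["A", "B"])
print(counts, n_rejected)
```

      A real implementation would have to decide on partial signal in real time and handle barcodes that never appear, which this toy rule ignores.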

      2) It would be helpful if the authors could mention the amount (or proportion) of their sequenced 16S amplicons that provided species-level identification, since this is one of the advantages of nanopore sequencing.

      We wish to emphasize that we intentionally refrained from reporting the proportion of 16S rRNA reads that could be classified at species level, since we are wary of any automated species level assignments even if the full-length 16S rRNA gene is being sequenced. While we list the reasons for this below, we appreciate the interest in the theoretical proportion of reads at species level assignment. We therefore re-analyzed our dataset, and now also provide the ratio of reads that could be classified at species level using Minimap2 (pages 16-17, lines 308-314).

      To this end, we classified reads at species level if the species entry of the respective SILVA v.132 taxonomic ID was not empty and was neither "uncultured bacterium" nor "metagenome". Therefore, many unspecified classifications such as uncultured species of some bacterial genus are counted as species-level classifications, rendering our approach lenient towards a higher ratio of species level classifications. Still, the species level classification ratios remain low, on average at 16.2 % across all included river samples (genus-level: 65.6 %, family level: 76.6 %). The mock community, on the other hand, had a much higher species classification rate (>80 % in all three replicates), which is expected for a well-defined, well-referenced and divergent composition of only eight bacterial taxa, and thus re-validates our overall classification workflow.
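
      One possible reading of this rule, sketched in Python: a read counts as species-level if its species entry is non-empty and is neither "uncultured bacterium" nor "metagenome". The function names and the example labels are hypothetical; the actual parsing of SILVA taxonomy strings in the authors' workflow may differ.

```python
# Labels treated as uninformative at species level (from the rule above).
UNINFORMATIVE = {"uncultured bacterium", "metagenome"}


def is_species_level(species_entry):
    """True if the species field is present and not an uninformative label."""
    if species_entry is None:
        return False
    label = species_entry.strip().lower()
    return label != "" and label not in UNINFORMATIVE


def species_level_ratio(entries):
    """Fraction of reads whose species entry counts as species-level."""
    if not entries:
        return 0.0
    return sum(is_species_level(e) for e in entries) / len(entries)


# Hypothetical species entries, one per classified read.
reads = [
    "Escherichia coli",
    "uncultured bacterium",
    "",
    "metagenome",
    "uncultured Flavobacterium sp.",  # counted as species-level (lenient rule)
]
print(species_level_ratio(reads))
```

      The last entry shows why the rule is lenient: an uncultured member of a named genus still passes the filter.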

      On a theoretical level, we mainly refrain from automated across-the-board species level assignments because: (1) many species might differ by very few nucleotide differences within the 16S amplicon; distinguishing these from nanopore sequencing errors (here ~8 %) remains challenging; and (2) reference databases are incomplete and biased with respect to species level resolution, especially regarding certain environmental contexts; it is likely that species assignments would be guided by references available from more thoroughly studied niches than freshwater.

      Other recent studies have also shown that across-the-board species-level classification is not yet feasible with 16S nanopore sequencing, for example in comparison with Illumina data (Acharya et al., Scientific Reports 2019, doi:10.25405/data.ncl.9693533) which showed that “more reliable information can be obtained at genus and family level”, or in comparison with longer 16S-ITS-23S amplicons (Cusco et al., F1000Research 2019, doi: 10.12688/f1000research.16817.2), which “remarkably improved the taxonomy assignment at the species level”.

      3) It is not entirely clear how the authors define their core microbiome. Are they reporting mainly the most abundant taxa (dominant core microbiome), and would this change if you look at a taxonomic rank below the family level? How does the core compare, for example, with other studies of this same river?

      The here-presented core microbiome indeed represents the most abundant taxa, with relatively consistent profiles between samples. We used hierarchical clustering (Figure 4a, C2 and C4) on the bacterial family level, together with relative abundance to identify candidate taxa. Filtering these for median abundance > 0.1% across all samples resulted in 27 core microbiome families. To clarify this for the reader, we have added a new paragraph to the Material and Methods (section 2.7; page 29, lines 653-658).
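
      The abundance filter in this definition can be sketched as follows. This covers only the median-abundance step (median > 0.1 % across samples), not the preceding hierarchical clustering; the family names are real taxa but the abundance values are invented for illustration, and the function name is hypothetical.

```python
from statistics import median


def core_families(abundance, threshold_pct=0.1):
    """Return families whose median relative abundance (in %) across all
    samples exceeds `threshold_pct`.

    `abundance` maps family name -> list of per-sample relative abundances (%).
    """
    return sorted(
        family for family, values in abundance.items()
        if median(values) > threshold_pct
    )


# Invented per-sample abundances (%) for three samples.
example = {
    "Sphingomonadaceae": [3.2, 2.8, 4.1],
    "Listeriaceae": [0.0, 0.05, 0.9],     # one high sample, low median
    "Microbacteriaceae": [0.2, 0.15, 0.3],
}
print(core_families(example))
```

      Using the median rather than the mean keeps a family with a single high-abundance sample, such as the second entry here, out of the core set.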

      We have also performed the same analysis on the bacterial genus level and now display the top 27 most abundant genera (median abundance > 0.2%), together with their corresponding families and hierarchical clustering analysis in a new Supplementary Figure 4. Overall, high robustness is observed with respect to the families of the core microbiome: out of the top 16 core families (Figure 4b), only the NS11-12 marine group family is not represented by the top 27 most abundant genera (Supplementary Figure 4b). We reason that this is likely because its corresponding genera are composed of relatively poorly resolved references of uncultured bacteria, which could thus not be further classified.

      To the best of our knowledge, there are only two other reports that feature metagenomic data of the River Cam and its wastewater influx sources (Rowe et al., Water Science & Technology 2016, doi:10.2166/wst.2015.634; Rowe et al., Journal of Antimicrobial Chemotherapy 2017, doi:10.1093/jac/dkx017). While both of these primarily focus on the diversity and abundance of antimicrobial resistance genes using Illumina shotgun sequencing, they only provide limited taxonomic resolution on the river's core microbiome. Nonetheless, Rowe et al. (2016) specifically highlighted Sphingobium as the most abundant genus in a source location of the river (Ashwell, Hertfordshire). This genus belongs to the family of Sphingomonadaceae, which is also among the five most dominant families identified in our dataset. It thus forms part of what we define as the core microbiome of the River Cam (Figure 4b), and we have therefore highlighted this consistency in our manuscript's Discussion (page 17, lines 316-319).

      4) Please consider revising the amount of information in some of the figures (such as figure 2 and figure 3). The resulting images are tiny, the legends become lengthy and the overall impact is reduced. Consider splitting these or moving some information to the supplements.

      To follow this advice, we have split Figure 2 into two less compact figures. We have moved more detailed analyses of our classification tool benchmark to the supplement (now Supplementary Figure 1). Supplementary Figure 1 notably also contains a new summary of the systematic computational performance measurements of each classification tool (see minor suggestions).

      Moreover, we here suggest that the original Figure 3 may be divided into two figures: one to visualise the sequencing output, data downsampling and distribution of the most abundant families (now Figure 3), and the other featuring the clustering of bacterial families and associated core microbiome (now Figure 4). We think that both the data summary and clustering/core microbiome analyses are of particular interest to the reader, and that they should be kept as part of the main analyses rather than the supplement – however, we are certainly happy to discuss alternative ideas with the reviewers and editors.

      5) Given that the authors claim to provide a simple, fast and optimized workflow it would be good to mention how this workflow differs or provides faster and better analysis than previous work using amplicon sequencing with a MinION sequencer.

      Data throughput, sequencing error rates and flow cell stability have seen rapid improvements since the commercial release of MinION in 2015. In consequence, bioinformatics community standards regarding raw data processing and integration steps are still lacking, as illustrated by a thorough recent benchmark of fast5 to fastq format "basecalling" methods (Wick et al., Genome Biology 2019, doi: 10.1186/s13059-019-1727-y).

      Early on during our analyses, we noticed that a plethora of bespoke pipelines have been reported in recent 16S environmental surveys using MinION (e.g. Kerkhof et al., Microbiome 2017, 10.1186/s40168-017-0336-9; Cusco et al., F1000 Research 2018, 10.12688/f1000research.16817.2; Acharya et al., Scientific Reports 2019, 10.1038/s41598-019-51997-x; Nygaard et al., Scientific Reports 2020, doi: 10.1038/s41598-020-59771-0). This underlines a need for more unified bioinformatics standards of (full-length) 16S amplicon data treatment, while similar benchmarks exist for short-read 16S metagenomics approaches, as well as for nanopore shotgun sequencing (e.g. Ye et al., Cell 2019, doi: 10.1016/j.cell.2019.07.010; Latorre-Pérez et al., Scientific Reports 2020, doi:10.1038/s41598-020-70491-3).

      By adding a thorough speed and memory usage summary (new Supplementary Figure 1b), in addition to our (mis)classification performance tests based on both mock and complex microbial community analyses, we provide the reader with a broad overview of existing options. While the widely used Kraken 2 and Centrifuge methods provide exceptional speed, we find that this comes with a noticeable tradeoff in taxonomic assignment accuracy. We reason that Minimap2 alignments provide a solid compromise between speed and classification performance, with the MAPseq software offering a viable alternative should memory usage limitation apply to users.

      We intend to extend this benchmarking process to future tools, and to update it on our GitHub page (https://github.com/d-j-k/puntseq). This page notably also hosts a range of easy-to-use scripts for employing downstream 16S analysis and visualization approaches, including ordination, clustering and alignment tests.

      The revised Discussion now emphasises the specific advancements of our study with respect to freshwater analysis and more general standardisation of nanopore 16S sequencing, also in contrast to previous amplicon nanopore sequencing approaches in which only one or two bioinformatics workflows were tested (page 16, lines 297-306).

      They also mention that nanopore sequencing is an "inexpensive, easily adaptable and scalable framework". The term "inexpensive" doesn't seem appropriate since it is relative. In addition, they should also discuss that although it is technically convenient in some aspects compared to other sequencers, there are still protocol steps that need certain reagents and equipment that are similar or identical to those needed for other sequencing platforms. Common bottlenecks such as DNA extraction methods, sample preservation and the presence of inhibitory compounds should be mentioned.

      We agree with the reviewers that “inexpensive” is indeed a relative term, which needs further clarification. We therefore now state that this approach is “cost-effective” and discuss future developments such as the 96-sample barcoding kits and Flongle flow cells for small-scale water diagnostics applications, which will arguably render lower per-sample analysis costs in the future (page 18, lines 361-365).

      Other investigators (e.g. Boykin et al., Genes 2019, doi:10.3390/genes10090632; Acharya et al., Water Technology 2020, doi:10.1016/j.watres.2020.116112) have recently shown that the full application of DNA extraction and in-field nanopore sequencing can be achieved at comparably low expense: Boykin et al. studied cassava plant pathogens using barcoded nanopore shotgun sequencing, and estimated costs of ~45 USD per sample, while we calculate ~100 USD per sample in this study. Acharya et al. undertook in situ water monitoring between Birtley, UK and Addis Ababa, Ethiopia, estimated ~75-150 USD per sample and purchased all necessary equipment for ~10,000 GBP – again, we think that this lies roughly within a similar range as our (local) study's total cost of ~3,670 GBP (Supplementary Table 6).

      The revised manuscript now mentions the possibility of increasing sequencing yield by improving DNA extraction methods, by taking sample storage and potential inhibitory compounds into account in the planning phase (page 18, lines 348-352).

      Minor points:

      -Please include a reference to the statement saying that the river Cam is notorious for the "infections such as leptospirosis".

      There are indeed several media reports that link leptospirosis risk to the local River Cam (e.g. https://www.cambridge-news.co.uk/news/cambridge-news/weils-disease-river-cam-leptosirosis-14919008 or https://www.bbc.com/news/uk-england-cambridgeshire-29060018). As we, however, did not find a scientific source for this information, we have slightly adjusted the statement in our manuscript from referring to Cambridge to instead referring to the entire United Kingdom. Accordingly, we now cite two reports from Public Health England (PHE) about serial leptospirosis prevalence in the United Kingdom (page 13, lines 226-227).

      -Please check figure 7 for consistency across panels, such as shading in violet and labels on the figures that do not seem to correspond with what is stated in the legend. Please mention what the numbers correspond to in outer ring. Check legend, where it says genes is probably genus.

      Thank you for pointing this out. We have revised (now labelled) Figure 8 and removed all inconsistencies between the panels. The legend has also been updated, which now includes a description of the number labelling of the tree, and a clearer differentiation between the colour coding of the tree nodes and the background highlighting of individual nanopore reads.

      -Page 6. There is a "data not shown" comment in the text: "Benchmarking of the classification tools on one aquatic sample further confirmed Minimap2's reliable performance in a complex bacterial community, although other tools such as SPINGO (Allard, Ryan, Jeffery, & Claesson, 2015), MAPseq (Matias Rodrigues, Schmidt, Tackmann, & von Mering, 2017), or IDTAXA (Murali et al., 2018) also produced highly concordant results despite variations in speed and memory usage (data not shown)." There appears to be no good reason that this data is not shown. In case the speed and memory usage was not recorded, it is advisable to rerun the analysis and quantify these variables, rather than mentioning them and not reporting them. Otherwise, provide an explanation for not showing the data please.

      This is a valid point, and we agree with the reviewers that it is worth properly following up on this initial observation. To this end, our revised manuscript now entails a systematic characterisation of the twelve tools' runtime and memory usage performance. This has been added as Supplementary Figure 1b and under the new Materials and Methods section 2.2.4 (page 26, lines 556-562), while the corresponding results and their implications are discussed on page 16, lines 301-306. Particularly with respect to the runtime measurements, it is worth noting that these can differ by several orders of magnitude between the classifiers, thus providing an additional clarification on our choice of the - relatively fast - Minimap2 alignments.

      -In Figure 4, it would be important to calculate if the family PCA component contribution differences in time are differentially significant. In Panel B, depicted is the most evident variance difference but what about other taxa which might not be very abundant but differ in time? One can use the fitFeatureModel function from the metagenomeSeq R library and a P-adjusted threshold value of 0.05, to validate abundance differences in addition to your analysis.

      To assess whether the PC component contribution of Figure 5 (previously Figure 4) differed significantly between the three time points, we have applied non-parametric tests to all season-grouped samples except for the mock community controls. We first applied a Kruskal-Wallis H-test for independent samples, followed by post-hoc comparisons using two-sided Mann-Whitney U rank tests.

      The Kruskal-Wallis test established a significant difference in PC component contributions between the three time points (p = 0.0049), with most of the difference stemming from divergence between April and August samples according to the post-hoc tests (p = 0.0022). The June samples seemed to be more similar to the August ones (p = 0.66) than to the ones from April (p = 0.11), recapitulating the results of our hierarchical clustering analysis (Figure 4a).
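
      The two-step procedure described above (an omnibus Kruskal-Wallis test followed by pairwise two-sided Mann-Whitney U tests) can be sketched as follows. This is a minimal illustration using SciPy, not the analysis code behind the reported p-values: the `compare_groups` helper and the score values are hypothetical, and no multiple-testing correction is applied.

```python
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu


def compare_groups(groups):
    """Omnibus Kruskal-Wallis test, then pairwise Mann-Whitney U tests.

    groups: dict mapping a group name to a list of numeric values.
    Returns the omnibus p-value and a dict of pairwise p-values.
    """
    _, p_omnibus = kruskal(*groups.values())
    pairwise = {}
    for a, b in combinations(groups, 2):
        _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        pairwise[(a, b)] = p
    return p_omnibus, pairwise


# Hypothetical per-sample PC scores for three sampling months (made-up data).
scores = {
    "April": [-2.1, -1.8, -2.4, -1.9, -2.2, -1.7],
    "June": [0.4, 0.9, 0.2, 0.7, 1.1, 0.5],
    "August": [0.6, 1.0, 0.3, 0.8, 1.2, 0.45],
}

p_omnibus, pairwise = compare_groups(scores)
print(f"Kruskal-Wallis p = {p_omnibus:.4f}")
for (a, b), p in pairwise.items():
    print(f"{a} vs {b}: p = {p:.4f}")
```

      With real data, the pairwise p-values would usually be adjusted for multiple comparisons before interpretation.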

      We have followed the reviewers' advice and applied a complementary approach, using the fitFeatureModel of metagenomeSeq to fit a zero-inflated log-normal mixture model of each bacterial taxon against the time points. As only three independent variables can be accounted for by the model (including the intercept), we have chosen to investigate the difference between the spring (April) and summer (June, August) months to capture the previously identified difference between these months. At a nominal P-value threshold of 0.05, this analysis identifies seven families that differ significantly in their relative composition between spring and summer, namely Cyanobiaceae, Armatimonadaceae, Listeriaceae, Carnobacteriaceae, Azospirillaceae, Cryomorphaceae, and Microbacteriaceae. Three out of these seven families were also detected by the PCA component analysis (Carnobacteriaceae, Azospirillaceae, Microbacteriaceae) and two more (Listeriaceae, Armatimonadaceae) occurred in the top 15% of that analysis (out of 357 families).

      This approach represents a useful validation of our principal component analysis' capture of likely seasonal divergence, but moreover allows for a direct assessment of differential bacterial composition across time points. We have therefore integrated the analysis into our manuscript (page 10, lines 184-186; Materials and Methods section 2.6, page 29, lines 641-647) – thank you again for this suggestion.

      -Page 12-13. In the paragraph: "Using multiple sequence alignments between nanopore reads and pathogenic species references, we further resolved the phylogenies of three common potentially pathogenic genera occurring in our river samples, Legionella, Salmonella and Pseudomonas (Figure 7a-c; Material and Methods). While Legionella and Salmonella diversities presented negligible levels of known harmful species, a cluster of reads in downstream sections indicated a low abundance of the opportunistic, environmental pathogen Pseudomonas aeruginosa (Figure 7c). We also found significant variations in relative abundances of the Leptospira genus, which was recently described to be enriched in wastewater effluents in Germany (Numberger et al., 2019) (Figure 7d)."

      Here it is important to mention the relative abundance in the sample. While no further experiments are needed, the authors should mention and discuss that the presence of DNA from pathogens in the sample has to be confirmed by other microbiology methodologies, to validate if there are viable organisms. Definitively, it is a big warning finding pathogen's DNA but also, since it is characterized only at genus level, further investigation using whole metagenome shotgun sequencing or isolation, would be important.

      We agree that further microbiological assays, particularly target-specific species isolation and culturing, would be essential to validate the presence of living pathogenic bacteria. Accordingly, our revised Discussion now contains a paragraph that encourages such experiments as part of the design of future studies (with a fully-equipped laboratory infrastructure); page 17, 338-341.

      -Page 15: "This might help to establish this family as an indicator for bacterial community shifts along with water temperature fluctuations."

      Temperature might not be the main factor for the shift. There could be other factors that were not measured that could contribute to this shift. There are several parameters that are not measured and are related to water quality (COD, organic matter, PO4, etc).

      We agree that this was a simplified statement, given our currently limited number of samples, and have therefore slightly expanded on this point (page 17, lines 323-325). It is indeed possible that differential Carnobacteriaceae abundances between the time point measurements may have arisen not as a consequence of temperature fluctuations (alone), but instead as a consequence of the observed hydrochemical changes such as Ca2+, Mg2+, HCO3- (Figure 6b-c) or possibly even water flow speed reductions (Supplementary Figure 6d).

      -"A number of experimental intricacies should be addressed towards future nanopore freshwater sequencing studies with our approach, mostly by scrutinising water DNA extraction yields, PCR biases and molar imbalances in barcode multiplexing (Figure 3a; Supplementary Figure 5)."

      Here you could elaborate more on the challenges, as mentioned previously.

      We realise that we had not discussed the challenges in enough detail, and the Discussion now contains a substantially more detailed description of these intricacies (page 18, lines 343-359).

    1. Author Response

      Reviewer #1:

      Summary:

      In this paper, the authors utilize CRISPR-Cas9 to generate two different DMD cell lines. The first is a DMD human myoblast cell line that lacks exon 52 within the dystrophin gene. The second is a DMD patient cell line that is missing miRNA binding sites within the regulatory regions of the utrophin gene, resulting in increased utrophin expression. Then, the authors proceeded to test antisense oligonucleotides and utrophin up-regulators in these cell lines.

      Overall opinion (expanded in more detail below).

      The paper suffers from the following weaknesses:

      1) The protocol used to generate the myoblast cell lines is rather inefficient and is not new.

      2) Many of the data figures are of low quality and are missing proper controls (detailed in points 5,7,10, 12, 13,14)

      Detailed critiques:

      1) The title needs to be changed. The method used by the authors is inefficient. The title should instead focus on the two cell lines generated.

      We appreciate the reviewer’s comments: thanks to them, we have realized the focus of the manuscript should be on the new models we described and less on the methodology used to create them.

      Originally, we wanted to share the problems we faced when applying new CRISPR/Cas9 editing techniques to myoblasts: our conversations with other researchers in the field confirmed that many were having similar problems. However, the reviewer is right that there are many ways around this problem. We do describe ours, and we are working on a new version of the manuscript with additional data to characterize our new models further, in which the method used to create them, although included, is not the main focus. In this new version we will change the title accordingly.

      2) Line 104: The authors declare that the efficiency of CRISPR/Cas9 is currently too low to provide therapeutic benefit for DMD in vivo. There are lots of papers that show efficient recovery of dystrophin in small and large animals following CRISPR/Cas9 therapy. The authors should cite them properly.

      Thank you for your appreciation. We have reviewed the literature again to include new evidence of efficient dystrophin recovery as well as other studies with lower efficiency.

      3) Figures 1, 2,3, and 4 can be merged into one figure.

      4) Figure 2A and 2B can be moved to supplementary.

      5) Figure 2C and 2D are not clear. Are the duplicates the same? Please invert the black and white colors of the blots.

      Thank you for your comments. We have inverted the colors of the blots and changed the marks used in figures 2C and 2D to clarify that duplicates are indeed the same sample, assayed in duplicate. We have also merged figures 1 and 4 and moved figures 2 and 3 to supplementary in this new version.

      6) Figure 3: In order to optimize the efficiency of myoblast transfection, the plasmids containing the Cas9 and the sgRNA should have different fluorophores (GFP and mCherry). This approach would increase the percentage of positive edited clones among the clones sorted.

      We think the reviewer may have misunderstood our methodology: we are not using one plasmid with the Cas9 and another with the sgRNA; we are using two plasmids, both containing Cas9 and each a different sgRNA. We did try to use two different plasmids, one expressing GFP and one expressing puromycin resistance, but we found that single GFP-positive cell selection plus puromycin selection was too inefficient. We could have tried two different fluorophores, but we tested the tools we had in hand first and were successful at obtaining enough clones to continue with their characterization, so we proceeded with that rather than further optimizing our editing protocol.

      7) Figure 4A: In the text, the authors state that only 1 clone had the correct genomic edit, but the PCR genotyping in this figure shows at least 2 positive clones (number 4 and 7).

      Thank you for your appreciation. As you said, we got two positive clones (as we also indicate in figure 3B) but we completed the full characterization of one of them (clone number 7= DMD-UTRN-Model). In the new version of the manuscript we explain this further.

      8) Figure 4C: The authors should address whether one or both copies of the UTRN gene was edited in their clones.

      Thank you for your comment. Both copies of the UTRN gene were edited in our clones. We have included this information both in the text and in the figure 4 legend.

      9) Figure 4 B and D: The authors should report the sequence below the electropherograms.

      Thank you for this correction, we have included the sequence under the electropherograms.

      10) Figure 5B: This western blot is of poor quality. Also, the authors should specify that the samples are differentiated myoblasts. Lastly, a standard protein should be included as a loading control.

      Thank you for your comment. Poor quality of dystrophin and utrophin western blots was the main reason to validate a new method in our laboratory to measure these proteins directly in cell culture (1) as an alternative to western blotting. Since then, the myoblot method has been routinely used by us and in collaboration with other groups and companies. We included the western blot as it is sometimes easier for those used to this technique to be able to assess a blot in which there is no dystrophin expression. As you pointed out, our samples were all differentiated myotubes, not myoblasts, and we have modified this accordingly. Thank you very much for pointing out this mistake.

      On the other hand, as described in the methods, Revert TM 700 Total Protein Stain (Li-Cor) and alpha-actinin were included as standards in dystrophin and utrophin western blots, respectively.

      11) Figure 5E: We would like to see triplicates for the level of Utrophin expression.

      We thank the reviewer for his/her recommendation, but we do not consider western blotting a good quantitative technique; we have included western blots only to show the expression or absence of the protein. We have included many more replicates than needed to show the level of utrophin by myoblots. We acknowledge that western blotting is the preferred method for some reviewers, so in the new version of our manuscript we clearly indicate the value we give to each technique, with myoblots being our choice for quantification.

      12) Figure 6: A dystrophin western blot should be included to demonstrate protein recovery following antisense oligonucleotide treatment. Also, the RT-PCR data could be biased as you can have preferential amplification of shorter fragments.

      Thank you for your recommendation but as we have explained before, myoblots have been validated in our laboratory to replace western blot for accurate dystrophin quantification in cell culture.

      13) Figure 6A: Invert the black and white colors. The authors should also report the control sequences and sequences of the clones under the electropherograms.

      Thank you for your suggestion, we have inverted the colors and added the sequences under the electropherograms.

      14) Figure 6B: Control myoblasts should be included in figure 5C.

      Thank you for this correction, we will include control myoblasts in the new manuscript version.

      15) Figure S2A: Invert the black and white colors.

      Thank you for your suggestion, we have inverted the colors.

      Reviewer #2:

      The work from Soblechero-Martín et al reports the generation of a human DMD line deleted for exon 52 using CRISPR technology. In addition, the authors introduced a second mutation that leads to upregulation of utrophin, a protein similar to dystrophin, which has been considered as a therapeutic surrogate. The authors provide a careful description of the methodology used to generate the new cell line and have conducted meticulous evaluations to test the validity of the reagents.

      However, if the main purpose of this cell line is to perform drug or small molecule compound screenings, a single line might not be sufficient to draw robust conclusions. The generation of additional DMD lines in different genetic backgrounds using the reagents developed in this study will strengthen the work and will be of interest to the DMD field.

      Thank you for your appreciation. We think that a well-characterized immortalized culture, like the one we describe, is sufficient for compound screening, as described in other recently published studies (2), (3). Regarding the other suggestion, we have indeed used our method to generate other cultures for collaborators, but they will be reported in their own publications, as they are interested in them as tools in their own research projects.

      Further, the future use of the edited DMD line with upregulated utrophin is unclear. The utrophin upregulation adds a complexity to this line that might complicate the assessment of screened compounds. In contrast, this line could be used to test if overexpression of utrophin generates myotubes that produce increased force compared to the control DMD line.

      We think we may have not explained our screening platform well enough. Our suggestion is to offer our newly generated culture ALONGSIDE the original unedited culture: the original is treated with potential drug candidates, while the new one may or may not be treated, if these drug candidates are thought to act by activating the edited region (see an example in the figure below). In this case, the new culture will be a reliable positive control to the effects that may be reported in the unedited cultures by the drug candidates. We will make this clear in the new version of the manuscript.

      Created with BioRender.com

      In summary, while there is support and enthusiasm for the techniques and methodological approach of the study, the future use of this single line might be dubious and could be strengthened if additional lines are generated.

      We share the reviewer’s enthusiasm for this approach, and we have included in the new version of the manuscript further characterization of this new cell culture that we think would demonstrate its usefulness better.

    1. Author Response:

      Evaluation Summary:

      Since DBS of the habenula is a new treatment, these are the first data of its kind and potentially of high interest to the field. Although the study mostly confirms findings from animal studies rather than bringing up completely new aspects of emotion processing, it certainly closes a knowledge gap. This paper is of interest to neuroscientists studying emotions and clinicians treating psychiatric disorders. Specifically the paper shows that the habenula is involved in processing of negative emotions and that it is synchronized to the prefrontal cortex in the theta band. These are important insights into the electrophysiology of emotion processing in the human brain.

      The authors are very grateful for the reviewers’ positive comments on our study. We also thank all the reviewers for the comments, which have helped to improve the manuscript.

      Reviewer #1 (Public Review):

      The study by Huang et al. reports on direct recordings (using DBS electrodes) from the human habenula in conjunction with MEG recordings in 9 patients. Participants were shown emotional pictures. The key finding was a transient increase in theta/alpha activity with negative compared to positive stimuli. Furthermore, there was a later increase in oscillatory coupling in the same band. These are important data, as there are few reports of direct recordings from the habenula together with the MEG in humans performing cognitive tasks. The findings do provide novel insight into the network dynamics associated with the processing of emotional stimuli and in particular the role of the habenula.

      Recommendations:

      How can we be sure that the recordings from the habenula are not contaminated by volume conduction; i.e. signals from neighbouring regions? I do understand that bipolar signals were considered for the DBS electrode leads. However, high-frequency power (gamma band and up) is often associated with spiking/MUA and considered less prone to volume conduction. I propose to also investigate the high-frequency gamma band activity recorded from the bipolar DBS electrodes and relate it to the emotional faces. This will provide more certainty that the measured activity indeed stems from the habenula.

      We thank the reviewer for the comment. As the reviewer pointed out, bipolar macroelectrode recordings can detect locally generated potentials, as demonstrated for recordings from the subthalamic nucleus, especially when the macroelectrodes are inside the subthalamic nucleus (Marmor et al., 2017). However, considering the size of the habenula and the size of the DBS electrode contacts, we have to acknowledge that we cannot completely exclude the possibility that the recordings are contaminated by volume conduction of activities from neighbouring areas, as shown in Bertone-Cueto et al. 2019. We have now added extra information about the size of the habenula and acknowledged the potential contamination of activities from neighbouring areas through volume conduction in the ‘Limitation’:

      "Another caveat we would like to acknowledge is that the human habenula is a small region. Existing data from structural MRI scans reported combined habenula (the sum of the left and right hemispheres) volumes of ~30–36 mm³ (Savitz et al., 2011a; Savitz et al., 2011b), which means each habenula measures 2–3 mm in each dimension, which may be even smaller than the standard functional MRI voxel size (Lawson et al., 2013). The habenula is also small relative to the standard DBS electrodes (as shown in Fig. 2A). The electrodes used in this study (Medtronic 3389) have an electrode diameter of 1.27 mm, a contact length of 1.5 mm, and a contact spacing of 0.5 mm. We have tried different ways to confirm the location of the electrode and to select the contacts that are within or closest to the habenula: 1.) the MRI was co-registered with a CT image (General Electric, Waukesha, WI, USA) with the Leksell stereotactic frame to obtain the coordinate values of the tip of the electrode; 2.) post-operative CT was co-registered to pre-operative T1 MRI using a two-stage linear registration in Lead-DBS software. We used bipolar signals constructed from neighbouring macroelectrode recordings, which have been shown to detect locally generated potentials from the subthalamic nucleus, especially when the macroelectrodes are inside the subthalamic nucleus (Marmor et al., 2017). Considering that not all contacts for bipolar LFP construction are in the habenula in this study, as shown in Fig. 2, we cannot exclude the possibility that the activities we measured are contaminated by activities from neighbouring areas through volume conduction. In particular, the human habenula is surrounded by thalamus and adjacent to the posterior end of the medial dorsal thalamus, so we may have captured activities from the medial dorsal thalamus.
However, we also showed that those bipolar LFPs from contacts in the habenula tend to have a peak in the theta/alpha band in the power spectral density (PSD), whereas recordings from contacts outside the habenula tend to have an extra peak in the beta frequency band in the PSD. This supports the habenula origin of the emotional valence related changes in the theta/alpha activities reported here."

      We have also looked at gamma band oscillations or high frequency activities in the recordings. However, we didn’t observe any peak in the high frequency band in the average power spectral density, or any consistent difference in the high frequency activities induced by the emotional stimuli (Fig. S1). We suspect that high frequency activities related to MUA/spiking are very local and have very small amplitude, so they are not picked up by bipolar LFPs measured from contacts whose contact area and between-contact spacing are both large relative to the size of the habenula.

      Figure S1. (A) Power spectral density of habenula LFPs across all time periods when emotional stimuli were presented. The bold blue line and shadowed region indicate the mean ± SEM across all recorded hemispheres and the thin grey lines show measurements from individual hemispheres. (B) Time-frequency representations of the power response relative to pre-stimulus baseline for different conditions showing that habenula gamma and high frequency activity are not modulated by emotional

      References:

      Savitz JB, Bonne O, Nugent AC, Vythilingam M, Bogers W, Charney DS, et al. Habenula volume in post-traumatic stress disorder measured with high-resolution MRI. Biology of Mood & Anxiety Disorders 2011a; 1(1): 7.

      Savitz JB, Nugent AC, Bogers W, Roiser JP, Bain EE, Neumeister A, et al. Habenula volume in bipolar disorder and major depressive disorder: a high-resolution magnetic resonance imaging study. Biological Psychiatry 2011b; 69(4): 336-43.

      Lawson RP, Drevets WC, Roiser JP. Defining the habenula in human neuroimaging studies. NeuroImage 2013; 64: 722-7.

      Marmor O, Valsky D, Joshua M, Bick AS, Arkadir D, Tamir I, et al. Local vs. volume conductance activity of field potentials in the human subthalamic nucleus. Journal of Neurophysiology 2017; 117(6): 2140-51.

      Bertone-Cueto NI, Makarova J, Mosqueira A, García-Violini D, Sánchez-Peña R, Herreras O, et al. Volume-Conducted Origin of the Field Potential at the Lateral Habenula. Frontiers in Systems Neuroscience 2019; 13:78.

      Figure 3: the alpha/theta band activity is very transient and not band-limited. Why refer to this as oscillatory? Can you exclude that the TFRs of power reflect the spectral power of ERPs rather than modulations of oscillations? I propose to also calculate the ERPs and perform the TFR of power on those. This might result in a re-interpretation of the early effects in theta/alpha band.

      We agree with the reviewer that the activity increase in the first time window, with short latency after stimulus onset, is very transient and not band-limited. This raises the question of whether it is oscillatory or a transient evoked activity. We have now looked at this initial transient activity in different ways: 1.) We quantified the ERP in LFPs locked to stimulus onset for each emotional valence condition and for each habenula. We investigated whether there was a difference in the amplitude or latency of the ERP between emotional valence conditions. As shown in the following figure, there is an ERP after stimulus onset with a positive peak at 402 ± 27 ms (neutral stimuli), 407 ± 35 ms (positive stimuli), and 399 ± 30 ms (negative stimuli). The following figure (Fig. 3–figure supplement 1) will be submitted as a figure supplement related to Fig. 3. However, there was no significant difference in ERP latency or amplitude between emotional valence conditions. 2.) We quantified the pure non-phase-locked (induced only) power spectra by calculating the time-frequency power spectrogram after subtracting the ERP (the time-domain trial average) from the time-domain neural signal on each trial (Kalcher and Pfurtscheller, 1995; Cohen and Donner, 2013). This gives results very similar to those reported in the main manuscript, as shown in Fig. 3–figure supplement 2. These further analyses show that, even though there were event-related potential changes time-locked to stimulus onset, this ERP did not contribute to the initial broadband activity increase in the early time window shown in plots A-C of Figure 3. The figures of the new analyses and the following text have now been added to the main text:

      "In addition, we tested whether stimuli-related habenula LFP modulations primarily reflect a modulation of oscillations, which is not phase-locked to stimulus onset, or, alternatively, whether they are attributable to an evoked event-related potential (ERP). We quantified the ERP for each emotional valence condition for each habenula. There was no significant difference in ERP latency or amplitude caused by different emotional valence stimuli (Fig. 3–figure supplement 1). In addition, when only considering the non-phase-locked activity by removing the ERP from the time series before time-frequency decomposition, the emotional valence effect (presented in Fig. 3–figure supplement 2) is very similar to that shown in Fig. 3. These additional analyses demonstrated that the emotional valence effect in the LFP signal is more likely to be driven by non-phase-locked (induced only) activity."

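      The ERP-subtraction step described above can be illustrated with a small sketch. It assumes NumPy, synthetic single-channel trials, and a plain FFT power spectrum in place of the time-frequency decomposition used in the manuscript; the function names `remove_evoked` and `mean_power`, the sampling rate, and the signal frequencies are all hypothetical choices made for illustration.

```python
import numpy as np


def remove_evoked(trials):
    """Subtract the trial-averaged ERP from every trial.

    trials: array of shape (n_trials, n_samples).
    """
    erp = trials.mean(axis=0, keepdims=True)  # phase-locked (evoked) part
    return trials - erp                       # non-phase-locked residual


def mean_power(trials, fs):
    """Return frequencies and trial-averaged FFT power."""
    freqs = np.fft.rfftfreq(trials.shape[1], d=1.0 / fs)
    power = (np.abs(np.fft.rfft(trials, axis=1)) ** 2).mean(axis=0)
    return freqs, power


# Synthetic data (assumed values): a 3 Hz component with identical phase on
# every trial (evoked) plus an 8 Hz component with random phase (induced).
rng = np.random.default_rng(0)
fs, n_trials = 250, 200
t = np.arange(fs) / fs  # 1 s per trial
phases = rng.uniform(0, 2 * np.pi, size=(n_trials, 1))
trials = (np.sin(2 * np.pi * 3 * t)
          + np.sin(2 * np.pi * 8 * t + phases)
          + 0.1 * rng.standard_normal((n_trials, fs)))

freqs, total = mean_power(trials, fs)
_, induced = mean_power(remove_evoked(trials), fs)
i3, i8 = np.argmin(np.abs(freqs - 3)), np.argmin(np.abs(freqs - 8))
print(f"3 Hz power retained: {induced[i3] / total[i3]:.3f}")
print(f"8 Hz power retained: {induced[i8] / total[i8]:.3f}")
```

      In this toy example the phase-locked 3 Hz component is removed almost entirely, while the random-phase 8 Hz component is largely preserved, which is the logic behind treating the residual as induced activity.
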
      Fig. 3–figure supplement 1. Event-related potential (ERP) in habenula LFP signals in different emotional valence (neutral, positive and negative) conditions. (A) Averaged ERP waveforms across patients for different conditions. (B) Peak latency and amplitude (Mean ± SEM) of the ERP components for different conditions.

      Fig. 3–figure supplement 2. Non-phase-locked activity in different emotional valence (neutral, positive and negative) conditions (N = 18). (A) Time-frequency representation of the power changes relative to pre-stimulus baseline for three conditions. Significant clusters (p < 0.05, non-parametric permutation test) are encircled with a solid black line. (B) Time-frequency representation of the power response difference between negative and positive valence stimuli, showing significantly increased activity in the theta/alpha band (5-10 Hz) at short latency (100-500 ms) and another increase in theta activity (4-7 Hz) at long latencies (2700-3300 ms) with negative stimuli (p < 0.05, non-parametric permutation test). (C) Normalized power of the activities at theta/alpha (5-10 Hz) and theta (4-7 Hz) band over time. Significant difference between the negative and positive valence stimuli is marked by a shadowed bar (p < 0.05, corrected for multiple comparison).

      References:

      Kalcher J, Pfurtscheller G. Discrimination between phase-locked and non-phase-locked event-related EEG activity. Electroencephalography and Clinical Neurophysiology 1995; 94(5): 381-4.

      Cohen MX, Donner TH. Midfrontal conflict-related theta-band power reflects neural oscillations that predict behavior. Journal of Neurophysiology 2013; 110(12): 2752-63.

      Figure 4D: can you exclude that the frontal activity is not due to saccade artifacts? Only eye blink artifacts were reduced by the ICA approach. Trials with saccades should be identified in the MEG traces and rejected prior to further analysis.

      We understand and appreciate the reviewer’s concern on the source of the activity modulations shown in Fig. 4D. We tried to minimise eye movements and saccades during the recording by presenting all figures at the centre of the screen, scaling all presented figures to a similar size, and presenting a white cross at the centre of the screen to prepare the participants for the onset of the stimuli. Despite this, participants may still have made eye movements and saccades during the recording. We used ICA to exclude the low-frequency, large-amplitude artefacts, which can be related to either eye blinks or other large eye movements. However, this may not be able to exclude artefacts related to miniature saccades. As shown in Fig. 4D, on the sensor level, the sensors with a significant difference between the negative and positive emotional valence conditions clustered around the frontal cortex, close to the eye area. However, we think this is not dominated by saccades for the following two reasons:

      1.) The power spectrum of the saccadic spike artifact in MEG is characterized by a broadband peak in the gamma band from roughly 30 to 120 Hz (Yuval-Greenberg et al., 2008; Keren et al., 2010). In this study, the activity modulation we observed in the frontal sensors is limited to the theta/alpha frequency band, so it differs from the power spectrum of the saccadic spike artefact.

      2.) The source of the saccadic spike artefacts in MEG measurements tends to be localized to the region of the extraocular muscles of both eyes (Carl et al., 2012). We used beamforming source localisation to identify the source of the activity modulation reported in Fig. 4D. This beamforming analysis identified the source to be in Brodmann areas 9 and 10 (shown in Fig. 5). This excludes the possibility that the activity modulation at the sensor level reported in Fig. 4D is due to saccades. In addition, Brodmann areas 9 and 10 have previously been associated with emotional stimulus processing (Bermpohl et al., 2006), and Brodmann area 9 in the left hemisphere has also been used as the target for repetitive transcranial magnetic stimulation (rTMS) as a treatment for drug-resistant depression (Cash et al., 2020). The source localisation results, together with previous literature on the function of the identified source area, suggest that the activity modulation we observed in the frontal cortex is very likely to be related to emotional stimuli processing.

      References:

      Yuval-Greenberg S, Tomer O, Keren AS, Nelken I, Deouell LY. Transient induced gamma-band response in EEG as a manifestation of miniature saccades. Neuron 2008; 58(3): 429-41.

      Keren AS, Yuval-Greenberg S, Deouell LY. Saccadic spike potentials in gamma-band EEG: characterization, detection and suppression. NeuroImage 2010; 49(3): 2248-63.

      Carl C, Acik A, Konig P, Engel AK, Hipp JF. The saccadic spike artifact in MEG. NeuroImage 2012; 59(2): 1657-67.

      Bermpohl F, Pascual-Leone A, Amedi A, Merabet LB, Fregni F, Gaab N, et al. Attentional modulation of emotional stimulus processing: an fMRI study using emotional expectancy. Human Brain Mapping 2006; 27(8): 662-77.

      Cash RFH, Weigand A, Zalesky A, Siddiqi SH, Downar J, Fitzgerald PB, et al. Using Brain Imaging to Improve Spatial Targeting of Transcranial Magnetic Stimulation for Depression. Biological Psychiatry 2020.

      The coherence modulations in Fig 5 occur quite late in time compared to the power modulations in Fig 3 and 4. When discussing the results (in e.g. the abstract) it reads as if these findings are reflecting the same process. How can the two effects reflect the same process if the timing is so different?

      As the reviewer correctly pointed out, the coherence modulations occurred quite late compared to the initial power modulations in the frontal cortex and the habenula (Fig. 4). There was another increase in theta band activity in the habenula even later, at around 3 seconds after stimulus onset, when the emotional figure had already disappeared. An emotional response is composed of a number of factors, two of which are the initial reactivity to an emotional stimulus and the subsequent recovery once the stimulus terminates or ceases to be relevant (Schuyler et al., 2014). We think the neural effects we observed in the three different time windows may reflect different underlying processes. We have discussed this in the ‘Discussion’:

      "These activity changes at different time windows may reflect the different neuropsychological processes underlying emotion perception including identification and appraisal of emotional material, production of affective states, and autonomic response regulation and recovery (Phillips et al., 2003a). The later effects of increased theta activities in the habenula when the stimuli disappeared were also supported by other literature showing that, there can be prolonged effects of negative stimuli in the neural structure involved in emotional processing (Haas et al., 2008; Puccetti et al., 2021). In particular, greater sustained patterns of brain activity in the medial prefrontal cortex when responding to blocks of negative facial expressions was associated with higher scores of neuroticism across participants (Haas et al., 2008). Slower amygdala recovery from negative images also predicts greater trait neuroticism, lower levels of likability of a set of social stimuli (neutral faces), and declined day-to-day psychological wellbeing (Schuyler et al., 2014; Puccetti et al., 2021)."

      References:

      Schuyler BS, Kral TR, Jacquart J, Burghy CA, Weng HY, Perlman DM, et al. Temporal dynamics of emotional responding: amygdala recovery predicts emotional traits. Social Cognitive and Affective Neuroscience 2014; 9(2): 176-81.

      Phillips ML, Drevets WC, Rauch SL, Lane R. Neurobiology of emotion perception I: The neural basis of normal emotion perception. Biological Psychiatry 2003a; 54(5): 504-14.

      Haas BW, Constable RT, Canli T. Stop the sadness: Neuroticism is associated with sustained medial prefrontal cortex response to emotional facial expressions. NeuroImage 2008; 42(1): 385-92.

      Puccetti NA, Schaefer SM, van Reekum CM, Ong AD, Almeida DM, Ryff CD, et al. Linking Amygdala Persistence to Real-World Emotional Experience and Psychological Well-Being. Journal of Neuroscience 2021: JN-RM-1637-20.

      Be explicit on the degrees of freedom in the statistical tests given that one subject was excluded from some of the tests.

      We thank the reviewers for the comment. The number of samples used for each statistical analysis is stated in the title of the figures. We have now also added the degrees of freedom in the main text wherever parametric statistical tests such as t-tests or ANOVAs have been used. Where permutation tests (which do not have any associated degrees of freedom) are used, we have now added the number of samples for the permutation test.

      Reviewer #2 (Public Review):

      In this study, Huang and colleagues recorded local field potentials from the lateral habenula in patients with psychiatric disorders who recently underwent surgery for deep brain stimulation (DBS). The authors combined these invasive measurements with non-invasive whole-head MEG recordings to study functional connectivity between the habenula and cortical areas. Since the lateral habenula is believed to be involved in the processing of emotions, and negative emotions in particular, the authors investigated whether brain activity in this region is related to emotional valence. They presented pictures inducing negative and positive emotions to the patients and found that theta and alpha activity in the habenula and frontal cortex increases when patients experience negative emotions. Functional connectivity between the habenula and the cortex was likewise increased in this band. The authors conclude that theta/alpha oscillations in the habenula-cortex network are involved in the processing of negative emotions in humans.

      Because DBS of the habenula is a new treatment tested in this cohort in the framework of a clinical trial, these are the first data of its kind. Accordingly, they are of high interest to the field. Although the study mostly confirms findings from animal studies rather than bringing up completely new aspects of emotion processing, it certainly closes a knowledge gap.

      In terms of community impact, I see the strengths of this paper in basic science rather than the clinical field. The authors demonstrate the involvement of theta oscillations in the habenula-prefrontal cortex network in emotion processing in the human brain. The potential of theta oscillations to serve as a marker in closed-loop DBS, as put forward by the authors, appears less relevant to me at this stage, given that the clinical effects and side-effects of habenula DBS are not known yet.

      We thank the reviewers for the favourable comments about the implication of our study in basic science and about the value of our study in closing a knowledge gap. We agree that further studies would be required to make conclusions about the clinical effects and side-effects of habenula DBS.

      Detailed comments:

      The group-average MEG power spectrum (Fig. 4B) suggests that negative emotions lead to a sustained theta power increase and a similar effect, though possibly masked by a visual ERP, can be seen in the habenula (Fig. 3C). Yet the statistics identify brief elevations of habenula theta power at around 3 s (which is very late), a brief elevation of prefrontal power at time 0 or even before (Fig. 4C) and a brief elevation of habenula-MEG theta coherence around 1 s. It seems possible that this lack of consistency arises from a low signal-to-noise ratio. The data contain only 27 trials per condition on average and are contaminated by artifacts caused by the extension wires.

      With regard to whether the activity modulation with short latency after stimulus onset is an ERP or an oscillation: we have now investigated this. In summary, after analysing the ERP and removing its influence from the total power spectra, we did not observe any modulation of the ERP related to stimulus emotional valence, and the valence-related modulation in the pure induced (non-phase-locked) power spectra was similar to what we observed in the total power shown in Fig. 3. Therefore, we argue that the theta/alpha increase with negative emotional stimuli that we observed in both the habenula and the prefrontal cortex 0-500 ms after stimulus onset is not dominated by visual or other ERPs.

      With regard to the signal-to-noise ratio from only 27 trials per condition on average per participant: We cleaned the data by removing trials with obvious artefacts, characterised by time-domain measurements exceeding 5 times the standard deviation and by increased activity across all frequency bands in the frequency domain. After removing the trials with artefacts, we have 27 trials per condition per subject on average. We agree that 27 trials per condition on average is not a high number, and increasing the number of trials would further increase the signal-to-noise ratio. However, our studies with EEG recordings and LFP recordings from externalised patients have shown that 30 trials were enough to identify a reduction in the amplitude of post-movement beta oscillations at the beginning of visuomotor adaptation in the motor cortex and STN (Tan et al., 2014a; Tan et al., 2014b). These results of motor-error-related modulation in the post-movement beta have been replicated by other groups. In Tan et al. 2014b, with simultaneous EEG and STN LFP measurements and a similar number of trials (around 30), we also quantified the time-course of STN-motor cortex coherence during voluntary movements. This pattern has also been replicated in a separate study from another group with around 50 trials per participant (Talakoub et al., 2016). In addition, a similar behavioural paradigm (passive figure viewing) has been used in two previous studies with STN LFP recordings from different patient groups (Brucke et al., 2007; Huebl et al., 2014). Both studies used a similar number of trials per condition (around 27) and identified meaningful modulation of STN activity by emotional stimuli. Therefore, we think the number of trials per condition was sufficient to identify emotional valence induced differences in the LFPs in this paradigm.

      We agree that the measurement of coherence can be more susceptible to noise and can suffer from the reduced signal-to-noise ratio in MEG recordings. In Hirschmann et al. 2013, 5 minutes of resting recording and 5 minutes of movement recording from 10 PD patients were used to quantify movement-related changes in STN-cortical coherence and how this was modulated by levodopa (Hirschmann et al., 2013). Litvak et al. (2012) identified movement-related changes in the coherence between STN LFP and motor cortex using simultaneous STN LFP and MEG recordings from 17 PD patients, with 20 trials on average per participant per condition (Litvak et al., 2012). With similar methods, van Wijk et al. (2017) used recordings from 9 patients, with around 29 trials on average per hand per condition, and identified decreases in low-beta cortico-pallidal coherence during movement (van Wijk et al., 2017). The number of trials per condition per participant used in this study is therefore comparable to previous studies.

      The DBS extension wires do reduce the signal-to-noise ratio in the MEG recording. Therefore, the spatiotemporal Signal Space Separation (tSSS) method (Taulu and Simola, 2006) implemented in the MaxFilter software (Elekta Oy, Helsinki, Finland) has been applied in this study to suppress the strong magnetic artifacts caused by the extension wires. This method has been shown to work well in removing magnetic artifacts and movement artifacts from MEG data in our previous studies (Cao et al., 2019; Cao et al., 2020). In addition, the beamforming method proposed by several studies (Litvak et al., 2010; Hirschmann et al., 2011; Litvak et al., 2011) has been used in this study. In Litvak et al., 2010, the artifacts caused by DBS extension wires were described in detail, and beamforming was demonstrated to effectively suppress the artifacts and thereby enable localization of cortical sources coherent with the deep brain nucleus. We have now added more details and these references about the data cleaning and the beamforming method in the main text. With the beamforming method, we did observe the standard movement-related modulation in the beta frequency band in the motor cortex with 9 trials of button-press movements, shown in the following figure for one patient as an example (Figure 5–figure supplement 1). This suggests that the beamforming method did work well to suppress the artefacts and help to localise the source with a low number of trials. The figure on movement-related modulation in the motor cortex in the MEG signals has now been added as a supplementary figure to demonstrate the effect of the beamforming.

      Figure 5–figure supplement 1. (A) Time-frequency maps of MEG activity for right hand button press at sensor level from one participant (Case 8). (B) DICS beamforming source reconstruction of the areas with movement-related oscillation changes in the range of 12-30 Hz. The peak power was located in the left M1 area, MNI coordinate [-37, -12, 43].

      References:

      Tan H, Jenkinson N, Brown P. Dynamic neural correlates of motor error monitoring and adaptation during trial-to-trial learning. Journal of Neuroscience 2014a; 34(16): 5678-88.

      Tan H, Zavala B, Pogosyan A, Ashkan K, Zrinzo L, Foltynie T, et al. Human subthalamic nucleus in movement error detection and its evaluation during visuomotor adaptation. Journal of Neuroscience 2014b; 34(50): 16744-54.

      Talakoub O, Neagu B, Udupa K, Tsang E, Chen R, Popovic MR, et al. Time-course of coherence in the human basal ganglia during voluntary movements. Scientific Reports 2016; 6: 34930.

      Brucke C, Kupsch A, Schneider GH, Hariz MI, Nuttin B, Kopp U, et al. The subthalamic region is activated during valence-related emotional processing in patients with Parkinson's disease. European Journal of Neuroscience 2007; 26(3): 767-74.

      Huebl J, Spitzer B, Brucke C, Schonecker T, Kupsch A, Alesch F, et al. Oscillatory subthalamic nucleus activity is modulated by dopamine during emotional processing in Parkinson's disease. Cortex 2014; 60: 69-81.

      Hirschmann J, Ozkurt TE, Butz M, Homburger M, Elben S, Hartmann CJ, et al. Differential modulation of STN-cortical and cortico-muscular coherence by movement and levodopa in Parkinson's disease. NeuroImage 2013; 68: 203-13.

      Litvak V, Eusebio A, Jha A, Oostenveld R, Barnes G, Foltynie T, et al. Movement-related changes in local and long-range synchronization in Parkinson's disease revealed by simultaneous magnetoencephalography and intracranial recordings. Journal of Neuroscience 2012; 32(31): 10541-53.

      van Wijk BCM, Neumann WJ, Schneider GH, Sander TH, Litvak V, Kuhn AA. Low-beta cortico-pallidal coherence decreases during movement and correlates with overall reaction time. NeuroImage 2017; 159: 1-8.

      Taulu S, Simola J. Spatiotemporal signal space separation method for rejecting nearby interference in MEG measurements. Physics in Medicine and Biology 2006; 51(7): 1759-68.

      Cao C, Huang P, Wang T, Zhan S, Liu W, Pan Y, et al. Cortico-subthalamic Coherence in a Patient With Dystonia Induced by Chorea-Acanthocytosis: A Case Report. Frontiers in Human Neuroscience 2019; 13: 163.

      Cao C, Li D, Zhan S, Zhang C, Sun B, Litvak V. L-dopa treatment increases oscillatory power in the motor cortex of Parkinson's disease patients. NeuroImage Clinical 2020; 26: 102255.

      Litvak V, Eusebio A, Jha A, Oostenveld R, Barnes GR, Penny WD, et al. Optimized beamforming for simultaneous MEG and intracranial local field potential recordings in deep brain stimulation patients. NeuroImage 2010; 50(4): 1578-88.

      Litvak V, Jha A, Eusebio A, Oostenveld R, Foltynie T, Limousin P, et al. Resting oscillatory cortico-subthalamic connectivity in patients with Parkinson's disease. Brain 2011; 134(Pt 2): 359-74.

      Hirschmann J, Ozkurt TE, Butz M, Homburger M, Elben S, Hartmann CJ, et al. Distinct oscillatory STN-cortical loops revealed by simultaneous MEG and local field potential recordings in patients with Parkinson's disease. NeuroImage 2011; 55(3): 1159-68.

      I doubt that the correlation between habenula power and habenula-MEG coherence (Fig. 6C) is informative of emotion processing. First, power and coherence in close-by time windows are likely to be correlated irrespective of the task/stimuli. Second, if meaningful, one would expect the strongest correlation for the negative condition, as this is the only condition with an increase of theta coherence and a subsequent increase of theta power in the habenula. This, however, does not appear to be the case.

      The authors included the factors valence and arousal in their linear model and found that only valence correlated with electrophysiological effects. I suspect that arousal and valence scores are highly correlated. When fed with informative yet highly correlated variables, the significance of individual input variables becomes difficult to assess in many statistical models. Hence, I am not convinced that valence matters but arousal not.

      For the correlation shown in Fig. 6C, we used linear mixed-effect modelling (‘fitlme’ in Matlab) with the recorded subjects as random effects to investigate the correlation between the habenula power and the habenula-MEG coherence in an earlier window, while considering all trials together. Therefore the reported values in the main text and in the figure (k = 0.2434 ± 0.1031, p = 0.0226, R² = 0.104) show the within-subject correlation that is consistent across all measured subjects. The correlation is likely to be mediated by the emotional valence condition, as negative emotional stimuli tend to be associated with both high habenula-MEG coherence and high theta power in the later time window.

      The arousal scores are significantly different for the three valence conditions, as shown in Fig. 1B. However, the arousal scores and the valence scores are not monotonically correlated, as shown in the following figure (Fig. S2). The emotionally neutral figures have the lowest arousal value, but their valence value sits between those of the negative and the positive figures. We have now added the following sentence in the main text:

      "This nonlinear and non-monotonic relationship between arousal scores and the emotional valence scores allowed us to differentiate the effect of the valence from arousal."

      Table 2 in the main text shows the results of the linear mixed-effect modelling with the neural signal as the dependent variable and the valence and arousal scores as independent variables. Because of the non-linear and non-monotonic relationship between the valence and arousal scores, we think the significance of individual input variables is valid in this statistical model. We have now added a new figure (shown below, Fig. 7) with scatter plots showing the relationship between the electrophysiological signal and the arousal and emotional valence scores separately, using Spearman’s partial correlation analysis. In each scatter plot, each dot indicates the average measurement from one participant in one emotional valence condition. As shown in the following figure, the electrophysiological measurements correlated linearly with the valence scores, but not with the arousal scores. However, the statistics reported in this figure consider all the dots together, whereas the linear mixed-effect modelling takes into account the interdependency of the measurements from the same participant. The results reported in the main text using linear mixed-effect modelling are therefore statistically more valid, but the figure below illustrates the relationship.

      Figure S2. (A) Averaged valence and arousal ratings (mean ± SD) for figures of the three emotional conditions. (B) Scatter plots showing the relationship between arousal and valence scores for each emotional condition for each participant.

      Figure 7. Scatter plots showing how the early theta/alpha band power increase in the frontal cortex (A), the theta/alpha band frontal cortex-habenula coherence (B) and the theta band power increase in the habenula (C) changed with emotional valence (left column) and arousal (right column). Each dot shows the average of one participant in each categorical valence condition; these averages are also the source data of the multilevel modelling results presented in Table 2. The R and p values in the figure are the results of partial correlation considering all data points together.

      Page 8: "The time-varying coherence was calculated for each trial". This is confusing because coherence quantifies the stability of a phase difference over time, i.e. it is a temporal average, not defined for individual trials. It has also been used to describe the phase difference stability over trials rather than time, and I assume this is the method applied here. Typically, the greatest coherence values coincide with event-related power increases, which is why I am surprised to see maximum coherence at 1s rather than immediately post-stimulus.

      We thank the reviewer for pointing out this incorrect description. As the reviewer correctly pointed out, the method we used describes the phase difference stability over trials rather than over time. We have now clarified how coherence was calculated and added more details in the methods:

      "The time-varying cross trial coherence between each MEG sensor and the habenula LFP was first calculated for each emotional valence condition. For this, time-frequency auto- and cross-spectral densities in the theta/alpha frequency band (5-10 Hz) between the habenula LFP and each MEG channel at sensor level were calculated using the wavelet transform-based approach from -2000 to 4000 ms for each trial with 1 Hz steps using the Morlet wavelet and cycle number of 6. Cross-trial coherence spectra for each LFP-MEG channel combination was calculated for each emotional valence condition for each habenula using the function ‘ft_connectivityanalysis’ in Fieldtrip (version 20170628). Stimulus-related changes in coherence were assessed by expressing the time-resolved coherence spectra as a percentage change compared to the average value in the -2000 to -200 ms (pre-stimulus) time window for each frequency."

      In the Morlet wavelet analysis we used here, the cycle number (C) determines the temporal resolution and frequency resolution for each frequency (F). The spectral bandwidth at a given frequency F is equal to 2F/C, while the wavelet duration is equal to C/(F·π). We used a cycle number of 6. For theta band activity around 5 Hz, this gives a spectral bandwidth of 2 × 5/6 ≈ 1.7 Hz and a wavelet duration of 6/(5·π) ≈ 0.38 s = 380 ms.
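      As a quick check of the arithmetic above, the following minimal sketch evaluates the two expressions quoted in the text (2F/C for the spectral bandwidth and C/(F·π) for the wavelet duration). The function name and the extra example frequencies are illustrative assumptions, not part of the analysis pipeline.

```python
import math

def morlet_resolution(freq_hz, n_cycles):
    """Return (spectral bandwidth in Hz, wavelet duration in s).

    Uses the expressions quoted in the text: bandwidth = 2F/C and
    duration = C/(F*pi). The function name is hypothetical.
    """
    bandwidth_hz = 2.0 * freq_hz / n_cycles
    duration_s = n_cycles / (freq_hz * math.pi)
    return bandwidth_hz, duration_s

# Cycle number of 6, as stated in the text; the frequencies span the 5-10 Hz band.
for freq in (5, 8, 10):
    bandwidth, duration = morlet_resolution(freq, 6)
    print(f"{freq} Hz: bandwidth {bandwidth:.2f} Hz, duration {duration * 1000:.0f} ms")
```

      With a fixed cycle number, the wavelet becomes shorter and the bandwidth wider as the frequency increases, so the temporal smearing is largest at the low end of the theta/alpha band.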

      As the reviewer noticed, we observed increased activities across a wide frequency band in both habenula and the prefrontal cortex within 500 ms after stimuli onset. But the increase of cross-trial coherence starts at around 300 ms. The increase of coherence in a time window without increase of power in either of the two structures indicates a phase difference stability across trials in the oscillatory activities from the two regions, and this phase difference stability across trials was not secondary to power increase.
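      The point that cross-trial coherence reflects the stability of the phase difference across trials, independently of signal amplitude, can be illustrated with a small sketch. It uses the textbook definition of coherence at a single time-frequency point (magnitude of the trial-summed cross-spectrum, normalised by the auto-spectra) and is not a reproduction of the FieldTrip implementation; the function names and the toy values are assumptions made for illustration.

```python
import cmath
import math

def cross_trial_coherence(x_trials, y_trials):
    """Coherence at one time-frequency point from per-trial complex coefficients."""
    cross = sum(x * y.conjugate() for x, y in zip(x_trials, y_trials))
    auto_x = sum(abs(x) ** 2 for x in x_trials)
    auto_y = sum(abs(y) ** 2 for y in y_trials)
    return abs(cross) / math.sqrt(auto_x * auto_y)

def percent_change(values, baseline_values):
    """Express values as a percentage change relative to the baseline mean."""
    base = sum(baseline_values) / len(baseline_values)
    return [100.0 * (v - base) / base for v in values]

# Toy example: the absolute phase varies from trial to trial, but the
# phase difference between the two signals is fixed at 0.5 rad.
phases = [0.1, 1.3, 2.9, 4.0, 5.5]
lfp = [cmath.rect(1.0, p) for p in phases]
meg = [cmath.rect(2.0, p - 0.5) for p in phases]
print(cross_trial_coherence(lfp, meg))  # close to 1 despite the different amplitudes

# Phase difference alternating between 0 and pi: no stable relationship.
meg_unstable = [cmath.rect(1.0, p + (math.pi if i % 2 else 0.0))
                for i, p in enumerate(phases[:4])]
print(cross_trial_coherence(lfp[:4], meg_unstable))  # close to 0
```

      In this toy case, scaling either signal leaves the coherence unchanged, which is why an increase in coherence does not by itself require an increase in power.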

      Reviewer #3 (Public Review):

      This paper describes the oscillatory activity of the habenula using local field potentials, both within the region and, through the use of MEG, in connection to the prefrontal cortex. The characteristics of this activity were found to vary with the emotional valence but not with arousal. Shedding light on this is relevant, because the habenula is a promising target for deep brain stimulation.

      In general, because I am not much on top of the literature on the habenula, I find it difficult to judge the novelty and the impact of this study. What I can say is that I do find the paper well-written and very clear; and the methods, although quite basic (which is not bad), are sound and rigorous.

      We thank the reviewer for the positive comments about the potential implication of our study and on the methods we used.

      On the less positive side, even though I am aware that in this type of study it is difficult to have a high N, the very low N in this case makes me worry about the robustness and replicability of the results. I'm sure I have missed it and it's specified somewhere, but why is N different for the different figures? Is it because only 8 people had MEG? The number of trials also seems somewhat low. Therefore, I feel the authors perhaps need to make an effort to make up for the small number of subjects in order to add confidence to the results. I would strongly recommend bootstrapping the statistical analysis and extracting non-parametric confidence intervals instead of showing parametric standard errors whenever appropriate. When doing that, it must be taken into account that each pair of habenulae belongs to the same person; i.e. one bootstraps the subjects, not the habenulae.

      We do understand and appreciate the concern of the reviewer on the low sample numbers due to the strict recruitment criteria for this very early stage clinical trial: 9 patients for bilateral habenula LFPs, and 8 patients with good quality MEGs. Some information to justify the number of trials per condition for each participant has been provided in the reply to the Detailed Comments 1 from Reviewer 2. The sample number used in each analysis was included in the figures and in the main text.

      We have used a non-parametric cluster-based permutation approach (Maris and Oostenveld, 2007) for all the main results, as shown in Fig. 3-5. Once the clusters (time window and frequency band) with significant differences between emotional valence conditions had been identified, a parametric statistical test was applied to the average values of the clusters to show the direction of the difference. These parametric statistics are secondary to the main non-parametric permutation test.

      In addition, the DICS beamforming method was applied to localize cortical sources exhibiting stimuli-related power changes and cortical sources coherent with deep brain LFPs for each subject for positive and negative emotional valence conditions respectively. After source analysis, source statistics over subjects was performed. Non-parametric permutation testing with or without cluster-based correction for multiple comparisons was applied to statistically quantify the differences in cortical power source or coherence source between negative and positive emotional stimuli.
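      To illustrate the logic of non-parametric testing over a small number of subjects, the sketch below runs an exact two-sided sign-flip permutation test on one paired difference per subject. This is a deliberately simplified stand-in: it is not the cluster-based procedure of Maris and Oostenveld (2007) and does not correct for multiple comparisons. The function name and the example values are hypothetical.

```python
from itertools import product

def sign_flip_p_value(differences):
    """Exact two-sided sign-flip permutation p-value for paired differences.

    Under the null hypothesis, the sign of each subject's difference
    (e.g. negative minus positive condition) is exchangeable, so all
    2**n sign patterns are enumerated.
    """
    observed = abs(sum(differences))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(differences)):
        permuted = abs(sum(s * d for s, d in zip(signs, differences)))
        n_total += 1
        # Small tolerance so that ties are counted despite rounding.
        if permuted >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total

# Hypothetical per-subject differences for 8 subjects (not real data).
diffs = [1.0, 2.0, 1.5, 0.5, 2.5, 1.0, 3.0, 0.5]
print(sign_flip_p_value(diffs))
```

      With 8 subjects there are 2^8 = 256 sign patterns, so the smallest attainable two-sided p-value is 2/256 ≈ 0.008. This is one concrete way to see how a small N limits the resolution of permutation statistics.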

      References:

      Maris E, Oostenveld R. Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods 2007; 164(1): 177-90.

      Related to this point, the results in Figure 6 seem quite noisy, because interactions (i.e. coherence) are harder to estimate and N is low. For example, I have to make an effort of optimism to believe that Fig 6A is not just noise, and the result in Fig 6C is also a bit weak and perhaps driven by the blue point at the bottom. My read is that the authors didn't do permutation testing here, and just a parametric linear-mixed effect testing. I believe the authors should embed this into permutation testing to make sure that the extremes are not driving the current p-value.

      We have now quantified the coherence between frontal cortex-habenula and occipital cortex-habenula separately (please see more details in the reply to Reviewer 2, Recommendations for the authors 6). The new analysis showed that the increase in the theta/alpha band coherence around 1 s after the negative stimuli was only observed between prefrontal cortex-habenula and not between occipital cortex-habenula. This supports the argument that Fig. 6A is not just noise.

    1. Author Response

      Reviewer #1:

      Köster and colleagues present a brief report in which they study, in 9-month-old babies, the electrophysiological responses to expected and unexpected events. The major finding is that in addition to a known ERP response, an NC present between 400-600 ms, they observe a differential effect in theta oscillations. The latter is a novel result and it is linked to the known properties of theta oscillations in learning. This is a nice study, with novel results and well presented. My major reservation, however, concerns the push the authors make for the novelty of the results and their interpretation as reflecting brain dynamics and rhythms. The reason is that any ERP, passed through the lens of a wavelet/FFT etc., will yield a response at a particular frequency. This is especially the case for families of ERP responses related to unexpected events (e.g., MMR, NC), for which there is plenty of literature linking them to responses to surprising events, in particular in babies, and which, given their timing, will be reflected in delta/theta oscillations. The reason why I am pressing on this issue is that there is an old, but still ongoing, debate attempting to dissociate intrinsic brain dynamics from simple event-related responses. This is by no means trivial and I certainly do not expect the authors to resolve it, yet I would expect the authors to be careful in their interpretation, to warn the reader that the result could just reflect the known ERP, to avoid introducing confusion in the field.

      We would like to thank the reviewer for highlighting the novelty of the results. Critically, there is one fundamental difference between investigating the ERP response and the trial-wise oscillatory power, which we have done in the present analysis: when looking at the evoked oscillatory response (i.e., the TF characteristics of the ERP), the signal is averaged over trials first and then subjected to a wavelet transform. However, when looking at the ongoing (or total) oscillatory response, the wavelet transform is applied at the level of the single trial, before the TF response of the single trials is averaged across the trials of one condition (for a classical illustration, see Tallon-Baudry & Bertrand, 1999; TICS, Box 2). We have now made this distinction more salient throughout the manuscript.
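      The difference in the order of operations can be made concrete with a toy sketch. Here a projection onto a single complex exponential stands in for the wavelet transform (an assumption made to keep the example self-contained), and the simulated trials contain an oscillation whose phase varies from trial to trial, i.e. one that is not phase-locked to the stimulus. All names and values are hypothetical.

```python
import cmath
import math

def power_at(signal, freq_hz, sfreq_hz):
    """Power of the projection onto a complex exponential (crude wavelet stand-in)."""
    n = len(signal)
    coef = sum(s * cmath.exp(-2j * math.pi * freq_hz * t / sfreq_hz)
               for t, s in enumerate(signal)) / n
    return abs(coef) ** 2

def evoked_power(trials, freq_hz, sfreq_hz):
    """Average over trials first (the ERP), then transform."""
    n_trials = len(trials)
    erp = [sum(samples) / n_trials for samples in zip(*trials)]
    return power_at(erp, freq_hz, sfreq_hz)

def total_power(trials, freq_hz, sfreq_hz):
    """Transform each single trial first, then average the power over trials."""
    return sum(power_at(trial, freq_hz, sfreq_hz) for trial in trials) / len(trials)

# Toy data: a 5 Hz oscillation sampled at 100 Hz for 1 s, with a different
# phase on every trial (phases evenly spread around the circle).
sfreq, freq, n_samples, n_trials = 100, 5, 100, 8
trials = [[math.cos(2 * math.pi * freq * t / sfreq + 2 * math.pi * k / n_trials)
           for t in range(n_samples)]
          for k in range(n_trials)]

print("evoked:", evoked_power(trials, freq, sfreq))  # essentially zero
print("total: ", total_power(trials, freq, sfreq))   # about 0.25
```

      Because the phases cancel in the trial average, the oscillation is invisible in the evoked response yet fully present in the total power, which is the sense in which the two analyses can dissociate.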

      In the present study, the results did not suggest a relation between the ERP and the ongoing theta activity, because the topography, temporal evolution, and polarity of the ERP and the theta response were very dissimilar: looking at Figure 2 (A and B) and Figure 3 (B and C), the Nc peaks at central electrodes, but the theta response is more distributed, and the expected versus unexpected difference in the Nc was specific to the .4 to .6 s time window, whereas the theta difference lasted the whole trial. Furthermore, the Nc was higher for expected versus unexpected, which should (due to the low frequency) rather lead to a higher theta power for unexpected, in contrast to expected events for the time frequency analysis for the Nc. To verify this intuition, we now ran a wavelet analysis on the evoked response (i.e., the ERP) and, for a direct comparison, also plotted the ongoing oscillatory response for the central electrodes (see Additional Figure 1). These additional analyses nicely illustrate that the trial-wise theta response provides a fundamentally different approach to analyzing oscillatory brain dynamics.

      Because this is likely of interest to many readers, we also report the results of the wavelet analysis of the ERP versus the analysis of the ongoing theta activity at central electrodes and the corresponding statistics in the result section, and have also included the Additional Figure in the supplementary materials, as Figure S2.

      Additional Figure 1. Comparison of the topography and time course for the 4 – 5 Hz activity for the evoked (A, B) and the ongoing (C, D) oscillatory response at central electrodes (400 – 600 ms; Cz, C3, C4; baseline: -100 – 0 ms). (A) Topography for the difference between unexpected and expected events in the evoked oscillatory response. (B) The corresponding time course at central electrodes, which did not reveal a significant difference between 400 – 600 ms, t(35) = 1.57, p = .126. (C) Topography for the same contrast in the ongoing oscillatory response and (D) the corresponding time course at central electrodes, which did likewise not reveal a significant difference between 400 – 600 ms, t(35) = -1.26, p = .218. The condition effects (unexpected - expected) were not correlated between the evoked and the ongoing response, r = .23, p = .169.

      A second aspect that I would like the authors to comment on is the power of the experimental design to measure surprise. From the methods, I gathered that the same stimulus materials were presented, with the same frequency, as expected and unexpected endings. If that is the case, what is the measure of surprise? First, the same materials are shown repeatedly, causing habituation and reducing novelty; second, the experiment introduces a long-term expectation of a 50:50 proportion of expected/unexpected events. I might be missing something here, which is likely as the methods are quite sparse in the description of what was actually done.

      We used 4 different stimulus types (variants) in each of the 4 different domains, each with either an expected or an unexpected outcome. This resulted in 32 distinct stimulus sequences, which we presented twice, resulting in (up to) 64 trials. We have now described this approach and design in more detail and have also included all stimuli as supplementary material (Figure S1). In particular, we used multiple types in each domain to reduce potential habituation or expectation effects. Still, we agree that one difficulty may be that, over time, infants got used to the fact that expected and unexpected outcomes were to be similarly “expected” (i.e., 50:50). However, if this were the case, it would have resulted in a reduction (or disappearance) of the condition effect and would thus reduce the condition difference that we found, rather than provide an alternative explanation for it. We have now included this consideration in the method section (p. 7).
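      The trial counts described above can be checked with a short enumeration. This is only an illustrative sketch: the domain names are taken from this response, while the variant labels (`v1` to `v4`) are hypothetical placeholders and not the actual stimulus names.

```python
from itertools import product

# Domain names as listed in the response; variant labels are hypothetical.
DOMAINS = ["action", "solidity", "cohesion", "number"]
VARIANTS = ["v1", "v2", "v3", "v4"]
OUTCOMES = ["expected", "unexpected"]
REPETITIONS = 2  # each sequence was presented twice

# Every combination of domain, variant, and outcome is one stimulus sequence.
sequences = list(product(DOMAINS, VARIANTS, OUTCOMES))
trials = sequences * REPETITIONS

n_sequences = len(sequences)
n_trials = len(trials)
share_unexpected = sum(t[2] == "unexpected" for t in trials) / n_trials

print(n_sequences, n_trials, share_unexpected)
```

      The enumeration reproduces the 32 sequences and (up to) 64 trials, and makes the 50:50 proportion of expected and unexpected outcomes explicit.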

      Two more comments concerning the analysis choices:

      1) The statistics for the ERP and the TF could be reported using a cluster size correction. These are well established statistical methods in the field which would enable to identify the time window/topography that maximally distinguished between the expected and the unexpected condition both for ERP and TF. Along the same lines, the authors could report the spatial correlation of the ERP/TF effects.

      For the ERP analysis we used the standard electrodes typically analyzed for the Nc in order to replicate effects found in former research (Langeloh et al., 2020; see also, Kayhan et al., 2019; Reynolds and Richards, 2005; Webb et al., 2005). For the TF analyses we used the most conservative criterion, namely all scalp-recorded electrodes and the whole time window from 0 to 2000 ms, such that we did not make any choice regarding the time window or the electrodes (i.e., a choice that would need to be corrected against other possible choices). We have now made those choices clearer in the method section and explain why we think that, under these conditions, a multiple comparison correction is not needed or applicable (p. 10). Regarding the spatial correlation of the ERP and TF effects, we explained in response to the first comment the very different nature of the TF decomposition of the ERP and of the ongoing oscillatory activity, and also that these were found to be independent (i.e., uncorrelated). We hope that, with the additional analysis included in response to this comment, this difference is much clearer now.

      2) While I can see the reason why the authors chose to keep the baseline the same between the ERP and the TF analysis, for time-frequency analysis it would be advisable to use a baseline amounting to a comparable time to the frequency of interest, and to use a period that does not encroach on the period of interest; i.e., with a wavelet = 7 and a baseline -100:0 the authors are well into the period of interest.

      The difficulty in choosing the baseline in the present study was two-fold. First, we were interested in the ERP and the change in neural oscillations upon the onset of an outcome picture within a continuous presentation of pictures, forming a sequence. Second, we wanted to use a similar baseline for both analyses, to make them comparable. Because the second picture (the picture before the outcome picture) also elicited both an ERP and an oscillatory response at ~ 4 Hz (see Additional Figure 2), we chose a baseline just before the onset of the outcome stimulus, from -100 to 0 ms. We agree that a longer and earlier baseline would have been preferable, in particular for the TF results, but we still consider -100 to 0 ms the best choice for the present analysis. Notably, because we found an increase in theta oscillations and the critical difference relies on a higher theta rhythm in one condition compared to the other, the increase in theta, if it affected the baseline, would counteract rather than increase the current effect. We now explain this choice in more detail (p. 10).

      Additional Figure 2. Display of the grand mean signals prior to the -100 to 0 ms baseline and outcome stimulus. (A) The time-frequency response across all scalp-recorded electrodes, as well as (B) the ERP at the central electrodes (Cz, C3, C4) across both conditions show a similar response to the second picture as to the outcome picture. Thus, a baseline just prior to the stimulus of interest was chosen, consistent for both analyses.

      Reviewer #2:

      The manuscript reports increases in theta power and lower NC amplitude in response to unexpected (vs. expected) events in 9-month-olds. The authors state that the observed increase in theta power is significant because it is in line with an existing theory that the theta rhythm is involved in learning in mammals. The topic is timely, the results are novel, the sample size is solid, the methods are sound as far as I can tell, and the use of event types spanning multiple domains (e.g. action, number, solidity) is a strength. The manuscript is short, well-written, and easy to follow.

      1) The current version of the manuscript states that the reported findings demonstrate that the theta rhythm is involved in processing of prediction error and supports the processing of unexpected events in 9-month-old infants. However, what is strictly shown is that watching at least some types of unexpected events enhance theta rhythm in 9-month-old infants, i.e. an increase in the theta rhythm is associated with processing unexpected events in infants, which suggests that an increase in the theta rhythm is a possible neural correlate of prediction error in this age range. While the present novel findings are certainly suggestive, more data and/or analyses would be needed to corroborate/confirm the role of the observed infant theta rhythm in processing prediction error, or document whether and how this increase in the theta rhythm supports the processing of unexpected events in infants. (As an example, since eye-tracking data were collected, are trial-by-trial variations in theta power increases to unexpected outcomes related to how long individual infants looked to the unexpected outcome pictures?) If it is not possible to further confirm/corroborate the role of the theta rhythm with this dataset, then the discussion, abstract, and title should be revised to more closely reflect what the current data shows (as the wording of the conclusion currently does), and clarify how future research may test the hypothesis that the infant theta rhythm directly supports the processing of prediction error in response to unexpected events.

      We would like to thank the reviewer for acknowledging the merit of the present research.

      On the one hand, we have revised our manuscript and are now somewhat more careful with our conclusion, in particular with regard to the refinement of basic expectations. On the other hand, we consider the concept of “violation of expectation” (VOE), which is one of the most widely used concepts in infancy research, very closely linked to the concept of prediction error processing, namely that a predictive model is violated. In particular, we have made this conceptual link in a recent theoretical paper (Köster et al., 2020), based on former theoretical considerations about the link between these two concepts (e.g., see Schubotz 2015; Prediction and Expectation). In the present study we used a set of four different domains of violation of expectation paradigms, which are among the best established domains of infants’ core knowledge (e.g., action, solidity, cohesion, number; cf. Spelke & Kinzler, 2007). It was our specific goal not to replicate, for another time, that infants possess expectations (i.e., make predictions) in these domains, but to “flip the coin around” and investigate infants’ prediction error more generally, independent of the specific domain. We have now made the conceptual link between VOE and prediction error processing more explicit in the introduction of the manuscript and also emphasize that we chose a variety of domains to obtain a more general neural marker for infant processing of prediction errors.

      Having said this, we indeed planned to assess and compare both infants’ gaze behavior and EEG response. Unfortunately, this was not very successful and the concurrent recording only worked for a limited number of infants and trials. This led us to the decision to make the eye-tracking study a companion study and to collect more eye-tracking data in an independent sample of infants after the EEG assessment was completed, such that a match between the two measures was not feasible. We now make this choice more explicit in the method section (p. 7). In addition, contrary to our basic assumption, we did not find an effect in the looking time measure; namely, there was no difference between expected and unexpected outcomes. We assume that this is due to the specificities of the current design, which was optimized for EEG assessments: We used a high number of repetitions (64), with highly variable domains (4), and restricted the time window for potential looking time effects to 5 seconds, which is highly uncommon in the field and therefore not directly comparable with former studies.

      Finally, besides the ample evidence from former studies using VOE paradigms, if it were not the unexpected vs. expected (i.e., unpredicted vs. predicted) condition contrast which explains the differences we found in the ERP and the theta response, there would need to be an alternative explanation for the differential responses in the EEG, which produce the hypothesized effects. (Please also note that there are many studies that base their VOE assumption on ERPs alone; here we have two independent measures suggesting that infants discriminated between those conditions.)

      2) The current version of the manuscript states "The ERP effect was somewhat consistent across conditions, but the effect was mainly driven by the differences between expected and unexpected events in the action and the number domain (Figure S1). The results were more consistent across domains for the condition difference in the 4 - 5 Hz activity, with a peak in the unexpected-expected difference falling in the 4 - 5 Hz range across all electrodes (Figure S2)". However, the similarity/dissimilarity of NC and theta activity responses across domains was not quantified or tested. Looking at Figures S1 and S2, it is not that obvious to me that theta responses were more consistent across domains than NC responses. I understand that there were too few trials to formally test for any effect of domain (action, number, solidity, cohesion) on NC and theta responses, either alone or in interaction with outcome (expected, unexpected). It may still be possible to test for correlations of the topography and time-course of the individual average unexpected-expected difference in NC and theta responses across domains at the group level, or to test for an effect of outcome (expected, unexpected) in individual domains for subgroups of infants who contributed enough trials. Alternatively, claims of consistency across domains may be altered throughout, in which case the inability to test whether the theta and/or NC signatures of unexpected event processing found are consistent across domains (vs. driven by some domains) should be acknowledged as a limitation of the present study.

      We agree that this statement rather reflected our intuition and would not withstand statistical analysis given the low number of trials. We are therefore happy to refrain from this claim; we simply refer the interested reader to the supplementary material and mention this as a perspective for future research in the discussion (p. 12; p. 15).

      As outlined in our previous response, it was also not our goal to draw conclusions about each single domain, but rather to present a diversity of stimulus types from different core knowledge domains to gain a more generalized neural marker for infants’ processing of unexpected, i.e., unpredicted events.

      Reviewer #3:

      General assessment:

      In this manuscript, the authors bring up a contemporary and relevant topic in the field, i.e. theta rhythm as a potential biomarker for prediction error in infancy. Currently, the literature is rich on discussions about how, and why, theta oscillations in infancy implement the different cognitive processes to which they have been linked. Investigating the research questions presented in this manuscript could therefore contribute to fill these gaps and improve our understanding of infants' neural oscillations and learning mechanisms. While we appreciate the motivation behind the study and the potential in the authors' research aim, we find that the experimental design, analyses and conclusions based on the results that can be drawn thereafter, lack sufficient novelty and are partly problematic in their description and implementation. Below, we list our major concerns in more detail, and make suggestions for improvements of the current analyses and manuscript.

      Summary of major concerns:

      1) Novelty:

      (a) It is unclear how the study differs from Berger et al., 2006 apart from additional conditions. Please describe this study in more detail and how your study extends beyond it.

      We would like to thank the reviewers for emphasizing the timeliness and relevance of the study.

      The critical difference between the present study and the study by Berger et al. 2006 was that the authors applied, as far as we understand this from Figure 4 and the method section of their study, the wavelet analysis to the ERP signal. In contrast, in the present study, we applied the wavelet analysis at the level of single trials. We now explain the difference between the two signals in more detail in the revised manuscript and also included an additional comparison between the evoked (i.e., ERP) and the ongoing (i.e., total) oscillatory response (for more details, please see the first response to the first comment of reviewer 1).
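      The distinction between the evoked and the ongoing (total) oscillatory response can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the analysis pipeline used in the study: the sampling rate, trial count, and 7-cycle Morlet wavelet are illustrative choices, and the simulated 4 Hz oscillation has a different phase on every trial (spread evenly over the cycle as a deterministic stand-in for phase jitter).

```python
import cmath
import math

FS = 250         # sampling rate in Hz (assumed for illustration)
FREQ = 4.0       # frequency of interest in Hz
N_CYCLES = 7     # wavelet width (assumed)
N_TRIALS = 40    # number of simulated trials (assumed)
N_SAMPLES = 500  # 2 s of signal at 250 Hz

def wavelet_power(signal, center):
    """Power at FREQ from a complex Morlet wavelet centered on sample `center`."""
    sigma_t = N_CYCLES / (2 * math.pi * FREQ)  # temporal SD of the Gaussian
    half = int(3 * sigma_t * FS)               # truncate at +/- 3 SD
    acc, norm = 0j, 0.0
    for k in range(-half, half + 1):
        t = k / FS
        w = math.exp(-t * t / (2 * sigma_t ** 2))
        acc += signal[center + k] * w * cmath.exp(-2j * math.pi * FREQ * t)
        norm += w
    return abs(acc / norm) ** 2

# Simulated trials: a 4 Hz oscillation that is NOT phase-locked across trials.
trials = [
    [math.cos(2 * math.pi * FREQ * n / FS + 2 * math.pi * i / N_TRIALS)
     for n in range(N_SAMPLES)]
    for i in range(N_TRIALS)
]
center = N_SAMPLES // 2

# Evoked response: average across trials first (the ERP), then decompose.
erp = [sum(trial[n] for trial in trials) / N_TRIALS for n in range(N_SAMPLES)]
evoked_power = wavelet_power(erp, center)

# Total (ongoing) response: decompose every single trial, then average power.
total_power = sum(wavelet_power(trial, center) for trial in trials) / N_TRIALS

print(f"evoked: {evoked_power:.6f}, total: {total_power:.3f}")
```

      In this toy case the oscillation cancels almost completely in the averaged signal, so the evoked power is close to zero, whereas the single-trial decomposition retains it. Activity that is phase-locked to the stimulus would appear in both measures.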

      (b) Seemingly innovative aspects (as listed below), which could make the study stand out among previous literature, but are ultimately not examined. Consequently, it is also not clear why they are included.

      -Relation between Nc component and theta.

      -Consistency of the effect across different core knowledge domains.

      -Consistency of the effect across the social and non-social domains.

      -Link between infants' looking time behavior and theta.

      We are thankful for these suggestions, which are closely related to the points raised by reviewer 1 and 2. With regard to the relation between the Nc and the theta response, we have now included a direct comparison of these signals (see Additional Figure 1, i.e., novel Figure S2; for details, please see the first response to the first comment of reviewer 1). Regarding the consistency of effects across domains, we have explained in response to point 1 by reviewer 2 that this was not the specific purpose of the present study, but we aimed at using a diversity of VOE stimuli to obtain a more general neural signature for infants’ prediction error processing, and explain this in more detail in the revised manuscript. Having said this, we agree that the question of consistency of effects between conditions is highly interesting, but we would not consider the data robust enough to confidently test these differences given the limited number of trials available per stimulus category. We now discuss this as a direction for future research (p. 15). Finally, we also agree with regard to the link between looking times and the theta rhythm. As also outlined in response to point 1 by reviewer 2 (paragraph 2), we initially had this plan, but did not succeed in obtaining a satisfactory number of trials in the dual recording of EEG and eye-tracking, which made us change these plans. This is now explained in detail in the method section (p. 7).

      (c) The reason to expect (or not) a difference at this age, compared to what is known from adult neural processing, is not adequately explained.

      -Potentially because of neural generators in mid/pre-frontal cortex? See Lines 144-146.

      The overall aim of the present study was to identify the neural signature for prediction error processing in the infant brain, which has, to the best of our knowledge, not been done this explicitly and with a focus on the ongoing theta activity and across a variety of violations in infants’ core knowledge domains. Because we did not expect a specific topography of this effect, in particular across multiple domains, we included all electrodes in the analyses. We have now clarified this in the method section (p. 10).

      (d) The study is not sufficiently embedded in previous developmental literature on the functionality of theta. That is, consider theta's role in error processing, but also the increase of theta over time of an experiment and its link to cognitive development. See, for example: Braithwaite et al., 2020; Conejero et al., 2018; Adam et al., 2020.

      We are thankful that the reviewer indicated these works and have now included them in the introduction and discussion. Closest to the present study is the study by Conejero et al., 2018. However, this study is also based on theta analyses of the ERP, not of the ongoing oscillatory response and it includes considerably older infants (i.e., 16-month-olds instead of 9-month-olds as in the present study).

      2) Methodology:

      (a) Design: It is unclear what exactly a testing session entails.

      -Was the outcome picture always presented for 5secs? The methods section suggests that, but the introduction of the design and Figure 1 do not. This might be misleading. Please change in Figure 1 to 5sec if applicable.

      Yes, the final images were shown for 5s in order to simultaneously assess infants’ looking times. However, we included trials in the EEG analysis if infants looked for 2s, so this is the more relevant info for the analysis. We now clarified this in the method section (p. 7) and have also added this info in the figure caption.

      -Were infants' eye-movements tracked simultaneously to the EEG recording? If so, please present findings on their looking time and (if possible) pupil size. Also examine the relation to theta power. This would enhance the novelty and tie these findings to the larger looking time literature that the authors refer to in their introduction.

      Yes, in response to the second reviewer (comment 1) we explained in more detail why the joint analysis of the EEG and looking time data was not possible: We planned to assess both infants’ gaze behavior and EEG response. Unfortunately, this was not very successful and the dual recording only worked for a few infants and trials. This led us to collect more eye-tracking data after the EEG assessment was completed, such that a match between the two measures was not feasible. We have now clarified this in the method section (p. 7).

      (b) Analysis:

      -In terms of extracting theta power information: The baseline of 100ms is extremely short for a comparison in the frequency domain, since it does not even contain half a cycle of the frequency of interest, i.e. 4Hz. We appreciate the thought to keep the baseline the same as in the ERP analysis (which currently is hardly focused on in the manuscript), but it appears problematic for the theta analysis. Also, if we understand the spectral analysis correctly, the window the authors are using to estimate their spectral estimates is largely overlapping between baseline and experimental window. The question arises whether a baseline is even needed here, or if a direct contrast between conditions might be better suited.

      Please see our explanation about the choice of the baseline in our response to reviewer 1, comment 2. Because our stimulus sequences were highly variable, likely leading to highly variable overall theta activity, and our specific interest was in the change in theta activity upon the onset of the unexpected versus expected outcome, we still consider it useful to take a baseline here, also because this makes the study more closely comparable to the existing literature. We have now clarified this in the method section (p. 9).
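      The arithmetic behind the baseline concern can be made explicit. The sketch below assumes a 7-cycle Morlet wavelet (taken from reviewer 1's remark "wavelet = 7") and one common convention for its temporal standard deviation, sigma_t = n_cycles / (2 * pi * f); the exact wavelet definition used in the study may differ.

```python
import math

FREQ_HZ = 4.0        # frequency of interest
BASELINE_MS = 100.0  # baseline window from -100 to 0 ms
N_CYCLES = 7         # assumed wavelet width

# One full cycle at 4 Hz, and the share of a cycle covered by the baseline.
period_ms = 1000.0 / FREQ_HZ
baseline_cycles = BASELINE_MS / period_ms

# Temporal SD of the wavelet's Gaussian envelope under the assumed convention.
sigma_t_ms = 1000.0 * N_CYCLES / (2 * math.pi * FREQ_HZ)

print(f"period: {period_ms:.0f} ms")
print(f"baseline covers {baseline_cycles:.1f} cycles")
print(f"wavelet temporal SD: {sigma_t_ms:.1f} ms")
```

      Under these assumptions the baseline spans less than half a cycle at 4 Hz, and the wavelet envelope is wider than the baseline window, so power estimates within the baseline also draw on post-stimulus samples.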

      -In terms of statistical testing

      -It appears that the authors choose the frequency band that will be entered in the statistical analysis from visual inspection of the differences between conditions. They write: "we found the strongest difference between 4 - 5 Hz (see lower panel of Figure 3). Therefore, and because this is the first study of this kind, we analyzed this frequency range." (ll. 277-279). This approach seems extremely problematic since it poses a high risk for 'double-dipping'. This is crucial and needs to be addressed. For instance, the authors could run non-parametric permutation tests on the time-frequency domain using FDR correction or cluster-based permutation tests on the topography.

      -Lack of examining time- / topographic specificity.

      Please also note the sentence before this citation, which states our initial hypothesis: “While our initial proposal was to look at the difference in the 4 Hz theta rhythm between conditions (Köster et al., 2019), we found the strongest difference between 4 – 5 Hz (see lower panel of Figure 3).” Note that the hypothesis of 4 Hz can be clearly derived from our 2019 study. We would maintain that the center frequency we took for the analysis 4.5Hz (i.e., 4 – 5Hz) is very close to this original hypothesis and, considering that we applied a novel design and analyses in very young infants, could indeed hardly have fallen more closely to this initial proposal. The frequency choice is also underlined, as the reviewer remarks, by the consistency of this peak across domains, peaking at 4Hz (cohesion), 4.5Hz (action), and 5Hz (solidity, number). Importantly, please note that we have chosen the electrodes and time window very conservatively, namely by including the whole time period and all electrodes, which we now explain in more detail on p. 10. Please also see our response to reviewer 1, comment “1)”.
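      For readers unfamiliar with the non-parametric permutation tests mentioned in the reviewer's comment, the basic building block can be sketched as follows. This is a plain sign-flip permutation test on a single paired difference, not the cluster-based variant, and the numbers are made up purely for illustration; they are not data from the study.

```python
from itertools import product

# Made-up per-participant condition differences (unexpected minus expected);
# illustrative numbers only, not data from the study.
diffs = [0.8, 1.1, 0.5, 0.9, 1.3, 0.7, 0.4, 1.0]

def sign_flip_p_value(values):
    """Exact two-sided sign-flip permutation p-value for the summed difference."""
    observed = abs(sum(values))
    count = 0
    total = 0
    # Under the null hypothesis the sign of each difference is arbitrary,
    # so every combination of sign flips is equally likely.
    for signs in product((1, -1), repeat=len(values)):
        permuted = abs(sum(s * v for s, v in zip(signs, values)))
        count += permuted >= observed
        total += 1
    return count / total

p_value = sign_flip_p_value(diffs)
print(p_value)
```

      Because all eight illustrative differences are positive, only two of the 256 sign assignments are at least as extreme as the observed one. A cluster-based test applies the same logic to the largest cluster of adjacent time points, frequencies, or electrodes instead of a single value.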

      3) Interpretation of results:

      (a) The authors interpret the descriptive findings of Figure S1 as illustration of the consistency of the results across the four knowledge domains. While we would partly agree with this interpretation based on column A of that figure (even though also there the peak shifts between domains), columns B and C do not picture a consistent pattern of data. That is, the topography appears very different between domains and so does the temporal course of the 4-5Hz power, with only showing higher power in the action and number domain, not in the other two. Since none of these data were compared statistically, any interpretation remains descriptive. Yet, we would like to invite the authors to critically reconsider their interpretation. You also might want to consider adding domain (action, number etc.) as a covariate to your statistical model.

      We agree with the reviewers (reviewer 2 and reviewer 3) that our initial interpretation of the data regarding the consistency of effects across domains may have been too strong. Thus, in the revised version of the manuscript, we do not state that the TF analysis revealed more consistent results. Given that the analysis was based on a different subsample and highly variable in trial numbers, we did not enter them as a covariate in the statistical model.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Manuscript number: RC-2025-03220

      Corresponding author(s): Ryusuke Niwa, Yuko Shimada-Niwa, and Wei Sun

      Dear Editors,

      We are pleased to submit our revised manuscript of RC-2025-03220R. The reviewers’ comments from Review Commons are presented in italic.

      For submission of our current revised manuscript, we provide two Word files, which are the “clean” and “Track-and-Change” files. Page and line numbers described below correspond to those of the “clean” file. The “Track-and-Change” file might be helpful for Reviewers to find what we have changed for the current revision.

      We hope that the revised version is now suitable for the next stage of evaluation.

      Sincerely,

      Ryusuke Niwa, Yuko Shimada-Niwa, and Wei Sun

      1. General Statements [optional]

      We sincerely thank the reviewers for their thoughtful feedback on our initial submission. The experiments that we will conduct and the revisions to the manuscript that have already been incorporated are detailed below in the point-by-point response. For this revised submission, two versions of the manuscript are provided: a clean copy and a tracked-changes file. Page and line numbers mentioned below refer to the clean version, while the tracked-changes file is intended to help reviewers easily identify the revisions made.

      In preparing the revision plan, we have included additional data, some of which were generated in collaboration with new contributors. Accordingly, we would like to propose adding Yuichi Shichino and Shintaro Iwasaki as co-authors to acknowledge their contributions.

      2. Description of the planned revisions


      - Also, the authors show that two different RNAi lines for NudC give the same defects - it would be good to know if the RNAi lines target the same or different sequences in the NudC transcripts. Alternatively, it would be equally good to show that trans-allelic combinations of NudC mutants have the same defects in the prothoracic glands and the salivary glands as the RNAi. Instead, they examine only overall body size, developmental delays and lethality in the trans-hetero allelic NudC mutants.

      Author response:

      In response to the second part of the criticism, we will further validate the observed phenotypes by examining tissue and nuclear size, chromosomal structure, and the levels of Fibrillarin and RpS6 proteins in the prothoracic glands and salivary glands of NudC mutants.


      - It would be quite helpful to characterize the "5 blob" and "shortened polytene chromosome arm" defects shown in Figure 2 and Figure 6. Are these partially polytenized chromosomes or are large sections of the chromosomes missing or just underreplicated? What do the chromosomes look like if you lyse the nuclei, spread the chromosomes and stain with DAPI or Hoechst - this is a pretty standard practice and would reveal much more about the structure of the polytene chromosomes.

      Author response:

      To address these structural concerns more clearly, we plan to apply established protocols to obtain higher-resolution images and gather more detailed information on chromosome morphology.

      - Discussion, line 468. I don't think the authors have provided evidence of DNA damage. With the experiments they have shown, the chromosomes look abnormal - not clear what is abnormal.

      Author response:

      To further confirm DNA damage in NudC knockdown salivary gland cells, we plan to perform a TUNEL assay, which detects DNA fragmentation associated with damage.

      We would like to note that, in the current manuscript, we have shown that depletion of NudC, eIF5, RpLP0-like, or Nopp140 increased γH2Av levels, suggesting activation of the DNA damage response (Figures 6B and 6C).


      *The authors claim that NudC has a dual role as a cell cycle/cytoskeleton regulator and as a ribosome biogenesis factor. However, because NudC knockdown reduces nuclear size and ploidy (Figures 1F and 2H-2I), the authors cannot exclude that decreased rDNA dosage and nucleolar volume contribute to reduced rRNA signals and that the effects seen are due to a NudC involvement in endoreplication, the rRNA reduction being a consequence of lower polyploidy. Different allelic combinations of NudC induce larval growth defects (Figure S5), consistent with a NudC role in endoreplication. To circumvent this, the authors could genetically modulate endocycle progression (e.g., E2F or Fzr overexpression) in the NudC RNAi background to test whether inducing endoreplication rescues rRNA production and nucleolar volume. This would establish causality between the endocycle state and rRNA output and clarify whether NudC's primary role is in RiBi or endocycle control.*

      Author response: In response to Reviewer #2’s suggestion, we plan to genetically modify the progression of the endocycle by inducing continuous expression of Cyclin E (CycE), E2F1, and Fzr in NudC RNAi salivary glands to test whether promoting endoreplication can restore rRNA production and nucleolar volume.

      In fact, we have attempted to rescue the developmental arrest in animals with NudC-deficient prothoracic glands (PGs) by inducing continuous expression of CycE. Two constructs, UAS-CycE-1 (BDSC#30725) and UAS-CycE-2 (BDSC#30924), were used. UAS-CycE-1 has previously been shown to rescue developmental arrest in PG-specific TOR loss-of-function animals (Ohhara, Kobayashi, and Yamanaka. PLoS Genetics 13 (1): e1006583, 2017). We introduced each construct into NudC knockdown PGs. However, continuous expression of CycE did not restore development (Figure A as shown below), suggesting that NudC functions in the polyploid cells extend beyond endocycle regulation. We do not currently plan to include the PG data shown in Figure A in the revised manuscript. We will evaluate whether it would be meaningful to present PG data alongside salivary gland results once we have obtained and analyzed data from the salivary gland rescue experiment.

      Figure A. Survival and developmental progression following continuous expression of CycE. Control (phtm>dicer2, +), NudC knockdown (phtm>dicer2, NudC RNAi), and NudC RNAi + CycE (phtm>dicer2, NudC RNAi, CycE) flies were analyzed at 10 days after hatching (10 dAH). Dead indicates dead larvae; L3 denotes third-instar larvae. Sample sizes (number of flies) are shown below each bar.


      *The conclusion that NudC maintains rRNA levels is derived from salivary gland RNAi phenotypes with strong reductions in ITS1/ITS2 and 18S/28S signals (Figure 4B-4K) and reduced 28S by Northern (Figure 4L), plus corroboration in fat body cells (Figure S7). The authors verified knockdown using two independent RNAi lines for growth phenotypes and NudC::GFP reduction (Figure S2) and generated a UAS-FLAG::NudC transgene (Key Resources), but rRNA measurements were reported for only one RNAi line without rescue. Rescue of the rRNA phenotype by transgenic NudC re-expression, or replication of the rRNA decrease with a second, non-overlapping RNAi, would directly attribute the effect to NudC. In the absence of these standard validation controls, an off-target explanation remains plausible.*

      Author response:

      We plan to analyze rRNA FISH signals in salivary glands and fat bodies using a second, non-overlapping RNAi strain to confirm the reproducibility of the observed effects.

      - The authors report in Fig. 2 elevated γH2Av in SG cells upon NudC knockdown and interpret this as evidence of chromosome destabilization. They also state that apoptosis is not observed in Fig S10. However, the increase in γH2Av could reflect transient or early apoptotic events or other stress responses triggered by NudC depletion, rather than direct defects in endoreplication or genome stability. I suggest that the authors clarify this important point, for example, by co-expressing apoptotic inhibitors such as P35, or by using the TUNEL assay, which is more sensitive than anti-Caspase3 or Dcp1 antibodies.

      Author response:

      We plan to perform a TUNEL assay on salivary gland cells to evaluate apoptosis associated with NudC depletion.

      - Activation of the JNK pathway is often accompanied by apoptosis. It would strengthen the conclusions if the authors included a positive control to confirm that apoptosis is not induced under these experimental conditions, ensuring that the observed effects are specific to autophagy and not confounded by cell death.

      Author response:

      We will analyze pJNK and autophagy levels in animals expressing a constitutively active form of hemipterous (hep[CA]) under the control of the fkh-GAL4 driver as a positive control. hep encodes the Drosophila JNK kinase, and it is well established that forced expression of hep[CA] induces JNK phosphorylation and activation.

      - In Figure S1, reduction of NudC in the fat body appears to induce a starvation-like phenotype, suggesting a potential impairment of metabolic or nutrient-sensing pathways. It would be important to determine whether modulation of nutrient-responsive signaling could rescue this phenotype. Specifically, have the authors examined whether activation of the TOR or PI3K pathways mitigates the effects of NudC knockdown? Assessing pathway activity (e.g., via phospho-S6K or phospho-Akt levels) or performing genetic rescue experiments with pathway activators could clarify whether the observed phenotypes are mediated through disrupted nutrient signaling rather than a secondary effect of general cellular stress. Such analyses could also provide a mechanistic explanation for the increased autophagy observed in these cells.

      Author response:

      1. We will analyze phospho-S6K levels in salivary glands and fat bodies by immunostaining.
      2. To activate the TOR pathway in NudC RNAi fat bodies, we will overexpress Rheb, an established upstream activator of the TOR pathway in Drosophila, which has been shown to robustly increase TOR signaling and S6K phosphorylation.

      - The current images of autophagic vesicles in the SG in Fig. 8B are not clearly visible and quantified. Considering the large size of these polyploid cells, higher-resolution images or alternative imaging approaches should be presented to better visualize and quantify autophagy. This would make the conclusions regarding enhanced autophagy more convincing. In addition, this data could be further strengthened by expanding the analysis of autophagy to other cell types. For example, examining autophagy in fat body cells, where autophagy plays a primary physiological role associated with rRNA accumulation (Fig. S7), rather than a reduction like in SG (Fig. 4), could provide a useful comparison for the function of NudC between polyploid cells.

      Author response:

      In response to the second part of the reviewer’s comment, we will conduct additional experiments using anti-Atg8a immunostaining and/or LysoTracker staining to analyze autophagy in NudC RNAi fat bodies and prothoracic glands. These experiments will help further characterize the cellular responses associated with NudC depletion.

      3. Description of the revisions that have already been incorporated in the transferred manuscript


      - The title is a bit problematic since they haven't shown that NudC doesn't also affect normal mitotic cells - they only look at polyploid cells, but that doesn't mean normal mitotic cells are not also affected.

      Author response:

      In response to the suggestion from Reviewer #1, we have revised the title from “NudC moonlights in ribosome biogenesis and homeostasis in Drosophila melanogaster polyploid cells” to “NudC moonlights in ribosome biogenesis and homeostasis in polyploid cells of Drosophila melanogaster” to place greater emphasis on “polyploid cells.”

      Regarding mitotic cells, we have added new data in the revised manuscript (Figure S7; lines 249–256 and 417–418) demonstrating that NudC regulates apoptosis and stress responses in mitotic imaginal wing disc cells. However, as the main focus of our study remains polyploid cells, we have chosen to retain the emphasis in the title.

      - Also, the authors show that two different RNAi lines for NudC give the same defects - it would be good to know if the RNAi lines target the same or different sequences in the NudC transcripts. Alternatively, it would be equally good to show that trans-allelic combinations of NudC mutants have the same defects in the prothoracic glands and the salivary glands as the RNAi. Instead, they examine only overall body size, developmental delays and lethality in the trans-hetero allelic NudC mutants.

      Author response:

      In response to the first half of the criticism, the two RNAi lines used for NudC target distinct sequences. We have added the corresponding RNAi target sites to Figure S4A for clarity.

      - Results: Lines 261 - 266. Seeing electron dense structures in TEMs and seeing increased Me31B staining by confocal imaging in the cytoplasm is insufficient evidence that the electron dense structures are P-bodies. They could be the P-bodies but they could also be aggregated ribosomes; there is insufficient evidence to "confirm" that they are P-bodies - maybe just say "suggests".

      Author response:

      In response to Reviewer #1’s suggestion, we have revised lines 261–262 to avoid using the word "confirm." The new sentence reads: “Immunostaining with the P-body marker Me31B reveals numerous cytoplasmic P-bodies in NudC-deficient SG cells,” which appears in lines 293–295.

      - Abstract, lines 28 - 31. I think this gene has been identified before. The authors probably want to say they have discovered a role for this gene in RiBi.

      Author response:

      We have followed Reviewer #1’s suggestion and revised the sentence in lines 35–37 to: “In this study, we discovered a role for the gene NudC (nuclear distribution C, dynein complex regulator) in RiBi within polyploid cells of Drosophila melanogaster larvae.”

      - Introduction, line 66. The protein is imported into the nucleus, where it localizes to the nucleolus - technically the protein is not imported into the nucleolus.

      Author response:

      To correct the misrepresentation in line 66, we have revised the sentence to: “RP mRNAs are synthesized by RNA polymerase II, and exported to the cytoplasm for translation. Then, RPs are imported into the nucleus, where they localize to the nucleolus.” in lines 70–73.

      - Introduction, line 70. To be comprehensive in the description of ribosome biogenesis, the authors may want to mention that the 40S and 60S subunits are then exported from the nucleus and form the 80S subunit in the cytoplasm during translation.

      Author response:

      To improve the representation, we have revised the sentences in lines 73–78 as follows: “Within the nucleolus, rRNAs and RPs assemble into pre-40S and pre-60S subunits, immature versions of the small (40S) and large (60S) subunits, respectively, that undergo maturation with numerous ribosome biogenesis factors (RBFs) (Greber, 2016). The 40S and 60S subunits are then transported separately to the cytoplasm, where they combine to form functional 80S ribosomes, capable of sustaining protein synthesis (Pelletier et al., 2018).”

      - Introduction, line 98. May want to cite paper showing that Minute mutations turn out to be mutations in individual ribosomal protein genes.

      Author response:

      As Reviewer #1 suggested, we have cited two papers, Marygold et al. (2007), entitled “The ribosomal protein genes and Minute loci of Drosophila melanogaster”, and Recasens-Alvarez et al. (2021), entitled “Ribosomopathy-associated mutations cause proteotoxic stress that is alleviated by TOR inhibition”, along with He et al. (2015). The inappropriate citation to Brehme (1939) has been removed.

      - Results, lines 292. Since they didn't knock down NudC in the fat body cells in this experiment, this comment seems irrelevant.

      Author response:

      We would like to clarify that the phenotype observed with fkh-GAL4-driven NudC RNAi was specific to salivary glands, and no obvious phenotypes were detected in the surrounding fat body cells, which do not express fkh-GAL4. In this context, the adjacent fat body cells serve as an internal control.

      In the revised manuscript, the sentence has been rewritten as: “In contrast, the fat body cells surrounding NudC-deficient SGs did not show this reduction (Figure S9),” in lines 323–324.

      - Figure 6A. Hoechst is misspelled.

      - Fig. 2 I - Hoeschest should be Hoechst.

      Author response:

      We have fixed the error.

      *- Given that prothoracic gland (PG) size influences ecdysone production, the finding that NudC knockdown alters PG cell size, morphology, and cytoskeletal organization raises the possibility that ecdysone synthesis or signaling may also be affected. This, in turn, could explain the delayed maturation phenotype observed in Figure 1. I recommend testing whether ectopic activation of ecdysone signaling, for instance through 20-hydroxyecdysone (20E) supplementation, can rescue the defects in PG size and developmental timing. Such an experiment would strengthen the link between NudC function, PG morphology, and ecdysone-dependent developmental progression.*

      Author response:

      We have conducted experiments showing that developmental defects in NudC RNAi animals can be partially rescued by administering 20E. Approximately 32% of NudC RNAi larvae fed with 20E completed pupariation. These new data have been added to Figure S1B and are described in the main text (lines 165-168).

      Regarding PG size, our experiments show that PG growth remains inhibited following 20E administration (Figure B as shown below). This observation indicates that treatment with exogenous 20E does not restore PG growth in NudC RNAi animals, suggesting that other factors may be required for normal PG development beyond ecdysone supplementation.

      Because this analysis is not the main focus of our manuscript, we currently plan not to include these data in the revised manuscript.

      __Figure B. Prothoracic gland (PG) size after 20E administration.__

      To assess whether 20E supplementation could restore PG size, control (phtm>dicer2, +) and NudC RNAi (phtm>dicer2, NudC RNAi) larvae were transferred at 60 hours after hatching (hAH) to standard medium containing 20E dissolved in 100% ethanol. Control groups were transferred to medium containing the same volume of 100% ethanol at the same time point. PG size was quantified at the wandering stage. Sample sizes (number of glands) are shown below each bar. Bars represent mean ± SD. **p * *

      - Additionally, qRT-PCR can be performed to assess the expression levels of ecdysone precursors or target genes in whole larvae, serving as a readout of ecdysone activity, including dilp8, which is usually upregulated when ecdysone levels are reduced.

      Author response: To investigate ecdysone biosynthesis, we measured expression of the Halloween genes nvd, spok, sro, phm, dib, and sad by qRT-PCR. In NudC RNAi animals, nvd, sro and phm were suppressed at the late L3 stage, indicating that NudC in the PG is required for ecdysone biosynthesis. The new data are shown in Figure S1A and described in the main text (lines 159-164) of the revised manuscript.

      - The current images of autophagic vesicles in the SG in Fig. 8B are not clearly visible and quantified. Considering the large size of these polyploid cells, higher-resolution images or alternative imaging approaches should be presented to better visualize and quantify autophagy. This would make the conclusions regarding enhanced autophagy more convincing.

      Author response:

      Regarding the image quality issue, we have provided improved images of anti-Atg8a immunostaining in the salivary gland mosaic clones (Figure 8B) and included additional data from SG-specific knockdown cells (Supplemental Figures S13A-S13F) to provide quantitative results.

      - Furthermore, including experiments in other cell types, such as imaginal disc cells, where apoptosis is more readily induced, would help determine whether the effects of NudC knockdown are specific to polyploid cells or are more broadly applicable.

      Author response: We observed apoptosis in NudC RNAi wing discs. In the revised manuscript, we have included these data in Figure S7 and referenced them in the main text (lines 249–256).

      4. Description of analyses that authors prefer not to carry out

      - Results, lines 285 to 298. In situs with multiple probes that detect all parts of both the pre-rRNA and processed rRNA indicate that all are down in the SG in NudC knockdowns, but that the 18S and 28S rRNAs are down the internal transcribed spacers go up - can the authors explain or hypothesize how this could happen?

      Author response:

      As Reviewer #1 indicated, we indeed observed that internal transcribed spacer (ITS) levels decrease in NudC knockdown salivary glands, but increase in knockdown fat bodies. Our hypothesis is that, as noted in the Discussion (lines 529–534), ribosome abundance is typically linked to protein synthesis. Salivary gland cells, which are highly active in protein production, may be particularly sensitive to disruptions in ribosome biogenesis. Therefore, NudC may maintain appropriate levels of rRNA with its impact varying according to the specific regulatory mechanisms of each cell type. We do not have a further explanation for this phenomenon, and therefore we have retained the original sentences without adding new ones.

      - The data presented in Fig 4 show that NudC knockdown reduces pre-rRNA (ITS1/ITS2) and mature 18S/28S rRNAs in a tissue-specific manner. However, it remains unclear whether these reductions have functional consequences for ribosome assembly and translation. I recommend that the authors perform polysome profiling or an equivalent assay to assess the impact of NudC loss on actively translating ribosomes. This approach would provide a quantitative readout of translation efficiency and clarify whether the observed rRNA defects lead to impaired protein synthesis. Additionally, polysome profiling could help explain the tissue-specific differences observed between salivary glands and fat body cells.

      Author response:

      We performed ribosome fractionation using wild-type salivary glands and repeated the experiment three times with 56–62 gland pairs per sample. As shown in Figure C, the polyribosome peaks (grey lines) are not prominent, indicating that a much larger number of glands would be required for robust polysome profiling. Given that NudC RNAi salivary glands are significantly smaller than wild-type glands, collecting enough tissue for equivalent profiling would be technically difficult. Therefore, we concluded that obtaining sufficient RNAi samples for polysome profiling is extremely challenging, and these data have not been included in the revised manuscript.

      On the other hand, we would like to emphasize that we observed a significant reduction in O-propargyl puromycin (OPP) labeling in NudC-deficient salivary gland cells (Figure 3B), which provides strong evidence for reduced translational activity.

      __Figure C. Ribosomal fraction profiles of wild-type salivary glands.__ Salivary glands from late L3 larvae were dissected for analysis. Polyribosome peaks are indicated in grey. The number of salivary gland pairs used for each sample is shown above each bar.

    1. Author Response

      Reviewer #1:

      Hutchings et al. report an updated cryo-electron tomography study of the yeast COP-II coat assembled around model membranes. The improved overall resolution and additional compositional states enabled the authors to identify new domains and interfaces--including what the authors hypothesize is a previously overlooked structural role for the SEC31 C-Terminal Domain (CTD). By perturbing a subset of these new features with mutants, the authors uncover some functional consequences pertaining to the flexibility or stability of COP-II assemblies.

      Overall, the structural and functional work appears reliable, but certain questions and comments should be addressed prior to publication. However, this reviewer failed to appreciate the conceptual advance that warrants publication in a general biology journal like eLIFE. Rather, this study provides a valuable refinement of our understanding of COP-II that I believe is better suited to a more specialized, structure-focused journal.

      We agree that in our original submission our description of the experimental setup, indeed similar to previous work, did not fully capture the novel findings of this paper. Rather than being simply a higher resolution structure of the COPII coat, in fact we have discovered new interactions in the COPII assembly network, and we have probed their functional roles, significantly changing our understanding of the mechanisms of COPII-mediated membrane curvature. In the revised submission we have included additional genetic data that further illuminate this mechanism, and have rewritten the text to better communicate the novel aspects of our work.

      Our combination of structural, functional and genetic analyses does more than refine our textbook understanding of the COPII coat as a simple ‘adaptor and cage’: it provides a completely new picture of how dynamic regulation of assembly and disassembly of a complex network leads to membrane remodelling.

      These new insights have important implications for how coat assembly provides structural force to bend a membrane but is still able to adapt to distinct morphologies. These questions are at the forefront of protein secretion, where there is debate about how different types of carriers might be generated that can accommodate cargoes of different size.

      Major Comments: 1) The authors belabor what this reviewer thinks is an unimportant comparison between the yeast reconstruction of the outer coat vertex with prior work on the human outer coat vertex. Considering the modest resolution of both the yeast and human reconstructions, the transformative changes in cryo-EM camera technology since the publication of the human complex, and the differences in sample preparation (inclusion of the membrane, cylindrical versus spherical assemblies, presence of inner coat components), I did not find this comparison informative. The speculations about a changing interface over evolutionary time are unwarranted and would require a detailed comparison of co-evolutionary changes at this interface. The simpler explanation is that this is a flexible vertex, observed at low resolution in both studies, plus the samples are very different.

      We do agree that our proposal that the vertex interface changes over evolutionary time is speculative and we have removed this discussion. We agree that a co-evolutionary analysis will be enlightening here, but is beyond the scope of the current work.

      We respectfully disagree with the reviewer’s interpretation that the difference between the two vertices is due to low resolution. The interfaces are clearly different, and the resolutions of the reconstructions are sufficient to state this. The reviewer’s suggestion that the difference in vertex orientation might be simply attributable to differences in sample, such as inclusion of the membrane, cylindrical versus spherical morphology, or presence of inner coat components, was ruled out in our original submission: we resolved yeast vertices on spherical vesicles (in addition to those on tubes) and on membrane-less cages. These analyses clearly showed that neither the presence of a membrane nor the change in geometry (tubular vs. spherical) affects vertex interactions. These experiments are presented in Supplementary Fig 4 (Supplementary Fig. 3 in the original version). Similarly, we discount that differences might be due to the presence or absence of inner coat components, since membrane-less cages were previously solved in both conditions and are no different in terms of their vertex structure (Stagg et al. Nature 2006 and Cell 2008).

      We believe it is important to report on the differences between the two vertex structures. Nevertheless, we have shifted our emphasis to the functional aspects of vertex formation and moved the comparison between the two vertices to the supplement.

      2) As one of the major take home messages of the paper, the presentation and discussion of the modeling and assignment of the SEC31-CTD could be clarified. First, it isn't clear from the figures or the movies if the connectivity makes sense. Where is the C-terminal end of the alpha-solenoid compared to this new domain? Can the authors plausibly account for the connectivity in terms of primary sequence? Please also include a side-by-side comparison of the SRA1 structure and the CTD homology model, along with some explanation of the quality of the model as measured by Modeller. Finally, even if the new density is the CTD, it isn't clear from the structure how this sub-stoichiometric and apparently flexible interaction enhances stability. Hence, when the authors wrote "when the [CTD] truncated form was the sole copy of Sec31 in yeast, cells were not viable, indicating that the novel interaction we detect is essential for COPII coat function." Maybe, but could this statement be a leap too far? Is the putative interaction essential, or is the CTD itself essential for reasons that remain to be fully determined?

      The CTD is separated from the C-terminus of the alpha solenoid domain by an extended domain (~350 amino acids) that is predicted to be disordered, and contains the PPP motifs and catalytic fragment that contact the inner coat. This is depicted in cartoon form in Figures 3A and 7, and discussed at length in the text. This arrangement explains why no connectivity is seen, or expected. We could highlight the C-terminus of the alpha-solenoid domain to emphasize where the disordered region should emerge from the rod, but connectivity of the disordered domain to the CTD could arise from multiple positions, including from an adjacent rod.

      The reviewer’s point about the essentiality of the CTD being independent of its interaction with the Sec31 rod is an important one. The basis for our model that the CTD enhances stability or rigidity of the coat is the yeast phenotype of Sec31-deltaCTD, which resembles that of a sec13 null. Both mutants are lethal, but rescued by deletion of emp24, which leads to more easily deformable membranes (Čopič et al. Science 2012). We agree that even if this model is true, the interaction of the CTD with Sec31 that our new structure reveals is not proven to drive rigidity or essentiality. We have tempered this hypothesis and added alternative possibilities to the discussion.

      We have included the SRA1 structure in Supplementary Fig 5, as requested, and the model z-score in the Methods. The z-score, as calculated by the ProSA-web server, is -6.07 (see figure below, black dot), and falls in line with experimentally determined structures including that of the template (PDB 2mgx, z-score = -5.38).

      [Image: ProSA-web z-score plot]

      3) Are the extra rods discussed in Fig. 4 a curiosity of unclear functional significance? This reviewer is concerned that these extra rods could be an in vitro stoichiometry problem, rather than a functional property of COP-II.

      This is an important point, that, as we state in the paper, cannot be answered at the moment: the resolution is too low to identify the residues involved in the interaction. Therefore we are hampered in our ability to assess the physiological importance of this interaction. We still believe the ‘extra’ rods are an important observation, as they clearly show that another mode of outer coat interaction, different from what was reported before, is possible.

      The concern that interactions visualised in vitro might not be physiologically relevant is broadly applicable to structural biology approaches. However, our experimental approach uses samples that result from active membrane remodelling under near-physiological conditions, and we therefore expect these to be less prone to artefacts than most in vitro reconstitution approaches, where proteins are used at high concentrations and in high salt buffer conditions.

      4) The clashscore for the PDB is quite high--and I am dubious about the reliability of refining sidechain positions with maps at this resolution. In addition to the Ramachandran stats, I would like to see the Ramachandran plot as well as, for any residue-level claims, the density surrounding the modeled side chain (e.g. S742).

      The clashscore is 13.2, which, according to MolProbity, is in the 57th percentile for all structures and in the 97th for structures of similar resolutions. We would argue therefore that the clashscore is rather low. In fact, the model was refined from crystal structures previously obtained by other groups, which had a worse clashscore (17), despite being at higher resolution. Our refinement has therefore improved the clashscore. During refinement we have chosen restraint levels appropriate to the resolution of our map (Afonine et al., Acta Cryst D 2018).

      The Ramachandran plot is copied here and could be included in a supplemental figure if required. We make only one residue-level claim (S742), the density for which is indeed not visible at our resolution. We claim that S742 is close to the Sec23-23 interface, and do not propose any specific interactions. Nevertheless we have removed reference to S742 from the manuscript. We included this specific information because of the potential importance of this residue as a site of phosphorylation, thereby putting this interface in broader context for the general eLife reader.

      [Image: Ramachandran plot]

      Minor Comments:

      1) The authors wrote "To assess the relative positioning of the two coat layers, we analysed the localisation of inner coat subunits with respect to each outer coat vertex: for each aligned vertex particle, we superimposed the positions of all inner coat particles at close range, obtaining the average distribution of neighbouring inner coat subunits. From this 'neighbour plot' we did not detect any pattern, indicating random relative positions. This is consistent with a flexible linkage between the two layers that allows adaptation of the two lattices to different curvatures (Supplementary Fig 1E)." I do not understand this claim, since the pattern both looks far from random and the interactions depend on molecular interactions that are not random. Please clarify.

      We apologize for the confusion: the patterns within each of the two coats are not random. Our sentence refers to the positions of inner and outer coats relative to each other. The two lattices have different parameters and the two layers are linked by flexible linkers (the 350 amino acids referred to above). We have now clarified the sentence.
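The 'neighbour plot' idea quoted above can be illustrated with a minimal sketch. This is not the pipeline used in the manuscript: it is a simplified 2D version in which each vertex has a position and a single in-plane angle, whereas the real analysis works with 3D subtomogram coordinates and orientations. The function name `neighbour_plot` and its parameters (`max_dist`, `bin_size`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def neighbour_plot(vertices, inner_particles, max_dist, bin_size):
    """Accumulate inner-coat positions in the local frame of each vertex.

    vertices: list of (x, y, theta), theta being the in-plane orientation
        of the vertex in radians (a 2D simplification, assumed here).
    inner_particles: list of (x, y) inner-coat particle positions.
    Returns a Counter mapping (ix, iy) bin indices to occurrence counts.
    """
    counts = Counter()
    for vx, vy, theta in vertices:
        # Rotating by -theta undoes the vertex orientation.
        c, s = math.cos(-theta), math.sin(-theta)
        for px, py in inner_particles:
            dx, dy = px - vx, py - vy
            if math.hypot(dx, dy) > max_dist:
                continue  # keep only particles at close range
            # Express the offset in the vertex's own coordinate frame.
            lx = c * dx - s * dy
            ly = s * dx + c * dy
            counts[(math.floor(lx / bin_size), math.floor(ly / bin_size))] += 1
    return counts
```

If the two lattices were rigidly coupled, counts would pile up in a few bins; a flexible linkage would instead spread them roughly evenly over the sampled area.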

      2) Related to major point #1, the authors wrote "We manually picked vertices and performed carefully controlled alignments." I do not know what it means to carefully control alignments, and fear this suggests human model bias.

      We used different starting references for the alignments, with the precise aim to avoid model bias. For both vesicle and cage vertex datasets, we have aligned the subtomograms against either the vertex obtained from tubules, or the vertex from previously published membrane-less cages. In all cases, we retrieved a structure that resembles the one on tubules, suggesting that the vertex arrangement we observe isn’t simply the result of reference bias. This procedure is depicted in Supplementary Fig 4 (Supplementary Fig. 3 in the original manuscript), but we have now clarified it also in the methods section.

      3) Why do some experiments use EDTA? I may be confused, but I was surprised to see the budding reaction employed 1mM GMPPNP, and 2.5mM EDTA (but no Magnesium?). Also, for the budding reaction, please replace or expand upon the "the 10% GUV (v/v)" with a mass or molar lipid-to-protein ratio.

      We regret the confusion. As stated in the methods, all our budding reactions are performed in the presence of EDTA and Magnesium, which is present in the buffer (at 1.2 mM). The reason is to facilitate nucleotide exchange, as reported and validated in Bacia et al., Scientific Reports 2011.

      Lipids in GUV preparations are difficult to quantify. We report the stock concentrations used, but in each preparation the amount of dry lipid that forms GUVs might be different, as is the concentration of GUVs after hydration. However, since we analyse reactions where COPII proteins have bound and remodelled individual GUVs, we do not believe the protein/lipid ratio influences our structures.

      4) Please cite the AnchorMap procedure.

      We cite the SerialEM software, and are not aware of other citations specifically for the anchor map procedure.

      5) Please edit for typos (focussing, functionl, others)

      Done

      Reviewer #2:

      The manuscript describes new cryo-EM, biochemistry, and genetic data on the structure and function of the COPII coat. Several new discoveries are reported including the discovery of an extra density near the dimerization region of Sec13/31, and "extra rods" of Sec13/31 that also bind near the dimerization region. Additionally, they showed new interactions between the Sec31 C-terminal unstructured region and Sec23 that appear to bridge multiple Sec23 molecules. Finally, they increased the resolution of the Sec23/24 region of their structure compared to their previous studies and were able to resolve a previously unresolved L-loop in Sec23 that makes contact with Sar1. Most of their structural observations were nicely backed up with biochemical and genetic experiments which give confidence in their structural observations. Overall the paper is well-written and the conclusions justified.

      However, this is the third iteration of structure determination of the COPII coat on membrane with essentially the same preparation and methods. Each time, there has been an incremental increase in resolution and new discoveries, but the impact of the present study is deemed to be modest. The science is good, but it may be more appropriate for a more specialized journal. Areas of specific concern are described below.

      As described above, we respectfully disagree with this interpretation of the advance made by the current work. This work improves on previous work in many aspects. The resolution of the outer coat increases from over 40 Å to 10-12 Å, allowing visualisation of features that were not previously resolved, including a novel vertex arrangement, the Sec31 CTD, and the outer coat ‘extra rods’. An improved map of the inner coat also allows us to resolve the Sec23 ‘L-loop’. We would argue that these are not just extra details, but correspond to a suite of novel interactions that expand our understanding of the complex COPII assembly network. Moreover, we include biochemical and genetic experiments that not only back up our structural observations but bring new insights into COPII function. As pointed out in response to reviewer 1, we believe our work contributes a significant conceptual advance, and have modified the manuscript to convey this more effectively.

      1) The abstract is vague and should be re-written with a better description of the work.

      We have modified the abstract to specifically outline what we have done and the major new discoveries of this paper.

      2) Line 166 - "Surprisingly, this mutant was capable of tubulating GUVs". This experiment gets to one of the fundamental unknown questions in COPII vesiculation. It is not clear what components are driving the membrane remodeling and at what stages during vesicle formation. Isn't it possible that the tubulation activity the authors observe in vitro is not being driven at all by Sec13/31 but rather Sec23/24-Sar1? Their Sec31ΔCTD data supports this idea because it lacks a clear ordered outer coat despite making tubules. An interesting experiment would be to see if tubules form in the absence of all of Sec13/31 except the disordered domain of Sec31 that the authors suggest crosslinks adjacent Sec23/24s.

      This is an astute observation, and we agree with the reviewer that the source of membrane deformation is not fully understood. We favour the model that budding is driven significantly by the Sec23-24 array. To further support this, we have performed a new experiment, where we expressed Sec31ΔN in yeast cells lacking Emp24, which have more deformable membranes and are tolerant to the otherwise lethal deletion of Sec13. While Sec31ΔN in a wild type background did not support cell viability, this was rescued in a Δemp24 yeast strain, strongly supporting the hypothesis that a major contributor to membrane remodelling is the inner coat, with the outer coat becoming necessary to overcome membrane bending resistance that ensues from the presence of cargo. We now include these results in Figure 1.

      However, we must also take into account the results presented in Fig. 6, where we show that weakening the Sec23-24 interface still leads to budding, but only if Sec13-31 is fully functional, and that in this case budding leads to connected pseudo-spherical vesicles rather than tubes. When Sec13-31 assembly is also impaired, tubes appear unstructured. We believe this strongly supports our conclusions that both inner and outer coat interactions are fundamental for membrane remodelling, and it is the interplay between the two that determines membrane morphology (i.e. tubes vs. spheres).

      To dissect the roles of inner and outer coats even further, we have done the experiment that the reviewer suggests: we expressed a Sec31 fragment comprising residues 768-1114, but the protein was not well-behaved and co-purified with chaperones. We believe the disordered domain aggregates when not scaffolded by the structured elements of the rod. Nonetheless, we used this fragment in a budding reaction, and could not see any budding. We did not include this experiment as it was inconclusive: the lack of functionality of the purified Sec31 fragment could be attributed to the inability of the disordered region to bind its inner coat partner in the absence of the scaffolding Sec13-31 rod. As an alternative approach, we have used a version of Sec31 that lacks the CTD, and harbours a His tag at the N-terminus (known from previous studies to partially disrupt vertex assembly). We think this construct is more likely to be near native, since both modifications on their own lead to functional protein. We could detect no tubulation with this construct by negative stain, while both control constructs (Sec31ΔCTD and Nhis-Sec31) gave tubulation. This suggests that the cross-linking function of Sec31 is not sufficient to tubulate GUV membranes, but some degree of functional outer coat organisation (either mediated by N- or C-terminal interactions) is needed. It is also possible that the lack of outer coat organisation might lead to less efficient recruitment to the inner coat and cross-linking activity. We have added this new observation to the manuscript.

      3) Line 191 - "Inspecting cryo-tomograms of these tubules revealed no lozenge pattern for the outer coat" - this phrasing is vague. The reviewer thinks that what they mean is that there is a lack of order for the Sec13/31 layer. Please clarify.

      The reviewer is correct; we have changed the sentence.

      4) Line 198 - "unambiguously confirming this density corresponds to the CTD." This only confirms that it is the CTD if that were the only change and the Sec13/31 lattice still formed. Another possibility is that it is density from other Sec13/31 that only appears when the lattice is formed such as the "extra rods". One possibility is that the density is from the extra rods. The reviewer agrees that their interpretation is indeed the most likely, but it is not unambiguous. The authors should consider cross-linking mass spectrometry.

      We have removed the word ‘unambiguously’, and changed to ‘confirming that this density most likely corresponds to the CTD’. Nonetheless, we believe that our interpretation is correct: the extra rods bind to a different position, and themselves also show the CTD appendage. In this experiment, the lack of the CTD was the only biochemical change.

      5) In the Sec31ΔCTD section, the authors should comment on why ΔCTD is so deleterious to oligomer organization in yeast when cages form so abundantly in preparations of human Sec13/31 ΔC (Paraan et al 2018).

      We have added a comment to address this. “Interestingly, human Sec31 proteins lacking the CTD assemble in cages, indicating that either the vertex is more stable for human proteins and sufficient for assembly, or that the CTD is important in the context of membrane budding but not for cage formation in high salt conditions.”

      6) The data is good for the existence of the "extra rods", but their significance and importance are not clear. How can these extra densities be distinguished from packing artifacts due to imperfections in the helical symmetry?

      Please also see our response to point 3 from reviewer 1. Regarding the specific concern that artefacts might be a consequence of imperfection in the helical symmetry, we would argue such imperfections are indeed expected in physiological conditions, and to a much higher extent. For this reason interactions seen in the context of helical imperfections are likely to be relevant. In fact, in normal GTP hydrolysis conditions, we expect long tubes would not be able to form, and the outer coat to be present on a wide range of continuously changing membrane curvatures. We think that the ability of the coat to form many interactions when the symmetry is imperfect might be exactly what confers the coat its flexibility and adaptability.

      7) Figure 5 is very hard to interpret and should be redone. Panels B and C are particularly hard to interpret.

      We have made a new figure where we think clarity is improved.

      8) The features present in Sec23/24 structure do not reflect the reported resolution of 4.7 Å. It seems that the resolution is overestimated.

      We report an average resolution of 4.6 Å. In most of our map we can clearly distinguish beta strands, follow the twist of alpha helices and see bulky side chains. These features typically become visible at 4.5-5 Å resolution. We agree that some areas are worse than 4.6 Å, as typically expected for such a flexible assembly, but we believe that the average resolution value reported is accurate. We obtained the same resolution estimate using different software including RELION, Phenix and Dynamo, so that is really the best value we can provide. To further convince ourselves that we have the resolution we claim, we sampled EM maps from the EMDB with the same stated resolution (we just took the 7 most recent ones which had an associated atomic model), and visualised their features at arbitrary positions. For both beta strands and alpha helices, we do not feel our map looks any worse than the others we have examined. We include a figure here.

      [Figure: features of the inner coat map compared with EMDB maps of the same stated resolution]

      9) Lines 315/316 - "We have combined cryo-tomography with biochemical and genetic assays to obtain a complete picture of the assembled COPII coat at unprecedented resolution (Fig. 7)"

      10) Figure 7 is a schematic model/picture; the authors should reference a different figure or rephrase the sentence.

      We now refer to Fig 7 in a more appropriate place.

      Reviewer #3:

      The manuscript by Hutchings et al. describes several previously uncharacterised molecular interactions in the coats of COP-II vesicles by using reconstituted coats of yeast COP-II. They have improved the resolution of the inner coat to 4.7 Å by tomography and subtomogram averaging, revealing detailed interactions, including those made by the so-called L-loop not observed before. Analysis of the outer layer also led to new interesting discoveries. The Sec31 CTD was assigned in the map by comparing the WT and deletion mutant STA-generated density maps. It seems to stabilise the COP-II coats and further evidence from yeast deletion mutants and microsome budding reconstitution experiments suggests that this stabilisation is required in vitro. Furthermore, COP-II rods that cover the membrane tubules in a right-handed manner sometimes revealed an extra rod, which is not part of the canonical lattice, bound to them. The binding mode of these extra rods (which I refer to here as Y-shape) is different from the canonical two-fold symmetric vertex (X-shape). When the same binding mode is utilized on both sides of the extra rod (Y-Y) the rod seems to simply insert in the canonical lattice. However, when the Y-binding mode is utilized on one side of the rod and the X-binding mode on the other side, this leads to bridging different lattices together. This potentially contributes to increased flexibility in the outer coat, which may be required to adopt different membrane curvatures and shapes with different cargos. These observations build a picture where stabilising elements in both COP-II layers contribute to functional cargo transport. The paper makes significant novel findings that are described well. Technically the paper is excellent and the figures nicely support the text. I have only minor suggestions that I think would improve the text and figure.

      We thank the reviewer for helpful suggestions which we agree improve the manuscript.

      Minor Comments:

      L 108: "We collected .... tomograms". While the meaning is clear to a specialist, this may sound somewhat odd to a generic reader. Perhaps you could say "We acquired cryo-EM data of COP-II induced tubules as tilt series that were subsequently used to reconstruct 3D tomograms of the tubules."

      We have changed this as suggested.

      L 114: "we developed an unbiased, localisation-based approach". What is the part that was developed here? It seems that the inner layer particle coordinates were simply shifted to get starting points in the outer layer. Developing an approach sounds more substantial than this. Also, it's unclear what is unbiased about this approach. The whole point is that it's biased to certain regions (which is a good thing as it incorporates prior knowledge on the location of the structures).

      We have modified the sentence to “To target the sparser outer coat lattice for STA, we used the refined coordinates of the inner coat to locate the outer coat tetrameric vertices”, and explain the approach in detail in the methods.

      L 124: "The outer coat vertex was refined to a resolution of approximately ~12 A, revealing unprecedented detail of the molecular interactions between Sec31 molecules (Supplementary Fig 2A)". The map alone does not reveal molecular interactions; the main understanding comes from fitting of X-ray structures to the low-resolution map. Also "unprecedented detail" itself is somewhat problematic as the map of Noble et al (2013) of the Sec31 vertex is also at a nominal resolution of 12 Å. Furthermore, Supplementary Fig 2A does not reveal this "unprecedented detail"; it shows the resolution estimation by FSC. To clarify these points, you could say: "Fitting of the Sec31 atomic model to our reconstruction vertex at 12-Å resolution (Supplementary Fig 2A) revealed the molecular interactions between different copies of Sec31 in the membrane-assembled coat."

      We have changed the sentence as suggested.

      L 150: Can the authors exclude the possibility that the difference is due to differences in data processing? E.g. how the maps amplitudes have been adjusted?

      Yes, we can exclude this scenario by measuring distances between vertices in the right and left handed direction. These measurements are only compatible with our vertex arrangement, and cannot be explained by the big deviation from 4-fold symmetry seen in the membrane-less cage vertices.

      L 172: "that wrap tubules either in a left- or right-handed manner". Don't they always do both on each tubule? Now this sentence could be interpreted to mean that some tubules have a left-handed coat and some a right-handed coat.

      We have changed this sentence to clarify. “Outer coat vertices are connected by Sec13-31 rods that wrap tubules both in a left- and right-handed manner.”

      L276: "The difference map" hasn't been introduced earlier but is referred to here as if it has been.

      We now introduce the difference map.

      L299: Can "Secondary structure predictions" denote a protein region "highly prone to protein binding"?

      Yes, this is done through DISOPRED3, a feature included in the PSIPRED server we used for our predictions. The reference is: Jones D.T., Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics. 2015; 31:857–863. We have now added this reference to the manuscript.

      L316: It's true that the detail in the map of the inner coat is unprecedented and the model presented in Figure 7 is partially based on that. But here "unprecedented resolution" sounds strange as this sentence refers to a schematic model and not a map.

      We have changed this by moving the reference to Fig 7 to a more appropriate place.

      L325: "have 'compacted' during evolution" -> remove. It's enough to say it's more compact in humans and less compact in yeast as there could have been different adaptations in different organisms at this interface.

      We have changed as requested. See also our response to reviewer 1, point 1.

      L327: What exactly is meant by "sequence diversity or variability at this density"?

      We have now clarified: “Since multiple charge clusters in yeast Sec31 may contribute to this interaction interface (Stancheva et al., 2020), the low resolution could be explained by the fact that the density is an average of different sequences.”

      L606-607: The description of this custom data processing approach is difficult to follow. Why is in-plane flip needed and how is it used here?

      Initially particles are picked ignoring tube directionality (as this cannot be assessed easily from the tomograms due to the pseudo-twofold symmetry of the Sec23/24/Sar1 trimer). So the in-plane rotation of each inner coat subunit could be near 0 or 180°. For each tube, both angles are sampled (in-plane flip). Most tubes result in the majority of particles being assigned one of the two orientations (which is then assumed as the tube directionality). Particles that do not conform are removed, and rare tubes where directionality cannot be determined are also removed. We have re-written the description to clarify these points: “Initial alignments were conducted on a tube-by-tube basis using the Dynamo in-plane flip setting to search in-plane rotation angles 180° apart. This allowed us to assign directionality to each tube, and particles that were not conforming to it were discarded by using the Dynamo dtgrep_direction command in custom MATLAB scripts.”
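
      The tube-by-tube directionality assignment described above can be sketched as a majority vote. This is a minimal Python illustration only, not the Dynamo or MATLAB code used by the authors: the function name `assign_tube_directionality`, the input layout, and the `min_majority` cutoff used to call a tube ambiguous are all assumptions introduced here.

```python
from collections import Counter


def assign_tube_directionality(particles, min_majority=0.6):
    """Assign a direction to each tube by majority vote over its particles.

    particles: list of (tube_id, flip) tuples, where flip is the best-scoring
        in-plane rotation for that particle (0 or 180 degrees).
    min_majority: hypothetical cutoff; tubes whose winning orientation holds
        a smaller fraction of particles are treated as undetermined.

    Returns (kept_indices, directions), where kept_indices lists the particles
    that agree with their tube's direction and directions maps tube_id to the
    assigned orientation.
    """
    by_tube = {}
    for index, (tube_id, flip) in enumerate(particles):
        by_tube.setdefault(tube_id, []).append((index, flip))

    kept_indices = []
    directions = {}
    for tube_id, members in by_tube.items():
        counts = Counter(flip for _, flip in members)
        direction, votes = counts.most_common(1)[0]
        if votes / len(members) < min_majority:
            continue  # directionality cannot be determined: drop the tube
        directions[tube_id] = direction
        # Keep only the particles that conform to the tube direction.
        kept_indices.extend(i for i, flip in members if flip == direction)
    return sorted(kept_indices), directions


example = [("t1", 0), ("t1", 0), ("t1", 180), ("t2", 180), ("t2", 0)]
kept, directions = assign_tube_directionality(example)
```

      In this toy input, tube `t1` has a clear majority and keeps its two conforming particles, while tube `t2` is split evenly and is discarded as a whole.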

      L627: "Z" here refers to the coordinate system of aligned particles not that of the original tomogram. Perhaps just say "shifted 8 pixels further away from the membrane".

      Changed as requested.

      L642-643: How can the "left-handed" and "right-handed" rods be separated here? These terms refer to the long-range organisation of the rods in the lattice; it's not clear how they were separated in the early alignments.

      They are separated by picking only one subset using the Dynamo sub-boxing feature. This extracts boxes from the tomogram which are in set positions and orientation relative to the average of previously aligned subtomograms. From the average vertex structure, we sub-box rods at 4 different positions that correspond to the centre of the rods, and the 2-fold symmetric pairs are combined into the same dataset. We have clarified this in the text: “The refined positions of vertices were used to extract two distinct datasets of left and right-handed rods respectively using the Dynamo sub-boxing feature.”
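
      The geometry behind sub-boxing can be sketched as follows: each rod centre is a fixed offset in the frame of the average vertex, which is rotated by the refined orientation of each vertex and added to its refined position. This is a hedged Python sketch of that idea, not Dynamo's implementation; the function names, the offset values, the rotation convention, and which offsets count as left- or right-handed are all hypothetical.

```python
def rotate(rotation, vector):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return tuple(
        sum(rotation[row][col] * vector[col] for col in range(3))
        for row in range(3)
    )


def sub_box_positions(vertices, offsets):
    """Compute rod-centre positions from refined vertex positions.

    vertices: list of (position, rotation) pairs, one per aligned vertex.
    offsets: dict mapping a dataset name to the offsets (in the frame of the
        average vertex) of the rod centres belonging to that dataset. The two
        2-fold related offsets of each handedness share one dataset.
    """
    datasets = {name: [] for name in offsets}
    for position, rotation in vertices:
        for name, dataset_offsets in offsets.items():
            for offset in dataset_offsets:
                shift = rotate(rotation, offset)
                datasets[name].append(
                    tuple(p + s for p, s in zip(position, shift))
                )
    return datasets


# Hypothetical offsets (in pixels) for the four rod centres around a vertex.
rod_offsets = {
    "left": [(5, 0, 0), (-5, 0, 0)],
    "right": [(0, 5, 0), (0, -5, 0)],
}
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rods = sub_box_positions([((10, 10, 10), identity)], rod_offsets)
```

      Because the offsets are defined relative to the average vertex, every extracted rod inherits a known starting position and handedness label before any further alignment.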

      Figure 2B. It's difficult to see the difference between dark and light pink colours.

      We have changed colours to enhance the difference.

      Figure 3C. These panels report the relative frequency of neighbouring vertices at each position; "intensity" does not seem to be the right measure for this. You could say that the colour bar indicates the "relative frequency of neighbouring vertices at each position" and add detail how the values were scaled between 0 and 1. The same applies to SFigure 1E.

      Changed as requested.

      Figure 4. The COP-II rods themselves are relatively straight, and they are not left-handed or right-handed. Here, more accurate would be "architecture of COPII rods organised in a left-handed manner". (In the text the authors may of course define and then use this shorter expression if they so wish.) Panel 4B top panel could have the title "left-handed" and the lower panel should have the title "right-handed" (for consistency and clarity).

      We have now defined left- and right-handed rods in the text, and have changed the figure and panel titles as requested.

    1. Author Response

      Reviewer #1:

      This paper addresses the very interesting topic of genome evolution in asexual animals. While the topic and questions are of interest, and I applaud the general goal of a large-scale comparative approach to the questions, there are limitations in the data analyzed. Most importantly, as the authors raise numerous times in the paper, questions about genome evolution following transitions to asexuality inherently require lineage-specific controls, i.e. paired sexual species to compare with the asexual lineages. Yet such data are currently lacking for most of the taxa examined, leaving a major gap in the ability to draw important conclusions here. I also do not think the main positive results, such as the role of hybridization and ploidy on the retention and amount of heterozygosity, are novel or surprising.

      We agree with the reviewer that having the sexual outgroups would improve the interpretations; this is one of the points we make in our manuscript. Importantly however, all previous genome studies of asexual species focus on individual asexual lineages, generally without sexual species for comparison. Yet reported genome features have been interpreted as consequences of asexuality (e.g., Flot et al. 2013). By analysing and comparing these genomes, we can show that these features are in fact lineage-specific rather than general consequences of asexuality. Unexpectedly, we find that asexuals that are not of hybrid origin are largely homozygous, independently of the cellular mechanism underlying asexuality. This contrasts with the general view that cellular mechanisms such as central fusion (which facilitates heterozygosity retention between generations) promote the evolutionary success of asexual lineages relative to mechanisms such as gamete duplication (which generate complete homozygosity) by delaying the expression of the recessive load. We also do not observe the expected relationship between cellular mechanism of asexuality and heterozygosity retention in species of hybrid origin. Thus we respectfully disagree that our results are not surprising. Reviewer #2 found our results “interesting” and a “potentially important contribution”, and reviewer #3 wrote that we “call into question the generality of the theoretical expectations, and suggest that the genomic impacts of asexuality may be more complicated than previously thought”.

      We also make it very clear that some of the patterns we uncover (e.g. low TE loads in asexual species) cannot be clearly evaluated with asexuals alone. Our study emphasizes the importance of the fact that asexuality is a lineage-level trait and that comparative analyses using asexuals require lineage-level replication in addition to comparisons to sexual species.

      References

      Flot, Jean-François, et al. "Genomic evidence for ameiotic evolution in the bdelloid rotifer Adineta vaga." Nature 500.7463 (2013): 453-457.

      Reviewer #2:

      [...] Major Issues and Questions:

      1) The authors choose to refer to asexuality when describing thelytokous parthenogenesis. Asexuality is a very general term that can be confusing: fission, vegetative reproduction could also be considered asexuality. I suggest using parthenogenesis throughout the manuscript for the different animal clades studied here. Moreover, in thelytokous parthenogenesis meiosis can still occur to form the gametes, it is therefore not correct to write that "gamete production via meiosis... no longer take place" (lines 57-58). Fertilization by sperm indeed does not seem to take place (except during hybridogenesis, a special form of parthenogenesis).

      We will clarify more explicitly what asexuality refers to in our manuscript. Notably our study does not include species that produce gametes which are fertilized (which is the case under hybridogenesis, which sensu stricto is not a form of parthenogenesis). Even though many forms of parthenogenesis do indeed involve meiosis (something we explain in much detail in box 2), there is no production of gametes.

      2) The cellular mechanisms of asexuality in many asexual lineages are known through only a few, old cytological studies and could be inaccurate or incomplete (for example Triantaphyllou paper of 1981 of Meloidogyne nematodes or Hsu, 1956 for bdelloid rotifers). The authors should therefore mention in the introduction the lack of detailed and accurate cellular and genetic studies to describe the mode of reproduction because it may change the final conclusion.

      For example, for bdelloid rotifers the literature is scarce. However the authors refer in Supp Table 1 to two articles that did not contain any cytological data on oogenesis in bdelloid rotifers to indicate that A. vaga and A. ricciae use apomixis as reproductive mode. Welch and Meselson studied the karyotypes of bdelloid rotifers, including A. vaga, and did not conclude anything about absence or presence of chromosome homology and therefore nothing can be said about their reproduction mode. In the article of Welch and Meselson the nuclear DNA content of bdelloid species is measured but without any link with the reproduction mode. The only paper referring to apomixis in bdelloids is from Hsu (1956) but it is old and new cytological data with modern technology should be obtained.

      We will correct the rotifer citations and thank the reviewer for picking up the error. We agree that there are uncertainties in some cytological studies, but the same is true for genomic studies (which is why we base our analyses as much as possible on raw reads rather than assemblies because the latter may be incorrect). We in fact excluded cytological studies where the findings could not be corroborated. For example, we discarded the evidence for meiosis and diploidy by Handoo et al. 2004 for its incompatibility with genomic data because this study does not provide any verifiable evidence (there are no data or images, only descriptions of observations). We provide all the references in the supplementary material concerning the cytological evidence used.

      3) In the section on Heterozygosity, the authors compute heterozygosity from kmer spectra analysis from reads to "avoid biases from variable genome assembly qualities" (page 16). But such kmer analysis can be biased by the quality and coverage of sequencing reads. While such analyses are a legitimate tool for heterozygosity measurements, this argument (the bias of genome quality) is not convincing and the authors should describe the potential limits of using kmer spectra analyses.

      We excluded all the samples with unsuitable quality of data (e.g. one tardigrade species with excessive contamination or the water flea samples for insufficient coverage), and T. Rhyker Ranallo Benavidez, the author of the method we used, collaborated with us on the heterozygosity analyses. However, we will clarify the limitations of the method for species with extremely low or high heterozygosity (see also comment 5 of this reviewer).

      4) The authors state that heterozygosity levels “should decay over time for most forms of meiotic asexuality". This is incorrect, as this is not expected with "central fusion" or with "central fusion automixis equivalent" where there is no cytokinesis at meiosis I.

      Our statement is correct. Note that we say “most” and not “all” because certain forms of endoduplication in F1 hybrids result in the maintenance of heterozygosity. Central fusion is expected to fully retain heterozygosity only if recombination is completely suppressed (see for example Suomalainen et al. 1987 or Engelstädter 2017).

      5) I do not fully agree with the authors’ statement that: "In spite of the prediction that the cellular mechanism of asexuality should affect heterozygosity, it appears to have no detectable effect on heterozygosity levels once we control for the effect of hybrid origins (Figure 2)." (page 17)

      The scaling on Figure 2 is emphasizing high values, while low values are not clearly separated. By zooming in on the smaller heterozygosity % values we may observe a bigger difference between the "asexuality mechanisms". I do not see how asexuality mechanism was controlled for, and if you look closely at intra group heterozygosity, variability is sometimes high.

      It is expected that hybrid origin leads to higher heterozygosity levels but saying that asexuality mechanism is not important is surprising: on Figure 2 the orange (central fusion) is always higher than yellow (gamete duplication).

      As we explain in detail in the text, the three comparatively high heterozygosity values under spontaneous origins of asexuality (“orange” points in the bottom left corner of the figure) are found in an only 40-year old clone of the Cape bee. Among species of hybrid origin, we see no correlation between asexuality mechanism and heterozygosity. These observations suggest that the asexuality mechanism may have an impact on genome-wide heterozygosity in recent incipient asexual lineages, but not in established asexual lineages.

      Also, the variability found within rotifers could be an argument against a strong importance of asexuality origin on heterozygosity levels: the four bdelloid species likely share the same origin but their allelic heterozygosity levels appear to range from almost 0 to almost 6% (Fig 2 and 3, however the heterozygosity data on Rotaria should be confirmed, see below).

      We prefer not using the data from rotifers for making such arguments, given the large uncertainty with respect to genome features in this group (including the possibility of octoploidy in some species which we describe in the supplemental information). One could even argue that the highly variable genome structure among rotifer species could indicate repeated transitions to asexuality and/or different hybridization events, but the available genome data would make all these arguments highly speculative.

      The authors’ main idea (i.e. asexuality origin is key) seems mostly true when using homoeolog heterozygosity and/or composite heterozygosity which is not what most readers will usually think as "heterozygosity". This should be made clear by the authors mostly because this kind of heterozygosity does not necessarily undergo the same mechanism as the one described in Box 2 for allelic heterozygosity. If homoeolog heterozygosity is sometimes not distinguishable from allelic heterozygosity, then it would be nice to have another box showing the mechanisms and evolution pattern for such cases (like a true tetraploid, in which all copies exist).

      The heterozygosity between homoeologs is always high in this study while it appears low between alleles, but since the heterozygosity between homeologs can only be measured when there is a hybrid origin, the only heterozygosity that can be compared between ALL the asexual groups is the one between alleles.

      By definition, homoeologs have diverged between species, while alleles have diverged within species. So indeed divergence between homoeologs will generally exceed divergence between alleles. We will consider adding expected patterns in perfect tetraploid species for Box 2.

      Both in the results and the conclusion the authors should not over-interpret the results on heterozygosity. The variation in allelic heterozygosity could be small (although not in all asexuals studied) also due to the age of the asexual lineages. This is not mentioned here in the result/discussion section.

      We explain in section Overview of species and genomes studied that age effects are important but that we do not consider them quantitatively because age estimates are not available for the majority of asexual species in our paper.

      6) Regarding the section on Heterozygosity structure in polyploids

      There is inconsistency in many of the numbers. For example, A. vaga heterozygosity is estimated at 1.42% in Figure 1, but then appears to show up around 2% in Figure 2, and then becomes 2.4% on page 20. It is unclear if this is an error or the result of different methods.

      It is also unclear how homologs were distinguished from homeologs. How are 21 bp k-mers considered homologous? In the Methods section, the authors describe extracting unique k-mer pairs differing by one SNP, so does this mean that no more than one SNP was allowed to define heterozygous homologous regions? Does this mean that homologues (and certainly homoeologs) differing by more than 5% would not be retrieved by this method? If so, then it is not surprising that A. vaga is classified as a diploid.

      Figure 1 a presents the values reported in the original genome studies, not our results. This is explained in the corresponding figure legend. Hence, 1.42 is the value reported by Flot et al. 2013. 2.4 is the value we measure and it is consistent in Figures 2 and 3.

      We used k-mer pairs differing by one SNP to estimate ploidy (smudgeplot). The heterozygosity estimates were derived from kmer spectra (GenomeScope 2.0). The kmers that are found in 1n must be heterozygous between homologs, as the homoeolog heterozygosity would produce 2n kmers. We used the kmer approach to estimate heterozygosity in all cases other than the homoeologs of rotifers, which were directly derived from the assemblies. We explain this in the legend to Figure 3, but we will add the information also to the Methods section for clarification.
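
      To make the notion of "k-mer pairs differing by one SNP" concrete, the following Python sketch finds all pairs of k-mers in a set that differ at exactly one position. It is an illustration only and does not reproduce the smudgeplot implementation: the function name `one_snp_pairs` is hypothetical, and the sketch ignores reverse complements, k-mer coverage, and any filtering of k-mers with more than one partner.

```python
def one_snp_pairs(kmers):
    """Return sorted pairs of k-mers that differ at exactly one position.

    For each k-mer, every single-base substitution is generated and looked up
    in the set, so the cost is linear in the number of k-mers rather than
    quadratic as in an all-against-all comparison.
    """
    kmer_set = set(kmers)
    pairs = set()
    for kmer in kmer_set:
        for position, base in enumerate(kmer):
            for alternative in "ACGT":
                if alternative == base:
                    continue
                neighbour = kmer[:position] + alternative + kmer[position + 1:]
                if neighbour in kmer_set:
                    # Store each pair once, in a canonical (sorted) order.
                    pairs.add(tuple(sorted((kmer, neighbour))))
    return sorted(pairs)


pairs = one_snp_pairs(["ACGTA", "ACCTA", "TTTTT"])
```

      Two k-mers that differ at more than one position are never paired by this procedure, which is the property the reviewer's question about highly divergent homoeologs turns on.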

      The result for A. ricciae is surprising and I am still not convinced by the octoploid hypothesis. In Fig S2. there is a first peak at 71x coverage that still could be mostly contaminants. It would be helpful to check the GC distribution of k-mers in the first haploid peak of A. ricciae to check whether there are contaminants. The karyotypes of 12 chromosomes indeed do not fit the octoploid hypothesis. I am also surprised by the 5.5% divergence calculated for A. ricciae, this value should be checked when eliminating potential contaminants (if any). In general, these kinds of ambiguities will not be resolved without long-read sequencing technology to improve the genome assemblies of asexual lineages.

      We understand the scepticism of the reviewer regarding the octoploidy hypothesis, but it is important to note that we clearly present it as a possible explanation for the data that needs to be corroborated, i.e., we state that the data are more consistent with octo- than tetraploidy. Contamination seems quite unlikely, as the 71.1x peak represents nearly exactly half the coverage of the otherwise haploid peak (142x). Furthermore, the Smudgeplot analysis shows that some of the kmers from the 71x peak pair with genomic kmers of the main peaks. We also performed KAT analysis (not presented in the manuscript) showing that these kmers are also represented in the decontaminated assembly. We will add this clarification regarding possible contamination to the supplementary materials.
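
      The coverage argument above is a simple ratio check, which can be written out as a short Python sketch. The function name `coverage_ratio` is hypothetical; the two coverage values are those quoted in the response.

```python
def coverage_ratio(minor_peak, main_peak):
    """Ratio of the coverage of a minor k-mer peak to that of the main peak."""
    return minor_peak / main_peak


# Values quoted in the response: a 71.1x peak against a 142x peak.
ratio = coverage_ratio(71.1, 142.0)
deviation_from_half = abs(ratio - 0.5)
```

      The ratio is about 0.501, a deviation of under 0.1 percentage points from exactly one half. A contaminant would have no particular reason to sit at half the coverage of the main peak, which is the basis of the authors' argument.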

      7) Regarding the section on palindromes and gene conversion

      The authors screened all the published genomes for palindromes, including small blocks, to provide a more robust, unbiased view. However, the result will be unbiased and robust only if all the genomes compared were assembled using the same sequencing data (quality, coverage) and assembly program. While palindromes appear not to play a major role in the genome evolution of parthenogenetic animals, since only a few palindromes were detected among all lineages, mitotic (and meiotic) gene conversion is likely to take place in parthenogens and should indeed be studied among all the clades.

      We agree with the reviewer that gene conversion might be one of the key aspects of asexual genome evolution. Our study merely pointed out that genomes of asexual animals do not show organisation in palindromes, indicating that palindromes might not be of general importance in asexual genome evolution. Note also that we clearly point out that these analyses are biased by the quality of the available genome assemblies.

      8) Regarding the section on transposable elements

      The authors are aware that the approach used may underestimate the TEs present in low copy numbers, therefore the comparison might underestimate the TE numbers in certain asexual groups.

      Yes. We clearly explain this limitation in the manuscript. The currently available alternatives are based on assembled genomes, so the results are biased by the quality of the assemblies (and similarities to TEs in public databases) and our aim was to broadly compare genomes in the absence of assembly-generated biases.

      9) Regarding the section on horizontal gene transfer. For the HGTc analysis, annotated genes were compared to the UniRef90 database to identify non-metazoan genes and HGT candidates were confirmed if they were on a scaffold containing at least one gene of metazoan origin. While this method is indeed interesting, it is also biased by the annotation quality and the length of the scaffolds which vary strongly between studies.

      Yes, this is true and we explain many limitations in the supplemental information, but re-assembling and re-annotating all these genomes would be beyond reasonable computational possibilities.
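
      The confirmation rule described in point 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the study: the tuple layout, the origin labels and the function name `confirm_hgt_candidates` are hypothetical, and the classification of each gene as metazoan or non-metazoan (from the UniRef90 comparison) is assumed to have been done already.

```python
def confirm_hgt_candidates(genes):
    """Return non-metazoan genes located on a scaffold that also carries
    at least one gene of metazoan origin.

    `genes` is an iterable of (gene_id, scaffold_id, origin) tuples, where
    origin is "metazoan" or "non-metazoan" (hypothetical labels).
    """
    genes = list(genes)
    # Scaffolds anchored by at least one metazoan gene.
    anchored = {scaffold for _, scaffold, origin in genes
                if origin == "metazoan"}
    return sorted(gene_id for gene_id, scaffold, origin in genes
                  if origin == "non-metazoan" and scaffold in anchored)


# Toy input: g3 sits alone on scaf2, so it cannot be confirmed.
example = [
    ("g1", "scaf1", "metazoan"),
    ("g2", "scaf1", "non-metazoan"),
    ("g3", "scaf2", "non-metazoan"),
]
confirmed = confirm_hgt_candidates(example)
```

      The toy input also shows the bias raised by the reviewer: a candidate on a short scaffold with no annotated metazoan neighbour is dropped regardless of its true origin.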

      10) Regarding the use of GenomeScope2.0

      When homologues are very divergent (as observed in bdelloid rotifers) GenomeScope probably considers these distinct haplotypes as errors, making it difficult to model the haploid genome size and giving a high peak of errors in the GenomeScope profile. Moreover, due to the very divergent copies in A. vaga, GenomeScope indeed provides a diploid genome (instead of tetraploid).

      For A. vaga, the heterozygosity estimated by GenomeScope 2.0 on our new sequencing dataset is 2% (as shown in this paper). This percentage corresponds to the heterozygosity between k-mers but does not provide any information on the heterogeneity in heterozygosity measurements along the genome. A limitation of GenomeScope 2.0 (which the authors should mention here) is that it assumes that the entire genome follows the same theoretical k-mer distribution.

      The model for estimating genome-wide heterozygosity indeed assumes a random distribution of heterozygous loci and is unable to estimate divergence above a certain threshold, which is the reason why we used genome assemblies to estimate the divergence of homoeologs. Regarding estimates in all other genomes, the assumptions are unlikely to fundamentally change the output of the analysis. GenomeScope 2.0 is described in detail in a recent paper (Ranallo-Benavidez et al. 2019), where the assumption that heterozygosity rates are constant across the genome is explicitly mentioned.

      References

      Engelstädter, Jan. "Asexual but not clonal: evolutionary processes in automictic populations." Genetics 206.2 (2017): 993-1009.

      Flot, Jean-François, et al. "Genomic evidence for ameiotic evolution in the bdelloid rotifer Adineta vaga." Nature 500.7463 (2013): 453-457.

      Handoo, Z. A., et al. "Morphological, molecular, and differential-host characterization of Meloidogyne floridensis n. sp. (Nematoda: Meloidogynidae), a root-knot nematode parasitizing peach in Florida." Journal of Nematology 36.1 (2004): 20.

      Suomalainen, Esko, Anssi Saura, and Juhani Lokki. Cytology and evolution in parthenogenesis. CRC Press, 1987.

      Ranallo-Benavidez, Timothy Rhyker, Kamil S. Jaron, and Michael C. Schatz. "GenomeScope 2.0 and Smudgeplots: Reference-free profiling of polyploid genomes." BioRxiv (2019): 747568. 

      Reviewer #3:

      Jaron and collaborators provide a large-scale comparative work on the genomic impact of asexuality in animals. By analysing 26 published genomes with a unique bioinformatic pipeline, they conclude that none of the expected features due to the transition to asexuality is replicated across a majority of the species. Their findings call into question the generality of the theoretical expectations, and suggest that the genomic impacts of asexuality may be more complicated than previously thought.

      The major strengths of this work are (i) the comparison among various modes and origins of asexuality across 18 independent transitions; and (ii) the development of a bioinformatic pipeline directly based on raw reads, which limits the biases associated with genome assembly. Moreover, I would like to acknowledge the effort made by the authors to provide detailed methods on public servers, which allow the analyses to be reproduced. That being said, I also have a series of concerns, listed below:

      We thank this reviewer for the relevant comments and for providing many constructive suggestions in the points below. We will take them into account for our final version of the manuscript.

      1) Theoretical expectations

      As far as I understand, the aim of this work is to test whether 4 classical predictions associated with the transition to asexuality and 5 additional features observed in individual asexual lineages hold at a large phylogenetic scale. However, I think that these predictions are poorly presented, and so they may be hard for non-expert readers to understand. Some of them are briefly mentioned in a descriptive way in the Introduction (L56 - 61), and in a little more detail in Boxes 1 and 2. However, the evolutionary reasons why one should expect these features to occur (and under which assumptions) are not clearly stated anywhere in the Introduction (but only briefly in the Results & Discussion). I think it is important that the authors provide clear-cut quantitative expectations for each genomic feature analysed and under each asexuality origin and mode (Box 1 and 2). Highlighting the assumptions behind these expectations will also help with the interpretation of the observed patterns.

      We will clarify the expectations for non-expert readers.

      2) Mutation accumulation & positive selection

      A subtlety which is not sufficiently emphasized to my mind is that the different modes of asexuality encompass reproduction with or without recombination (Box 2), which can lead to very different genetic outcomes. For example, it has been shown that Muller's ratchet (the accumulation of deleterious mutations in asexual populations) can be stopped by small amounts of recombination in large populations (Charlesworth et al. 1993; 10.1017/S0016672300031086). Similarly, a new recessive beneficial mutation can only segregate in a heterozygous state in a clonal lineage (unless a second mutation hits the same locus), whereas in the presence of recombination, these mutations will rapidly fix in the population through the formation of homozygous mutants (Haldane's Sieve, Haldane 1927; 10.1017/S0305004100015644). Therefore, depending on whether recombination occurs or not during asexual reproduction, the expectations may be quite different, and so they could deviate from the "classical predictions". In this regard, I would like to see the authors adjust their conclusions. Moreover, it is also not very clear whether the species analysed here are 100% asexual or whether they sometimes go through transitory sexual phases, which could reset some of the genomic effects of asexuality.

      Yes, the predictions regarding the efficiency of selection are indeed influenced by cellular modes of asexuality. Adding some details or at least a good reference would certainly increase the readability of the section. We thank the reviewer for this suggestion.

      3) Transposable elements

      I found the predictions regarding the amount of TEs expected under asexuality quite ambiguous. On the one hand, TEs are expected not to spread because they cannot colonize new genomes (Hickey 1982); on the other hand, TEs can be viewed as any deleterious mutation that will accumulate in asexual genomes due to Muller's ratchet. The argument provided by the authors to justify the expectation of low TE load in asexual lineages is that "Only asexual lineages without active TEs, or with efficient TE suppression mechanisms, would be able to persist over evolutionary timescales". But this argument should then equally be applied to any other type of deleterious mutation, and so we would not be able to see Muller's ratchet in the first place. Therefore, not observing the expected pattern for TEs in the genomic data is not so surprising, as the expectation itself does not seem to be very robust. I would like the authors to better acknowledge this issue, which actually supports their general idea that the genomic consequences of asexuality are not so simple.

      Indeed, the survivorship bias should affect all genomic features. Nothing that is incompatible with the viability of the species will ever be observed in nature. Perhaps the difference between Muller’s ratchet and the dynamics of accumulation of transposable elements (TEs) is that TEs are expected to either propagate very fast or not at all (Dolgin and Charlesworth 2006), while the effects of Muller’s ratchet are expected to vary among different populations and cellular mechanisms of asexuality. We will rephrase the text to better reflect the complexity of the predicted consequences of TE dynamics.

      4) Heterozygosity

      Due to the absence of recombination, asexual populations are expected to maintain a high level of diversity at each single locus (heterozygosity), but a low number of different haplotypes. However, as presented by the authors in Box 2, there are different modes of parthenogenesis with different outcomes regarding heterozygosity: (1) preservation at all loci; (2) reduction or loss at all loci; (3) reduction depending on the chromosomal position relative to the centromere (distal or proximal). Therefore, the authors could use their genome-based dataset to explore in more detail the distribution of heterozygosity along the chromosomes, and further test whether it fits the above predictions. If the differing quality of the genome assemblies is an issue, the authors could at least provide the variance of the heterozygosity across the genome. Mode #3 (i.e. central fusions and terminal fusions) would be particularly interesting, as one would then be able to compare, within the same genome, regions with a large excess vs. deficit of heterozygosity and assess their evolutionary impacts.

      Moreover, the authors should put more emphasis on the fact that using a single genome per species is a limitation to test the subtle effects of asexuality on heterozygosity (and also on "mutation accumulation & positive selection"). These effects are better detected using population-based methods (i.e. with many individuals, but not necessarily many loci). For example, the FIS value of a given locus is negative when its heterozygosity is higher than expected under random mating, and positive when the reverse is true (Wright 1951; 10.1111/j.1469-1809.1949.tb02451.x).
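
      To make the sign convention concrete, here is a minimal sketch assuming the standard definition F_IS = 1 - H_obs / H_exp, with H_exp computed from allele frequencies under random mating. The function names and the example values are illustrative and are not taken from the manuscript.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity under random mating: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in allele_freqs)


def f_is(observed_het: float, expected_het: float) -> float:
    """Inbreeding coefficient F_IS = 1 - H_obs / H_exp for one locus."""
    if expected_het <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - observed_het / expected_het


# Biallelic locus with equal allele frequencies (illustrative values).
h_exp = expected_heterozygosity([0.5, 0.5])
excess = f_is(0.9, h_exp)    # more heterozygotes than expected: negative
deficit = f_is(0.1, h_exp)   # fewer heterozygotes than expected: positive
```

      Because both quantities are frequencies across individuals, this statistic needs population samples, which is the limitation of single-genome data pointed out above.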

      We agree with the reviewer that the analysis of the distribution of heterozygosity along the chromosomes would be very interesting. However, the necessary data is available only for the Cape honey bee, and its analysis has been published by Smith et al. 2018. Calculating the probability distribution of heterozygosities would be possible, but it would require SNP calling for each of the datasets. Such an analysis would be computationally intensive and prone to biases by the quality of the genome assemblies.

      5) Absence of sexual lineages

      A second limit of this work is the absence of sexual lineages to use as references in order to control for lineage-specific effects. I do not agree with the authors when they say that "the theoretical predictions pertaining to mutation accumulation, positive selection, gene family expansions, and gene loss are always relative to sexual species [...] and cannot be independently quantified in asexuals." I think that this is true for all the genomic features analysed, because the transition to asexuality is going to affect the genome of asexual lineages relative to their sexual ancestors. This is actually acknowledged at the end of the Conclusion by the authors.

      To give an example, the authors say that "Species with an intraspecific origin of asexuality show low heterozygosity levels (0.03% - 0.83%), while all of the asexual species with a known hybrid origin display high heterozygosity levels (1.73% - 8.5%)". Interpreting these low vs. high heterozygosity values is difficult without having sexual references, because the level of genetic diversity is also heavily influenced by the long term life history strategies of each species (e.g. Romiguier et al. 2014; 10.1038/nature13685).

      I understand that the genomes of related sexual species are not available, which precludes direct comparisons with the asexual species. However, I think that the results could be strengthened if the authors provided, for each genomic feature that they tested, some estimates from related sexual species. Actually, they partially do so in the Results & Discussion section for the palindromes, transposable elements and horizontal gene transfers. I think that these expectations for sexual species (and others) could be added to Table 1 to facilitate the comparisons.

      Our statement "the theoretical predictions pertaining to mutation accumulation, positive selection, gene family expansions, and gene loss are always relative to sexual species [...] and cannot be independently quantified in asexuals." specifically refers to methodology: analyses to address these predictions require orthologs between sexual and asexual species. We fully agree that in addition to methodological constraints, comparisons to sexual species are also conceptually relevant - which is in fact one of the major points of our paper. We will clarify these points.

      6) Regarding statistics, I acknowledge that the number of species analysed is relatively low (n=26), which may preclude getting any significant results if the effects are weak. However, the authors should then clearly state in the text (and not only in the reporting form) that their analyses are descriptive. Also, their position regarding this issue is not entirely clear as they still performed a statistical test for the effect of asexuality mode / origin on TE load (Figure 2 - supplement 1). Therefore, I would like to see the same statistical test performed on heterozygosity (Figure 2).

      We will unify the sections and add an appropriate test wherever suitable.

      7) As you used 31 individuals from 26 asexual species, I was wondering whether you took advantage of the multi-sample species. For example, were the k-mer-based analyses congruent between individuals of the same species?

      Unfortunately, some of the 31 individuals do not have publicly available reads (some of the root-knot nematode datasets are missing), others do not have sufficient quality (the coverage for some water flea samples is very low). Our analyses were consistent for the few cases where we have multiple datasets available.

      References

      Dolgin, Elie S., and Brian Charlesworth. "The fate of transposable elements in asexual populations." Genetics 174.2 (2006): 817-827.

      Smith, Nicholas MA, et al. "Strikingly high levels of heterozygosity despite 20 years of inbreeding in a clonal honey bee." Journal of evolutionary biology 32.2 (2019): 144-152.

    1. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Learn more at Review Commons


      Referee #2

      Evidence, reproducibility and clarity

      This study by Louka et al. investigates the function of Cep104, a protein associated with Joubert syndrome, in Xenopus. Several aspects are studied at different scales. Loss of function of this protein suggests a role in neural tube closure, apical constriction, and HH signaling. The authors then investigate the localization of Cep104 in the primary cilia of the neural tube before focusing on its localization in multiciliated cells. They look at the consequences of loss of function on motile cilia and conclude that it plays a role in the length of the distal segment. They then show an association of Cep104 with cytoplasmic microtubules in non-multiciliated cells of the Xenopus epidermis, analyze the function of Cep104 on these microtubules, and show that loss of Cep104 function increases the speed of EB1 comets. They next look at the impact of loss of function on microtubule stability and at the impact of gain of function. Finally, they return to the multiciliated cells and describe an intercalation defect that correlates with decreases in acetylated tubulin. I think that certain controls are missing and that the choice of illustrations should be reconsidered (better quality, appropriate zoom). In terms of form, the text is not easy to read and the manuscript would benefit from reformatting to highlight the logical links between the different experiments and avoid a catalog-like effect. I would advise the authors to revise their introduction to make it less disjointed and to guide readers toward the questions addressed by the manuscript.

      Below are specific comments and remarks:

      Figure 1:

      Why is the conclusion a "delay" in neural tube closure? At what stage is this analyzed? Is there a recovery of NT closure at later stages? A: I would suggest providing control pictures of non-injected and tracer-only injected embryos. B: Statistics are missing on the graph. D: Mention what was injected instead of "+ rescue". Close-up pictures would allow a better appreciation of the differences in surface area.

      Figure S1:

      To illustrate the claim that cilia are not affected, it would be good to show injection of tracer alone and compare it to tracer + morpholino. It would also be good to provide a measure of cilia size.

      Figure 2:

      Please provide pictures to illustrate graph D.

      Figure 5:

      "Interestingly, most of the nocodazole-resistant stable microtubules were positive for Cep104 (Figure 5C, arrows)." - The variation in density of the Cep104-GFP signal is not visible in the pictures provided in C. I would suggest showing higher magnifications. Also, in the DMSO-treated picture the Cep104-GFP signal looks really different from the Cep104-GFP signal shown in B. Arrows should be reported on all channels. However, it is not clear what we should see with these arrows. 5C: it seems that in the nocodazole-treated condition Cep104-GFP is at the cilia base in MCCs, which is different from the DMSO control condition. The basal body signal was not seen in Figure 3A, which analyzes the localization of Cep104-GFP in MCCs. Why not comment on this? Is it a phenotype in MCCs?

      Figure 6:

      "Intriguingly, morphant non-MCCs have significantly more mean β-tubulin signal compared to control non-MCCs in embryos treated with DMSO (Figure 6C)." - This is impossible to appreciate on the figures. Please specify on the figure what is considered a morphant non-MCC versus a control non-MCC. The membrane-cherry positive cells (supposedly morphant? this has to be clarified) show very heterogeneous tubulin expression.

      If the point here is to show that microtubules are more sensitive to nocodazole in morphant cells than in control cells, I would suggest showing all conditions on the same graph. At least annotate the graph more for a self-explanatory figure (DMSO, Nocodazole).

      Figure 7:

      Statistics are missing on Graph B.

      Comment on the text:

      "Cep104 signal shows the characteristic two dot pattern in motile cilia (Figure 3A) that was also observed in a recent study using Xenopus Cep104 [65] and in the cilia of Tetrahymena [50]. This is in agreement with a recent study showing the characteristic two dot pattern for Xenopus Cep104 as well [66]" - ref 65 and 66b are the same (Hong et al., preprint)

      "This data suggests that downregulation of CEP104 affects the stability of cytoplasmic microtubules." - I would suggest a more precise conclusion stating how it is affected: more stable or less stable? This is important for the follow-up demonstration.

      Movies:

      Please annotate movies 2 and 3 properly so that readers know what they are looking at.

      Referees cross-commenting

      Similar feeling that reviews are consistent

      Significance

      This study investigates the role of the protein Cep104 in Xenopus. Cep104 is a protein associated with Joubert syndrome, whose role in primary cilia has been extensively documented. While its localization at the tip of motile cilia has also been reported, this study provides functional evidence for the role of Cep104 in motile cilia. In addition, the study looks at the role of Cep104 on non-ciliary microtubules, which is the original aspect of the paper and may ultimately lead to a better understanding of Joubert syndrome. However, I believe that the evidence provided (controls, illustrations) needs to be improved. This paper will be of interest to a specialized audience with an interest in proteins associated with cilia and microtubules.

      I am a cell biologist specialized in the study of multiciliated cells using advanced imaging methods and Xenopus and mice as models. I believe my expertise was a perfect match for this manuscript.

    1. Author Response:

      Reviewer #1 (Public Review):

      1. There was little comment on the strategy/mechanism that enabled subjects to readily attain Target I (MU 1 active alone), and then Target II (MU1 and MU2 active to the same relative degree). To accomplish this, it would seem that the peak firing rate of MU1 during pursuit of Target II could not exceed that during Target I despite an increased neural drive needed to recruit MU2. The most plausible explanation for this absence of additional rate coding in MU1 would be that associated with firing rate saturation (e.g., Fuglevand et al. (2015) Distinguishing intrinsic from extrinsic factors underlying firing rate saturation in human motor units. Journal of Neurophysiology 113, 1310-1322). It would be helpful if the authors might comment on whether firing rate saturation, or other mechanism, seemed to be at play that allowed subjects to attain both targets I and II.

      To place the cursor inside TII, both MU1 and MU2 must discharge action potentials at their corresponding average discharge rate during 10% MVC (± 10% due to the target radius and neglecting the additional gain set manually in each direction). Therefore, subjects could simply exert a force of 10% MVC to reach TII and would successfully place the cursor inside TII. However, to get to TI, MU1 must discharge action potentials at the same rate as during TII hits (i.e. average discharge rate at 10% MVC) while keeping MU2 silent. Based on the performance analysis in Fig 3D, subjects had difficulties moving the cursor towards TI when the difference in recruitment threshold between MU1 and MU2 was small (≤ 1% MVC). In this case, the average discharge rate of MU1 during 10% MVC could not be reached without activating MU2. As could be expected, reaching towards TI became more successful when the difference in recruitment threshold between MU1 and MU2 was relatively large (≥3% MVC). In this case, subjects were able to let MU1 discharge action potentials at its average discharge rate at 10% MVC without triggering activation of MU2 (it seems the discharge rate of MU1 saturated before the onset of MU2). Such behaviour can be observed in Fig. 2A. MUs with a lower recruitment threshold saturate their discharge rate before the force reaches 10% MVC. We adapted the Discussion accordingly to describe this behaviour in more detail.
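
      The target logic described above can be sketched as follows. This is a simplified illustration under stated assumptions and not the experimental software: each axis is taken to be one unit's discharge rate normalised by its average rate at 10% MVC, the targets are placed at TI = (1, 0) and TII = (1, 1), TIII = (0, 1) is assumed by symmetry, the target radius is 0.1, and the manually set gain is neglected. All names and the example rates are hypothetical.

```python
import math

# Assumed target centres in normalised (MU1, MU2) discharge-rate space.
TARGETS = {"TI": (1.0, 0.0), "TII": (1.0, 1.0), "TIII": (0.0, 1.0)}
RADIUS = 0.1  # +/- 10% target radius


def cursor_position(rate_mu1, rate_mu2, ref_mu1, ref_mu2):
    """Normalise each unit's rate by its average rate at 10% MVC."""
    return (rate_mu1 / ref_mu1, rate_mu2 / ref_mu2)


def target_hit(position):
    """Return the name of the target containing the cursor, or None."""
    for name, centre in TARGETS.items():
        if math.dist(position, centre) <= RADIUS:
            return name
    return None


# Hypothetical reference rates (Hz) at 10% MVC for the two units.
REF_MU1, REF_MU2 = 10.0, 8.0
both_at_reference = target_hit(cursor_position(10.0, 8.0, REF_MU1, REF_MU2))
mu1_alone = target_hit(cursor_position(10.0, 0.0, REF_MU1, REF_MU2))
mu2_recruited_early = target_hit(cursor_position(10.0, 4.0, REF_MU1, REF_MU2))
```

      In this sketch, the last case corresponds to a pair with similar recruitment thresholds: MU2 is already active when MU1 reaches its reference rate, so the cursor lands between TI and TII and neither target is hit.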

      2. Figure 4 (and associated Figure 6) is nice, and the discovery of the strategy used by subjects to attain Target III is very interesting. One mechanism that might partially account for this behavior that was not directly addressed is the role inhibition may have played. The size principle also operates for inhibitory inputs. As such, small, low threshold motor neurons will tend to respond to a given amount of inhibitory synaptic current with a greater hyperpolarization than high threshold units. Consequently, once both units were recruited, subsequent gradual augmentation of synaptic inhibition (concurrent with excitation and broadly distributed) could have led to the situation where the low threshold unit was deactivated (because of the higher magnitude hyperpolarization), leaving MU2 discharging in isolation. This possibility might be discussed.

      We agree with the reviewer’s comment that inhibition might have played a critical role in succeeding to reach TIII. Hence, we have added this concept to our discussion.

      3. In a similar vein as for point 2 (above), the argument that PICs may have been the key mechanism enabling the attainment of target III, while reasonable, also seems a little hand-wavy. The problem with the argument is that it depends on differential influences of PICs on motor neurons that are 1) low threshold, and 2) have similar recruitment thresholds. This seems somewhat unlikely given the broad influence of neuromodulatory inputs across populations of motor neurons.

      We agree with the reviewer’s point and reasoning that a mixture of neuromodulation and inhibition likely introduced the variability in MU activity we observed in this study. This comment is addressed in the answer to comment 3.

      Reviewer #2 (Public Review):

      [...]

      1. Some subjects seemed to hit TIII by repeatedly "pumping" the force up and down to increase the excitability of MU2 (this appears to happen in TIII trials 2-6 in Fig. 4 - c.f. p18 l30ff). It would be useful to see single-trial time series plots of MU1, MU2, and force for more example trials and sessions, to get a sense for the diversity of strategies subjects used. The authors might also consider providing additional analyses to test whether multiple "pumps" increased MU2 excitability, and if so, whether this increase was usually larger for MU2 than MU1. For example, they might plot the ratio of MU2 (and MU1) activation to force (or, better, the residual discharge rate after subtracting predicted discharge based on a nonlinear fit to the ramp data) over the course of the trial. Is there a reason to think, based on the data or previous work, that units with comparatively higher thresholds (out of a sample selected in the low range of <10% MVC) would have larger increases in excitability?


      We added a supplementary figure (Supplement 4) that visualizes additional trials from different conditions and subjects for TIII-instructed trials and noted this in the text.

      MU excitability might indeed be pronounced during repeated activations within a couple of seconds (see, for example, M. Gorassini, J. F. Yang, M. Siu, and D. J. Bennett, “Intrinsic Activation of Human Motoneurons: Reduction of Motor Unit Recruitment Thresholds by Repeated Contractions,” J. Neurophysiol., vol. 87, no. 4, pp. 1859–1866, 2002). Such an effect, however, seems to be equally distributed across all active MUs. Moreover, we are not aware of any recent studies suggesting that MUs within the narrow range of 0-10% MVC may be excited differently by such a mechanism. Supplement 4C and D illustrate trials in which subjects performed multiple “pumps”. Visually, we could not find changes in excitability specific to either of the two MUs, nor evidence that subjects explored repeated activation of MUs as a strategy to reach TIII. It seems subjects instead tried to find the precise force level that would allow them to keep MU2 active after the offset of MU1. We further discussed that PICs act very broadly on all MUs. The discharge patterns observed when TIII was successfully reached are likely due to an interplay of broadly distributed neuromodulation and locally acting synaptic inhibition.

      1. I am somewhat surprised that subjects were able to reach TIII at all when the de-recruitment threshold for MU1 was lower than the de-recruitment threshold for MU2. It would be useful to see (A) performance data, as in Fig. 3D or 5A, conditioned on the difference in de-recruitment thresholds, rather than recruitment thresholds, and (B) a scatterplot of the difference in de-recruitment vs the difference in recruitment thresholds for all pairs.


      We agree that comparing the difference in de-recruitment threshold with the performance of reaching each target might provide valuable insights into the strategies used to perform the tasks. Hence, we added this comparison to Figure 4E at p. 16, l. 1. A scatterplot of the difference in de-recruitment threshold and the difference in recruitment threshold has been added to Supplement 3A. The Results section was modified in line with the above changes.

      1. Using MU1 / MU2 rates to directly control cursor position makes sense for testing for independent control over the two MUs. However, one might imagine that there could exist a different decoding scheme (using more than two units, nonlinearities, delay coordinates, or control of velocity instead of position) that would allow subjects to generate smooth trajectories towards all three targets. Because the authors set their study in a BCI context, they may wish to comment on whether more complicated decoding schemes might be able to exploit single-unit EMG for BCI control or, alternatively, to argue that a single degree of freedom in input fundamentally limits the utility of such schemes.


      This study aimed to assess whether humans can learn to decorrelate the activity between two MUs coming from the same functional MU pool during constrained isometric conditions. The biofeedback was chosen to encourage subjects to perform this non-intuitive and unnatural task. Transferring biofeedback on single MUs into an application, for example, BCI control, could include more advanced pre-processing steps. Not all subjects were able to navigate the cursor along both axes consistently (always hitting TI and TIII). However, the performance metric (Figure 4C) indicated that subjects became better over time at diverging from the diagonal and thus increased their moving range inside the 2D space for various combinations of MU pairs. Hence, a weighted linear combination of the activity of both MUs (for example, along the two principal components based on the cursor distribution) may enable subjects to navigate a cursor from one axis to another. Similarly, coadaptation methods or different types of biofeedback (auditory or haptic) may help subjects. Furthermore, using only two MUs to drive a cursor inside a 2D space is prone to interference. Including multiple MUs in the control scheme may improve the performance even in the presence of noise. We have shown that the activation of a single MU pool exposed to a common drive does not necessarily obey rigid control. State-dependent flexible control due to variable intrinsic properties of single MUs may be exploited for specific applications, such as BCI. However, further research is necessary to understand the potential and limits of such a control scheme.

      1. The conclusions of the present work contrast somewhat with those of Marshall et al. (ref. 24), who claim (for shoulder and proximal arm muscles in the macaque) that (A) violations of the "common drive" hypothesis were relatively common when force profiles of different frequencies were compared, and that (B) microstimulation of different M1 sites could independently activate either MU in a pair at rest. Here, the authors provide a useful discussion of (A) on p19 l11ff, emphasizing that independent inputs and changes in intrinsic excitability cannot be conclusively distinguished once the MU has been recruited. They may wish to provide additional context for synthesizing their results with Marshall et al., including possible differences between upper / lower limb and proximal / distal muscles, task structure, and species.

      The work by Marshall, Churchland and colleagues shows that focal stimulation of specific sites in M1 can activate single MUs, which may suggest a direct pathway from cortical neurons to single motor neurons within a pool. However, it remains to be shown whether humans can learn to leverage such potential pathways or whether the observations are limited to the artificially induced stimulus. The tibialis anterior receives a strong and direct cortical projection. Thus, we think that this muscle may be well suited to study whether subjects can explore such specific pathways to activate single MUs independently. It may very well be that the control of upper limbs shows more flexibility than that of lower ones. However, we are not aware of any study that provides evidence for a critical mismatch in the control of upper and lower limb MU pools. We have added this discussion to the manuscript.

      Reviewer #3 (Public Review):

      [...]

      Even if the online decomposition of motor units were performed perfectly, the visual display provided to subjects smooths the extracted motor unit discharge rates over a very wide time window: 1625 msec. This window is significantly larger than the differences in recruitment times in many of the motor unit pairs being used to control the interface. So while it's clear that the subjects are learning to perform the task successfully, it's not clear to me that subjects could have used the provided visual information to receive feedback about or learn to control motor unit recruitment, even if individuated control of motor unit recruitment by the nervous system is possible. I am therefore not convinced that these experiments were a fair test of subjects' ability to control the recruitment of individual motor units.

      Regarding the validation of motor unit isolation in the conditions analysed in this study, we have added a full new set of measurements with concomitant surface and intramuscular recordings during recruitment/derecruitment of motor units at variable recruitment speed. This provides a strong validation of the approach and of the accuracy of the online decomposition used in this study. Subjects received visual feedback on the activity of the selected MU pair, i.e. the discharge behaviour of both MUs and the resulting cursor movement. This information was not clear from the initial submission and hence, we annotated the current version to clarify the biofeedback modalities. To further clarify the decoding of incoming MU1/MU2 discharge rates into cursor movement, we included Supplement 2. We also included a video that shows that the smoothing window on the cursor position does not affect the immediate cursor movement due to incoming spiking activity. For example, as shown in Supplement 2, for the initial offset of 0 ms, the cursor starts moving along the axis corresponding to a sole activation of MU1 and immediately diverges from this axis when MU2 starts to discharge action potentials. We, therefore, think that the biofeedback provided to the subjects does allow exploration of single MU control.

      Along similar lines, it seems likely to me that subjects are using some other strategy to learn the task, quite possibly one based on control of overall force at the ankle and/or voluntary recruitment of other leg/foot muscles. Each of these variables will presumably be correlated with the activity of the recorded motor units and the movement of the cursor on the screen. Moreover, because these variables likely change on a similar (or slower) timescale than differences in motor unit recruitment or derecruitment, it seems to me that using such strategies, which do not reflect or require individuated motor unit recruitment, is a highly effective way to successfully complete the task given the particular experimental setup.

      In addition to being seated and restricted by an ankle dynamometer, subjects were instructed to only perform dorsiflexion of the ankle. Further, none of the subjects reported compensatory movements as a strategy to reach any of the targets. In addition, to be successfully utilised, such compensatory movements would need to influence various combinations of MUs tested in this study equally, even when they differ in size. Nevertheless, we acknowledge, as pointed out by the reviewer, that our setup has limitations. We only measured force in a single direction (i.e. ankle dorsiflexion) and did not track toe, hip or knee movements. Even though an instructor supervised leg movement throughout the experiment, it may be that very subtle, unconscious compensatory movements influenced the activity of the selected MUs. Hence, we updated the limitations section in the Discussion.

      To summarize my above two points, it seems like the authors' argument is that absence of evidence (subjects do not perform individuated MU recruitment in this particular task) constitutes evidence of absence (i.e. is evidence that individuated recruitment is not possible for the nervous system or for the control of brain-machine interfaces). Therefore, given the above-described issues regarding real-time feedback provided to subjects in the paper, it is not clear to me that any strong conclusions can be drawn about the nervous system's ability or inability to achieve individuated motor unit recruitment.

      We hope that the above changes clarify the biofeedback modalities and their potential to provide subjects with the necessary information for exploring independent MU control. Our experiments aimed to investigate whether subjects can learn under constrained isometric conditions to decorrelate the activity between two MUs coming from the same functional pool. While it seemed that MU activity could be decorrelated, this almost exclusively happened (TIII-instructed trials) within a state-dependent framework, i.e. both MUs must be activated first before the lower threshold one is switched off. We did not observe flexible MU control based exclusively on a selective input to individual MUs (MU2 activated before MU1 during initial recruitment). That does not mean that such control is impossible. However, all successful control strategies that were voluntarily explored by the subjects to achieve flexible control were based on a common input and history-dependent activation of MUs. We have added these concepts to the discussion section.

      Second, to support the claims based on their data the authors must explain their online spike-sorting method and provide evidence that it can successfully discriminate distinct motor unit onset/offset times at the low latency that would be required to test their claims. In the current manuscript, authors do not address this at all beyond referring to their recent IEEE paper (ref [25]). However, although that earlier paper is exciting and has many strengths (including simultaneous recordings from intramuscular and surface EMGs), the IEEE paper does not attempt to evaluate the performance metrics that are essential to the current project. For example, the key metric in ref 25 is "rate-of-agreement" (RoA), which measures differences in the total number of motor unit action potentials sorted from, for example, surface and intramuscular EMG. However, there is no evaluation of whether there is agreement in recruitment or de-recruitment times (the key variable in the present study) for motor units measured both from the surface and intramuscularly. This important technical point must be addressed if any conclusions are to be drawn from the present data.

      We have taken this comment into careful consideration and have performed a validation based on concomitant intramuscular and surface EMG decomposition in the exact experimental conditions of this study, including variations in the speed of recruitment and de-recruitment. This new validation fully supports the accuracy of the methods used when detecting recruitment and de-recruitment of motor units.

      My final concern is that the authors' key conclusion - that the nervous system cannot or does not control motor units in an individuated fashion - is based on the assumption that the robust differences in de-recruitment time that subjects display cannot be due to differences in descending control, and instead must be due to changes in intrinsic motor unit excitability within the spinal cord. The authors simply assert/assume that "[derecruitment] results from the relative intrinsic excitability of the motor neurons which override the sole impact of the receive synaptic input". This may well be true, but the authors do not provide any evidence for this in the present paper, and to me it seems equally plausible that the reverse is true - that de-recruitment might be influenced by descending control. This line of argumentation therefore seems somewhat circular.

      When subjects were asked to reach TIII, which required the sole activation of a higher threshold MU, subjects almost exclusively chose to activate both MUs first before switching off the lower threshold MU. It may be that the lower de-recruitment threshold of MU2 was determined by descending inputs changing the excitability of either MU1 or MU2 (for example, see J. Nielsen, C. Crone, T. Sinkjær, E. Toft, and H. Hultborn, “Central control of reciprocal inhibition during fictive dorsiflexion in man,” Exp. brain Res., vol. 104, no. 1, pp. 99–106, Apr. 1995 or E. Jankowska, “Interneuronal relay in spinal pathways from proprioceptors,” Prog. Neurobiol., vol. 38, no. 4, pp. 335–378, Apr. 1992). Even if that is the case, it remains unknown why such a command channel that potentially changes the excitability of a single MU was not voluntarily utilized at the initial recruitment to allow for direct movement towards TIII (as direct movement was preferred for TI and TII). We cannot rule out that de-recruitment was affected by selective descending commands. However, our results match observations made in previous studies on intrinsic changes of MU excitability after MU recruitment. Therefore, even if descending pathways were utilized throughout the experiment to change, for example, MU excitability, subjects were not able to explore such pathways to change initial recruitment and achieve general flexible control over MUs. The updated discussion explains this line of reasoning.

      Reviewer #4 (Public Review):

      [...]

      1. Figure 6a nicely demonstrates the strategy used by subjects to hit target TIII. In this example, MU2 was both recruited and de-recruited after MU1 (which is the opposite of what one would expect based on the standard textbook description). The authors state (page 17, line 15-17) that even in the reverse case (when MU2 is de-recruited before MU1) the strategy still leads to successful performance. I am not sure how this would be done. For clarity, the authors could add a panel similar to panel A to this figure but for the case where the MU pairs have the opposite order of de-recruitment.

      We have added more examples of successful TIII-instructed trials in Supplement 4. Supplement 4C and D illustrate examples of subjects navigating the cursor inside TIII even when MU2 was de-recruited before MU1. As these examples show, subjects also used the three-stage approach discussed in the manuscript. In contrast to successful trials in which MU2 was de-recruited after MU1 (for example, Supplement 4B), subjects required multiple attempts before finding a precise force level that allowed continuous firing of MU2 while MU1 remained silent. We have added a possible explanation for such behaviour in the Discussion.

      1. The authors discuss a possible type of flexible control which is not evident in the recruitment order of MUs (page 19, line 27-28). This reasoning was not entirely clear to me. Specifically, I was not sure which of the results presented here needs to be explained by such a mechanism.

      We have shown that subjects can decorrelate the discharge activity of MU1 and MU2 once both MUs are active (e.g. reaching TIII). Thus, flexible control of the MU pair was possible after the initial recruitment. Therefore, this kind of control seems strongly linked to a specific activation state of both MUs. We further elaborated on which potential mechanisms may contribute to this state-dependent control.

      1. The authors argue that using a well-controlled task is necessary for understanding the ability to control the descending input to MUs. They thus applied a dorsi-flexion paradigm and MU recordings from TA muscles. However, it is not clear to what extent the results obtained in this study can be extrapolated to the upper limb. Controlling the MUs of the upper limb could be more flexible and more accessible to voluntary control than the control of lower limb muscles. This point is crucial since the authors compare their results to other studies (Formento et al., bioRxiv 2021 and Marshall et al., bioRxiv 2021) which concluded in favor of the flexible control of MU recruitment. Since both studies used the MUs of upper limb muscles, a fair comparison would involve using a constrained task design but for upper limb muscles.

      We agree with the reviewer that our work differs from previous approaches, which also studied flexible MU control. We, therefore, added a paragraph to the limitation section of the Discussion.

      1. The authors devote a long paragraph in the discussion to account for the variability in the de-recruitment order. They mostly rely on PIC, but there is no clear evidence that this is indeed the case. Is it at all possible that the flexibility in control over MUs was over their recruitment threshold? Was there any change in de-recruitment of the MUs during learning (in a given recording session)?

      The de-recruitment threshold did not critically change when compared before and after the experiment on each day (difference in de-recruitment threshold before and after the experiment: -0.16 ± 2.28% MVC, we have now added this result to the Results section). Deviations from the classical recruitment order may be achieved by temporal (short-lived) changes in the intrinsic excitability of single MUs. We, therefore, extended our discussion on potential mechanisms that may explain the observed variability given all MUs receive the same common input.

      1. The need for a complicated performance measure (defined on page 5, line 3-6) is not entirely clear to me. What is the correlation between this parameter and other, more conventional measures such as total-movement time or maximal deviation from the straight trajectory? In addition, the normalization process is difficult to follow. The best performance was measured across subjects. Does this mean that single subject data could be either down or up-regulated based on the relative performance of the specific subject? Why not normalize the single-subject data and then compare these data across subjects?

      We employed this performance metric to overcome shortcomings of traditional measures such as target hit count, time-to-target or deviation from the straight trajectory. Such problems are described in the illustration below for TIII-instructed trials (blue target). A: the duration of the trial is the same in both examples (left and right); however, on the left, the subject manages to keep the cursor close to the target-of-interest while on the right, the cursor is far away from the target centre of TIII. B: In both images the cursor has the same distance d to the target centre of TIII. However, on the left, the subject manages to switch off MU1 while keeping MU2 active, while on the right, both MUs are active. C: On the left, the subject manages to move the cursor inside TIII before the maximum trial time is reached, while on the right, the subject moves the cursor up and down, not diverging from the ideal trajectory to the target centre, but fails to place the cursor inside TIII within the duration of the trial. In all examples, using only one conventional measure fails to account for a higher performance value in the left scenario than in the right. Our performance metric combines several performance measures such as time-to-target, distance from the target centre, and the discharge rate ratio between MU1 and MU2 via the angle 𝜑 and thus allows a more detailed analysis of the performance than conventional measures. The normalisation of the performance value was done to allow for a comparison across subjects. The best and worst performance were estimated using synthetic data mimicking ideal movement towards each target (i.e. immediate start from the target origin to the centre of the target, while the normalised discharge rate of the corresponding MU is set to 1). Since the target space is normalised for all subjects in the same manner (mean discharge rate of the corresponding MUs at 10% MVC), this allows us to compare the performance between subjects, conditions and targets.

      1. Figure 3C appears to indicate that there was only moderate learning across days for target TI and TII. Even for target TIII there was some improvement but the peak performance in later days was quite poor. The fact that the MUs were different each day may have affected the subjects' ability to learn the task efficiently. It would be interesting to measure the learning obtained on single days.

      We have added an analysis that estimated the learning within a session per subject and target (Supplement 3C). In order to evaluate the strength of learning within-session, the Spearman correlation coefficient between target-specific performance and consecutive trials was calculated and averaged across conditions and days. The results suggest that there was little learning within sessions and no significant difference between targets. These results have now been added to the manuscript.
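
      The within-session analysis described above can be sketched as follows. This is a minimal illustration, assuming one performance value per consecutive trial for a given target; the helper names and the performance values are hypothetical, and the authors' actual implementation is not described in the response.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)

# Hypothetical performance values for consecutive trials of one target.
trial_index = [1, 2, 3, 4, 5]
performance = [0.20, 0.35, 0.30, 0.50, 0.55]
rho = spearman(trial_index, performance)
```

      A coefficient near zero for a session would correspond to the "little learning within sessions" reported above; averaging across conditions and days would then follow the procedure the response describes.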

      1. On page 16 line 12-13, the authors describe the rare cases where subjects moved directly towards TIII. These cases apparently occurred when the recruitment threshold of MU2 was lower. What is the probable source of this lower recruitment level in these specific trials? Was this incidental (i.e., the trial was only successful when the MU threshold randomly decreased) or was there volitional control over the recruitment threshold? Did the authors test how the MU threshold changed (in percentages) over the course of the training day?

      We did not track the recruitment threshold throughout the session but only at the beginning and end. We could not identify any critical changes in the recruitment order (see Results section). However, our analysis indicated that during direct movements towards TIII, MU2 (higher threshold MU) was recruited at a lower force level during the initial ramp and thus had a temporary effective recruitment threshold below MU1. It is important to note that these direct movements towards TIII only occurred for pairs of MUs with a similar recruitment threshold (see Figure 6). One possible explanation for this temporary change in recruitment threshold could be altered excitability due to neuromodulatory effects such as PICs (see Discussion). We have added an analysis that shows that direct movements towards TIII occurred in most cases (>90%) after a preceding TII- or TIII-instructed trial. Both of these targets-of-interest require activation of MU2. Thus, direct movement towards TIII was likely not the result of specific descending control. Instead, this analysis suggests that the PIC effect triggered at the preceding trial was not entirely extinguished when a trial ending in direct movement towards TIII started. Alternatively, the rare scenarios in which direct movements happened could be entirely random. Similar observations were made in previous biofeedback studies [31]. To clarify these points, we altered the manuscript.

    1. Author Response

      Reviewer #2 (Public Review):

      Summary: This substantial collaborative effort utilized virus-based retrograde tracing from cervical, thoracic and lumbar spinal cord injection sites, tissue clearing and cutting-edge imaging to develop a supraspinal connectome or map of neurons in the brain that project to the spinal cord. The need for such a connectome-atlas resource is nicely described, and the combination of the actual data with the means to probe that data is truly outstanding.

      They then compared the connectome from intact mice to those of mice with mild, moderate and severe spinal cord injuries to reveal the neuronal populations that retain axons and synapses below the level of injury. Finally, they look for correlations between the remaining neuronal populations and functional recovery to reveal which are likely contributing to recovery and its variability after injury. Overall, they successfully achieve their primary goals with the following caveats: The injury model chosen is not the most widely employed in the field, and the anatomical assessment of the injuries is incomplete/not ideal.

      Concerns/issues:

      1) I would like to see additional discussion/rationale for the chosen injury model and how it compares to other more commonly employed animal models and clinical injuries. Please relate how what is being observed with the supraspinal connectome might be different for these other models and for clinical injuries.

      We have added text to the Results and Discussion to explain our rationale for selecting the crush injury model, and to acknowledge differences between this model and more clinically relevant contusion models. (Results: line 360-364, Discussion 608-615). We agree wholeheartedly that a critical future direction will be to deploy brain-wide quantification in contusion models, and we are currently seeking funding to obtain the needed equipment.

      2) The assessment of the thoracic injuries employed is not ideal because it provides no anatomical description of spared white matter (or numbers of spared axons) at the injury epicenter.

      We address this more fully in the related point below. Briefly, we agree with a need to improve the assessment of the lesion but are hampered by tissue availability. We are unable to assess white matter sparing but can offer quantification of the width of residual astrocyte tissue bridges in four spinal sections from each animal (new Figure 5 – figure supplement 3). As discussed below, however, we recognize the limitations of the lesion assessment and agree with the larger point that the current quantification methods do not position us to make claims about the relative efficacy of spinal injury analyses versus whole-brain sparing analyses to stratify severity or predict outcomes. Our approach should be seen as a complement, not a substitute, for existing lesion-based analyses. We have edited language throughout the manuscript to make this position clearer.

      3) Related to this, but an issue that requires separate attention is the highly variable appearance of the injury and tracer/virus injection sites, the variability in the spatial relationship with labeled neurons (lumbar) and how these differences could influence labeling, sprouting of axons of passage and interpretation of the data. In particular this is referring to the data shown in Figure 6 (and related data).

      It is true that there is some variability in the relative position of the injury and injection, a surgical reality. The degree of variability was perhaps exaggerated in the original Figure 6 (Now Figure 5), in which one image came from one of two animals in the cohort with a notably larger gap between the injury and injection. Nevertheless, this comment raises the important question of how variability in injection-to-injury distance might affect supraspinal label. First, we would emphasize the data in Figure 1 – Figure Supplement 6, in which we showed that the number of retrogradely labeled supraspinal neurons is relatively stable as injection sites are deliberately varied across the lower thoracic and lumbar cord. Indeed, the question raised here is precisely the reason we performed this early test to determine how sensitive the results might be to shifts in segmental targeting. The results indicate that retrograde labeling is fairly insensitive to L1 versus L4 targeting. As an additional check for this specific experiment we also measured the distance between the rostral spread of viral label and the caudal edge of the lesion and plotted it against the total number of retrogradely labeled neurons in the brain. If a smaller injury/injection gap favored more labeling we might expect negative correlation, but none is apparent. We conclude that although the injury/injection distance did vary in the experiment, it likely did not exert a strong influence on retrograde labeling.

      Reviewer #3 (Public Review):

      In this manuscript, Wang et al describe a series of experiments aimed at optimizing the experimental and computational approach to the detection of projection-specific neurons across the entire mouse brain. This work builds on a large body of work that has developed nuclear-fused viral labelling, next-generation fluorophores, tissue clearing, image registration, and automated cell segmentation. They apply their techniques to understand projection-specific patterns of supraspinal neurons to the cervical and lumbar spinal cord, and to reveal brain and brainstem connections that are preferentially spared or lost after spinal cord injury.

      Strengths:

      Although this work does not put forward any fundamentally new methodologies, their careful optimization of the experimental and quantification process will be appreciated by other laboratories attempting to use these types of methods. Moreover, the observations of topological arrangement of various supraspinal centres are important and I believe will be interesting to others in the field.

      The web app provided by the authors provides a nice interface for users to explore these data. I think this will be appreciated by people in the field interested in what happens to their brain or brainstem region of interest.

      Weaknesses:

      Overall the work is well done; however, some of the novelty claims should be better aligned with the experimental findings. Moreover, the statistical approaches put forward to understand the relationship between spinal cord injury severity and cell counts across the mouse brain need to be more carefully considered.

      The authors state that they provide an experimental platform for these types of analysis to be done. My apologies if I missed it but I could not find anywhere the information on viral construct availability or code availability to reproduce the results. Certainly both of these aspects would be required for people to replicate the pipeline. Moreover, the described methodology for imaging and processing is quite sparse. While I appreciate that this information is widely provided in papers that have developed these methods, I do not think it is appropriate to claim to have provided a platform for people to enable these types of analyses without a more in-depth description of the methods. Alternatively, the authors could instead focus on how they optimized current methodologies and avoid the overstatement that this work provides a tool for users. The exception to this is of course the viral constructs, the plasmids of which should be deposited.

      We agree that we have not provided a tool per se, more of an example that could be followed. We have revised language in the abstract, introduction, and discussion to make it clear that we optimized existing methods and provide an example of how this can be done, but are not offering a “plug and play” solution to the problem of registration that would, for example, allow upload of external data. For example, in the abstract we replaced “We now provide an experimental platform” with “Here we assemble an experimental workflow.” (Line 28). The term “platform” no longer appears in the manuscript and has been replaced throughout by “example.” We hope this matches the intention of the comment and are happy to revise further as needed. Note that the plasmids have been deposited to Addgene.

      It was not completely clear to me why or when the authors switch back and forth between different resolutions throughout the manuscript. In the abstract it states that 60 regions were examined, but elsewhere the number is as many as 500. My understanding is that current versions of the Allen Brain Annotation include more than 2000 regions. I think it would make things clear for the readers if a single resolution was used throughout, or at least justified narratively throughout the text to avoid confusion.

      Thank you for pointing this out. The Cellfinder application recognizes 645 discrete regions in the brain, and across all experiments we detected supraspinal nuclei in 69 of these. This number, however, includes some very fine distinctions, for example three separate subregions of vestibular nuclei, three subregions of the superior olivary complex, etc. True experts may desire this level of information, but with the goal of accessibility we find it useful to collapse closely related / adjacent regions to an umbrella term. Doing so generates a list of 25 grouped or summary regions. In the revised version we move the 69-region data completely to the supplemental data (there for the experts who wish to parse), and use the consistent 25-region system (plus cervical spinal cord in later sections) to present data in the main figures. We have added text to the Results section (lines 157-162) to clarify this grouping system.

      The authors provide an interesting analysis of the difference between cervical and lumbar projections. I think this might be one of the more interesting aspects of the paper - yet I found myself a bit confused by the analysis, and whether any of the differences observed were robust. Just prior to this experiment the authors provide a comparison of the mScarlet vs. the mGL, and demonstrate that mGL may label more cells. Yet, in the cervical vs. lumbar analysis it appears they are being treated 1 to 1. Moreover, I could not find any actual statistical analysis of this data. My impression would be that given the potential difference in labelling efficiency between the mScarlet and mGL this should be done using some kind of count analysis that takes into account the overall number of neurons labelled, such as a Chi-sq test or perhaps something more sophisticated. Then, with this kind of statistical analysis in place, do any of the discussed differences hold up? If not, I do not think this would detract from the interesting topological observations - but would call on the authors to be a bit more conservative about their statements and discussion regarding differences in the proportions of neurons projecting to certain supraspinal centers.

      This is an important point. In response to this input and related comments from other reviewers we performed new experiments to assess co-localization. The new data address the point above by including quantification of the degree of colocalization that results from titer-matched co-injection of the two fluorophores, providing baseline data. The results of this can be found in Figure 3 – figure supplement 3 and form the basis for statistical comparisons to experimental animals shown in Figure 3.
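
      For readers unfamiliar with the count-based comparison the reviewer suggests, a chi-square test on a 2x2 table of label counts could be sketched as below. This illustrates the reviewer's suggestion only, not the co-localization analysis the authors adopted; the counts and the function name `chi_square_2x2` are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows = fluorophore (cervical label, lumbar label),
# columns = neurons inside one region vs. all other labelled neurons.
stat, p_value = chi_square_2x2(20, 10, 10, 20)
```

      Because the test conditions on the row totals, a fluorophore that labels more cells overall does not by itself produce a significant result; only a shift in the proportion of cells falling in the region does.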

      Finally, I do have some concerns about the authors' use of linear regression in their analysis of brain regions after varying severities of SCI. First of all, the BMS score is notoriously non-linear. Despite wide use of linear regressions in the field to attempt to associate various outcomes to these kinds of ordinal measures, this is not appropriate. Some have suggested a rank conversion of the BMS prior to linear analyses, but even this comes with its own problems. Ultimately, the authors have here 2-3 clear cohorts of behavioral scores and drawing a linear regression between these is unlikely to be robustly informative. Moreover, it is unclear whether the authors properly adjusted their p-values from running these regressions on 60 (600?) regions. Finally, the statement in the abstract and discussion that the authors "explain more variability" compared to typical lesion severity analysis is also unsupported. My suggestion would be the following:

      Remove the linear regression analyses associated with BMS. I do not think these add value to the paper, and if anything provide a large window of false interpretation due to a violation of the assumptions of this test.

      Consider adding a more appropriate statistical analysis of the brain regions, such as a non-parametric group analysis. Knowing which brain regions are severity dependent, and which ones are not, would already be an interesting finding. This finding would not be confounded by any attempt to link it to crude measures of behavior.

      We agree that the linear regression approach was flawed and appreciate the opportunity to correct it. After consultation with two groups of statisticians we were forced to conclude that the data are simply underpowered for mixed model and ranking approaches. We therefore adopted a much simpler strategy. As you point out (and as noted by the statisticians), the behavioral data are bimodal; one group of animals regained plantar stepping ability, albeit with varying degrees of coordination (BMS 6-8), while the others showed at most rare plantar steps (BMS 0-3.5). We therefore asked whether the number of spared neurons in each brain region differed between the two groups and also examined the degree of “overlap” in the sparing values between the two groups. The data are now presented in Figure 6.
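      As an illustration only, the two-group comparison described above can be sketched as follows. The response does not name the test that was used, so the Mann-Whitney U statistic here is an assumption chosen because it is a standard non-parametric group comparison; the field names, the range-overlap measure, and all values are hypothetical toy inputs, not data from the study.

```python
def split_by_bms(animals, low_max=3.5, high_min=6.0):
    """Split spared-neuron counts into low- and high-performing groups by BMS."""
    low = [a["spared"] for a in animals if a["bms"] <= low_max]
    high = [a["spared"] for a in animals if a["bms"] >= high_min]
    return low, high


def mann_whitney_u(low, high):
    """Count (low, high) pairs in which the high-group value is larger; ties count 0.5."""
    u = 0.0
    for l in low:
        for h in high:
            if h > l:
                u += 1.0
            elif h == l:
                u += 0.5
    return u


def range_overlap(low, high):
    """Width of the interval shared by the two groups' value ranges (0 if none)."""
    return max(0, min(max(low), max(high)) - max(min(low), min(high)))


# Hypothetical toy data for a single brain region.
animals = [
    {"bms": 0.0, "spared": 10},
    {"bms": 2.0, "spared": 20},
    {"bms": 3.5, "spared": 30},
    {"bms": 6.0, "spared": 25},
    {"bms": 7.0, "spared": 60},
    {"bms": 8.0, "spared": 80},
]

low, high = split_by_bms(animals)
u_stat = mann_whitney_u(low, high)
overlap = range_overlap(low, high)
print(u_stat, len(low) * len(high), overlap)
```

      A U statistic close to the number of pairs indicates that sparing in the high-performing group is almost always larger, while the overlap width gives a crude sense of how cleanly the two groups separate in that region.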

      If the authors would like to state anything about 'explaining more variability' then the proper statistical analysis should be used, which in this case would be to compare the models using a LRT or equivalent. However, as I mentioned it does not seem to be appropriate to be doing this with linear models so the authors should consider a non-linear equivalent if they choose to proceed with this.

      We thank the reviewer for the excellent suggestion. However, as we explained above, after consultation with two groups of statisticians we were forced to conclude that the data are underpowered, so we could not apply some of the methods suggested. Especially in light of our simplified analysis, we think it is better to remove any claims that sparing in particular regions explains more or less variability. Instead we simply report that sparing in some regions, but not others, is significantly different between the “low-performing” and “high-performing” groups.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, the authors find CpGs within 500Kb of a gene that associate with transcript abundance (cis-eQTMs) in children from the HELIX study. There is much to admire about this work. With two notable exceptions, their work is solid and builds/improves on the work that came before it. Their catalogue of eQTMs could be useful to many other researchers that utilize methylation data from whole blood samples in children. Their annotation of eQTMs is well thought out and exhaustive. As this portion of the work is descriptive, most of their methods are appropriate.

      Unfortunately, their use of results from a model that does not account for cell-type proportions across samples diminishes the utility and impact of their findings. I believe that their catalog of eQTMs contains a great deal of spurious results that primarily represent the differences in cell-type proportions across samples.

      Lastly, the authors postulate that the eQTM gene associations found uniquely in their unadjusted model (in comparison to results from a model that does account for cell type proportion) represent cell-specific associations that are lost when a fully-adjusted model is assumed. To test this hypothesis, the authors appear to repurpose methods that were not intended for the purposes used in this manuscript. The manuscript lacks adequate statistical validation to support their repurposing of the method, as well as the methodological detail needed to peer review it. This section is a distraction from an otherwise worthy manuscript. But provide evidences that enriched for cell sp CpGs.

      Major points

      1. Line 414-475: In this section, the authors are suggesting that CpGs that are significant without adjusting for cell type are due to methylation-expression associations that are found only in one cell type, while associations found in the fully adjusted model are associations that are shared across the cell types. I do not agree with this hypothesis, as I do not agree that the confounding that occurs when cell-type proportions are not accounted for would behave in this way. Although restricting their search for eQTMs to only those CpGs proximal to a gene will reduce the number of spurious associations, a great deal of the findings in the authors' unadjusted model likely reflect differences in cell-type proportions across samples alone. The Reinius manuscript, cited in this paper, indicates that gene-proximal CpGs can have methylation patterns that vary across cell types.

      Following reviewers’ recommendations, we have reconsidered our initial hypothesis about the role of cellular composition in the association between methylation and gene expression. Although we still think that some of the eQTMs only found in the model unadjusted for cellular composition could represent cell-specific effects, we acknowledge that the majority might be confounded by the extensive gene expression and DNA methylation differences between cell types. Also, we recognize that more sophisticated statistical tests should be applied to prove our hypothesis. Because of this, we have decided to report the eQTMs of the model adjusted for cellular composition in the main manuscript and keep the results of the model unadjusted for cellular composition only in the online catalogue.

      2. Line 476-488: Their evidence due to F-statistics is tenuous. The authors do not give enough methodological detail to explain how they're assessing their hypothesis in the results or methods (lines 932-946) sections. The methods they give are difficult to follow. The results in figure S19A are not compelling. The citation in the methods (by Reinius) does not make sense, because Reinius et al did not use F-statistics as a proxy for cell type specificity. The citation that the authors give for this method in the results does not appear to be appropriate for this analysis, either. Jaffe and Irizarry state that a CpG with a high F-statistic indicates that the methylation at that CpG varies across cell type. They suggest removing these CpGs from significant results, or estimating and correcting for cell type proportions, as their presence would be evidence of statistical confounding. The authors of this manuscript indicate that they find higher F-statistics among the eQTMs uniquely found in the unadjusted model, which seems to only strengthen the idea that the unadjusted model is suffering from statistical confounding.

      We recognize our misinterpretation of the F-statistic in relation to cellular composition. We have removed this entire section from the updated version of the manuscript.

      3. The methods used to generate adjusted p-values in this manuscript are not appropriate as they are written. Further, they are nothing like the methods used in the paper cited by the authors. The Bonder paper used permutations to estimate an empirical FDR and cites a publication by Westra et al for their method (below). The Westra paper is a better one to cite, because the methods are more clear. Neither the Bonder nor the Westra paper uses the BH procedure for FDR.

      Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 45, 1238-1243 (2013).

      We apologize for this misleading citation. Although Bonder et al applied a permutation approach to adjust for multiple testing, our approach was inspired by the method applied in the GTEx project (GTEx consortium, 2020), using CpGs instead of SNPs. The citation has been corrected in the manuscript. Moreover, we have explained the whole multiple-testing process in more detail in the Material and Methods section (page 14, line 316):

      “To ensure that CpGs paired to a higher number of Genes do not have higher chances of being part of an eQTM, multiple-testing was controlled at the CpG level, following a procedure previously applied in the Genotype-Tissue Expression (GTEx) project (Gamazon et al., 2018). Briefly, our statistic used to test the hypothesis that a pair CpG-Gene is significantly associated is based on considering the lowest p-value observed for a given CpG and all its pairs Gene (e.g. those in the 1 Mb window centered at the TSS). As we do not know the distribution of this statistic under the null, we used a permutation test. We generated 100 permuted gene expression datasets and ran our previous linear regression models obtaining 100 permuted p-values for each CpG-Gene pair. Then, for each CpG, we selected among all CpG-Gene pairs the minimum p-value in each permutation and fitted a beta distribution that is the distribution we obtain when dealing with extreme values (e.g. minimum) (Dudbridge and Gusnanto, 2008). Next, for each CpG, we took the minimum p-value observed in the real data and used the beta distribution to compute the probability of observing a lower p-value. We defined this probability as the empirical p-value of the CpG. Then, we considered as significant those CpGs with empirical p-values to be significant at 5% false discovery rate using Benjamini-Hochberg method. Finally, we applied a last step to identify all significant CpG-Gene pairs for all eCpGs. To do so, we defined a genome-wide empirical p-value threshold as the empirical p-value of the eCpG closest to the 5% false discovery rate threshold. We used this empirical p-value to calculate a nominal p-value threshold for each eCpG, based on the beta distribution obtained from the minimum permuted p-values. This nominal p-value threshold was defined as the value for which the inverse cumulative distribution of the beta distribution was equal to the empirical p-value. Then, for each eCpG, we considered as significant all eCpG-Gene variants with a p-value smaller than nominal p-value.”
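      To make one step of the quoted procedure concrete, the Benjamini-Hochberg correction applied to the per-CpG empirical p-values can be sketched as below. This is a minimal standard-library illustration, not the authors' code: the function name and the toy p-values are hypothetical, and the beta-distribution fit that produces the empirical p-values is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical empirical p-values, one per CpG.
empirical_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(empirical_p)
significant = [q <= 0.05 for q in adjusted]
print(adjusted)
print(significant)
```

      In this toy input only the first two CpGs pass the 5% threshold; the third and fourth have raw p-values below 0.05 but do not survive the correction.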

      References:

      GTEx consortium, The GTEx Consortium atlas of genetic regulatory effects across human tissues, Science (2020) Sep 11;369(6509):1318-1330. doi: 10.1126/science.aaz1776.

      Reviewer #2 (Public Review):

      Strength:

      Comprehensive analysis. Considering genetic factors such as meQTL and comparing results with adult data are interesting.

      We thank the reviewer for his/her positive feedback on the manuscript. We agree that the analysis of genetic data and the comparison with eQTMs described in adults are two important points of the study.

      Weakness:

      • Manuscript is not summarized well. Please send less important findings to supplementary materials. The manuscript is not well written, which includes every little detail in the text, resulting in 86 pages of the manuscript.

      Following reviewers’ comments, we have simplified the manuscript. Now only the eQTMs identified in the model adjusted for cellular composition are reported. In addition, functional enrichment analyses have been simplified without reporting all odds ratios (OR) and p-values, which can be seen in the Figures.

      • Any possible reason that the eQTM methylation probes are enriched in weak transcription regions? This is surprising.

      Bonder et al also found that blood eQTMs were slightly enriched for weak transcription regions (TxWk). Weak transcription regions are highly constitutive and found across many different cell types (Roadmap Epigenomics Consortium, 2015). However, hematopoietic stem cells and immune cells have lower representation of TxWk and other active states, which may be related to their capacity to generate sub-lineages and enter quiescence.

      Given that we analyzed whole blood and that ROADMAP chromatin states are only available for blood-specific cell types, each CpG in the array was annotated to one or several chromatin states by taking a state as present in that locus if it was described in at least 1 of the 27 blood-related cell types. By applying this strategy we may be “over-representing” TxWk chromatin states if they are cell-type specific. As a result, even if each blood cell type might have few TxWk, many positions can be TxWk in at least one cell type, inflating the number of CpGs considered as TxWk. This might have affected some of the enrichments.
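      The inflation described above can be illustrated with a toy example. The sketch below is hypothetical throughout (three cell types instead of 27, four CpGs, invented state assignments, and "Quies" used only as a placeholder for any non-TxWk state); it simply shows how an "at least one cell type" rule can label more positions as TxWk than any single cell type does.

```python
def fraction_with_state(states, state):
    """Fraction of CpGs carrying the given state in one cell type."""
    return sum(1 for s in states.values() if s == state) / len(states)


def union_annotation(states_by_cell_type):
    """Annotate each CpG with every state seen in at least one cell type."""
    union = {}
    for states in states_by_cell_type.values():
        for cpg, state in states.items():
            union.setdefault(cpg, set()).add(state)
    return union


# Hypothetical toy data: each cell type has TxWk at a different CpG.
toy_states = {
    "cell_type_A": {"cpg1": "TxWk", "cpg2": "Quies", "cpg3": "Quies", "cpg4": "Quies"},
    "cell_type_B": {"cpg1": "Quies", "cpg2": "TxWk", "cpg3": "Quies", "cpg4": "Quies"},
    "cell_type_C": {"cpg1": "Quies", "cpg2": "Quies", "cpg3": "TxWk", "cpg4": "Quies"},
}

per_cell_type = {
    name: fraction_with_state(states, "TxWk") for name, states in toy_states.items()
}
union = union_annotation(toy_states)
union_fraction = sum(1 for s in union.values() if "TxWk" in s) / len(union)
print(per_cell_type)
print(union_fraction)
```

      Each toy cell type has TxWk at a quarter of the positions, yet three quarters of the positions are TxWk under the union rule.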

      On the other hand, CpG probe reliability depends on methylation levels and variance. TxWk regions show high methylation levels, which tend to be measured with more error. This also might have affected the results; however, the analysis considering only reliable probes (ICC >0.4) showed similar enrichment for TxWk.

      Besides these, we do not have a clear answer for the question raised by the reviewer.

      References:

      Bonder MJ, Luijk R, Zhernakova D V, Moed M, Deelen P, Vermaat M, et al. Disease variants alter transcription factor levels and methylation of their binding sites. Nat Genet [Internet]. 2017 [cited 2017 Nov 2];49:131–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27918535

      Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ, Amin V, Whitaker JW, Schultz MD, Ward LD, Sarkar A, Quon G, Sandstrom RS, Eaton ML, Wu YC, Pfenning AR, Wang X, Claussnitzer M, Liu Y, Coarfa C, Harris RA, Shoresh N, Epstein CB, Gjoneska E, Leung D, Xie W, Hawkins RD, Lister R, Hong C, Gascard P, Mungall AJ, Moore R, Chuah E, Tam A, Canfield TK, Hansen RS, Kaul R, Sabo PJ, Bansal MS, Carles A, Dixon JR, Farh KH, Feizi S, Karlic R, Kim AR, Kulkarni A, Li D, Lowdon R, Elliott G, Mercer TR, Neph SJ, Onuchic V, Polak P, Rajagopal N, Ray P, Sallari RC, Siebenthall KT, Sinnott-Armstrong NA, Stevens M, Thurman RE, Wu J, Zhang B, Zhou X, Beaudet AE, Boyer LA, De Jager PL, Farnham PJ, Fisher SJ, Haussler D, Jones SJ, Li W, Marra MA, McManus MT, Sunyaev S, Thomson JA, Tlsty TD, Tsai LH, Wang W, Waterland RA, Zhang MQ, Chadwick LH, Bernstein BE, Costello JF, Ecker JR, Hirst M, Meissner A, Milosavljevic A, Ren B, Stamatoyannopoulos JA, Wang T, Kellis M. Integrative analysis of 111 reference human epigenomes. Nature. 2015 Feb 19;518(7539):317-30. doi: 10.1038/nature14248. PMID: 25693563; PMCID: PMC4530010.

      • The result that the magnitude of the effect was independent of the distance between the CpG and the TC TSS is surprising. Could you draw a figure where x-axis is the distance between the CpG site and TC TSS and y-axis is p-value?

      As suggested by the reviewer, we have taken a more detailed look at the relationship between the effect size and the distance between the CpG and the TC’s TSS. First, we confirmed that the relative orientation (upstream or downstream) did not affect the strength of the association (p-value=0.68). Second, we applied a linear regression between the absolute log2 fold change and the log10 of the distance (in absolute value), finding that they were inversely related. We have updated the manuscript with this information (page 22, line 504):

      “We observed an inverse linear association between the eCpG-eGene’s TSS distance and the effect size (p-value = 7.75e-9, Figure 2B); while we did not observe significant differences in effect size due to the relative orientation of the eCpG (upstream or downstream) with respect to the eGene’s TSS (p-value = 0.68).”

      Results are shown in Figure 2B. Of note, we winsorized effect size values in order to improve the visualization. The winsorizing process is also explained in the Figure 2 legend. Moreover, we have produced the plot suggested by the reviewer (see below). It shows that the associations with the smallest p-values are found close to the TC’s TSS. Nonetheless, as this pattern is also observed for the effect sizes, we have decided not to include it in the manuscript.
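      For readers who want to see the shape of the distance analysis described in this response, a minimal sketch follows. It regresses the absolute log2 fold change on the log10 of the absolute CpG-TSS distance using ordinary least squares written out by hand. The helper name and all values are hypothetical toy data, and winsorizing and significance testing are omitted.

```python
import math


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: signed distance to the TSS (bp) and log2 fold change.
distances_bp = [-100, 1_000, -10_000, 100_000, 500_000]
log2_fc = [0.50, -0.40, 0.30, -0.20, 0.13]

# Orientation and direction of effect are discarded, as in the response.
x = [math.log10(abs(d)) for d in distances_bp]
y = [abs(fc) for fc in log2_fc]

slope, intercept = ols_slope_intercept(x, y)
print(slope, intercept)
```

      A negative slope corresponds to the inverse relationship reported above: effect sizes shrink as the CpG lies farther from the TSS.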

      • Concerned about too many significant eQTMs. Almost half of genes are associated with methylation. I wonder if false positives are well controlled using the empirical p-values. Using empirical p-value with permutation may mislead since especially you only use 100 permutations. I wonder whether the result would be similar if they compare their result with the traditional way, either adjusting p-values using p-values from entire TCs or adjusting p-values using a gene-based method as commonly used in GWAS. Compare your previous result with my suggestion for the first analysis.

      Although the number of genes (TCs) whose expression is associated with DNA methylation is quite high, we do not think this is due to poor control of false positives. Our approach is based on the method used by GTEx (GTEx consortium, 2020) and implemented in the FastQTL package (Ongen et al., 2016) to control for false positives in eQTL discovery. As in GTEx, we ran 100 permutations to estimate the parameters of a beta distribution, which we used to model the distribution of p-values for each CpG. Then, to correct for the number of TCs among significant CpGs, we applied False Discovery Rate (FDR) at a threshold < 0.05. Finally, we defined the final set of significant eQTMs using the beta distribution defined in a previous step.

      For illustration, we compared the number of eQTMs obtained with our approach to what we would obtain by applying the FDR method alone (adjusted p-value <0.05), getting fewer associations with our approach: eQTMs (45,203 with FDR vs 39,749 with our approach), eCpGs (24,611 vs 21,966) and eGenes (9,937 vs 8,886). Among the 8,886 significant eGenes, 6,288 of them are annotated to coding genes, thus representing 27% of the 23,054 eGenes coding for a gene included in the array.

      References:

      GTEx consortium, The GTEx Consortium atlas of genetic regulatory effects across human tissues, Science (2020) Sep 11;369(6509):1318-1330. doi: 10.1126/science.aaz1776.

      Ongen et al. Fast and efficient QTL mapper for thousands of molecular phenotypes, Bioinformatics (2016) May 15;32(10):1479-85. doi: 10.1093/bioinformatics/btv722. Epub 2015 Dec 26.

      • I recommend starting with cell type specific results. Without adjusting cell type, the result doesn't make sense.

      As suggested by other reviewers, we have withdrawn the model unadjusted for cellular composition.

      Reviewer #3 (Public Review):

      Although several DNA methylation-gene expression studies have been carried out in adults, this is the first in children. The importance of this is underlined by the finding that surprisingly few associations are observed in both adults and children. This is a timely study and certain to be important for the interpretation of future omic studies in blood samples obtained from children.

      We agree with the reviewer that eQTMs in children are important for interpreting EWAS findings conducted in child cohorts such as those of the Pregnancy And Childhood Epigenetics (PACE) consortium.

      It is unfortunate that the authors chose to base their reporting on associations unadjusted for cell count heterogeneity. They incorrectly claim that associations linked to cell count variation are likely to be cell-type-specific. While possible, it is probably more likely that the association exists entirely due to cell type differences (which tend to be large) with little or no association within any of the cell types (which tend to be much smaller). In the interests of interpretability, it would be better to report only associations obtained after adjusting for cell count variation.

      Following reviewers’ recommendations, we have reconsidered our initial hypothesis about the role of cellular composition in the association between methylation and gene expression. Although we still think that some of the eQTMs only found in the model unadjusted for cellular composition could represent cell-specific effects, we acknowledge that the majority might be confounded by the extensive gene expression and DNA methylation differences between cell types. Also, we recognize that more sophisticated statistical tests should be applied to prove our hypothesis. Because of this, we have decided to report the eQTMs of the model adjusted for cellular composition in the main manuscript and keep the results of the model unadjusted for cellular composition only in the online catalogue.

      Several enrichments could be related to variation in probe quality across the DNA methylation arrays.

      For example, enrichment for eQTM CpG sites among those that change with age could simply be due to the fact age and eQTM effects are more likely to be observed for CpG sites with high quality probes than low quality probes. It is more informative to instead ask if eQTM CpG sites are more likely to have increasing rather than decreasing methylation with age. This avoids the probe quality bias since probes with positive associations with age would be expected to have roughly the same quality as those with negative associations with age. There are several other analyses prone to the probe quality bias.

      See answer to question 2, below.

    1. Author Response:

      Reviewer #1:

      This work provides insight into the effects of tetraplegia on the cortical representation of the body in S1. By using fMRI and an attempted finger movement task, the researchers were able to show preserved fine-grained digit maps - even in patients without sensory and motor hand function as well as no spared spinal tissue bridges. The authors also explored whether certain clinical and behavioral determinates may contribute to preserving S1 somatotopy after spinal cord injury.

      Overall I found the manuscript to be well-written, the study to be interesting, and the analysis reasonable. I do, however, think the manuscript would benefit by considering and addressing two main suggestions.

      1) Provide additional context / rationale for some of the methods. Specific examples below:

      a) The rationale behind using the RSA analysis seemed to be predicated on the notion that the signals elicited via a phase-encoded design can only yield information about each voxel's preferred digit and little-to-no information about the degree of digit overlap (see lines 163-166 and 571-575). While this is the case for conventional analyses of these signals, there are more recently developed approaches that are now capable of estimating the degree of somatotopic overlap from phase-encoded data (see: Da Rocha Amaral et al., 2020; Puckett et al., 2020). Although I personally would be interested in seeing one of these types of analyses run on this data, I do not think it is necessary given the RSA data / analysis. Rather, I merely think it is important to add some context so that the reader is not misled into believing that there is no way to estimate this type of information from phase-encoded signals. - Da Rocha Amaral S, Sanchez Panchuelo RM, Francis S (2020) A Data-Driven Multi-scale Technique for fMRI Mapping of the Human Somatosensory Cortex. Brain Topogr 33 (1):22-36. doi:10.1007/s10548-019-00728-6 - Puckett AM, Bollmann S, Junday K, Barth M, Cunnington R (2020) Bayesian population receptive field modeling in human somatosensory cortex. Neuroimage 208:116465. doi:10.1016/j.neuroimage.2019.116465

      We did not intend to give the impression that inter-finger overlap can only be estimated using RSA. To clarify this, we included a sentence in our methods section stating that inter-finger overlap cannot be estimated using the traditional travelling wave approach, but new methods have estimated somatotopic overlap from travelling wave data. Since our RSA approach lends itself to estimating inter-finger overlap and is currently the gold standard in characterizing these representational patterns, we opt, in accordance with the reviewer’s comment, not to include this additional analysis.

      Revised text Methods:

      “While the traditional traveling wave approach is powerful to uncover the somatotopic finger arrangement, a fuller description of hand representation can be obtained by taking into account the entire fine-grained activity pattern of all fingers. RSA-based inter-finger overlap patterns have been shown to depict the invariant representational structure of fingers better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). RSA-based measures are furthermore not prone to some of the problems of measurements of finger selectivity (e.g., dependence on map thresholds). The most common approach for investigating inter-finger overlap is RSA, as used here, though note that somatotopic overlap has recently been estimated from travelling wave data using an iterated Multigrid Priors (iMGP) method and population receptive field modelling (Da Rocha Amaral et al., 2020; Puckett et al., 2020).”

      b) The rationale for using minimally thresholded (Z>2) data for the Dice overlap analysis as opposed to the threshold used in data visualization (q<0.05) was unclear. Providing the minimally thresholded maps (in Supplementary) would also aid interpretation of the Dice overlap results.

      We followed previously published procedures for calculating the Dice overlap between the two split-halves of the data (Kikkert et al., 2016; Kolasinski et al., 2016; Sanders et al., 2019). We used minimally thresholded data to calculate the Dice overlap to ensure that our analysis was sensitive to overlaps that would be missed when using high thresholds. We clarified this in the revised manuscript. We thank the reviewer for their suggestion to add a Figure displaying the minimally thresholded split-half hard-edged finger maps - we have added this to the revised manuscript as Figure 2-Figure supplement 1.

      To ensure that our thresholding procedure did not change the results of the Dice overlap analysis, we repeated this analysis using split-half maps that were thresholded using a q < 0.05 FDR criterion (as was used to create the travelling wave maps in Figures 2A-B). We found the same results as when using the Z >2 thresholding criterion: Overall, split-half consistency was not significantly different between patients and controls, as tested using a robust mixed ANOVA (F(1,17.69) = 0.08, p = 0.79). There was a significant difference in split-half consistency between pairs of same, neighbouring, and non-neighbouring fingers (F(2,14.77) = 38.80, p < 0.001). This neighbourhood relationship was not significantly different between the control and patient groups (i.e., there was no significant interaction; F(2,14.77) = 0.12, p = 0.89). We have included this analysis and the related figure as Figure 2-Figure supplement 2 in the revised manuscript.

      Revised text Methods:

      “We followed previously described procedures for calculating the DOC between two halves of the travelling wave data (Kikkert et al., 2016; Kolasinski et al., 2016; Sanders et al., 2019). The averaged finger-specific maps of the first forward and backward runs formed the first data half. The averaged finger-specific maps of the second forward and backward runs formed the second data half. The finger-specific clusters were minimally thresholded (Z>2) on the cortical surface and masked using an S1 ROI, created based on Brodmann area parcellation using Freesurfer (see Figure 2 – figure supplement 1 for a visualisation of the minimally thresholded split-half hard-edged finger maps used to calculate the DOC). We used minimally thresholded finger-specific clusters for the DOC analysis to ensure we were sensitive to overlaps that would be missed when using high thresholds. Note that results were unchanged when thresholding the finger-specific clusters using an FDR q < 0.05 criterion (see Figure 2 – figure supplement 2).”
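      For reference, the Dice overlap between two thresholded finger clusters is twice the size of their intersection divided by the sum of their sizes. The sketch below illustrates this on sets of surface-vertex indices. The function name, the toy vertex sets, and the convention of returning 0 when both clusters are empty are assumptions made for illustration, not details taken from the manuscript.

```python
def dice_overlap(cluster_a, cluster_b):
    """Dice overlap coefficient between two sets of vertex indices."""
    a, b = set(cluster_a), set(cluster_b)
    if not a and not b:
        # Assumed convention: two empty clusters have no overlap.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical supra-threshold vertices for one finger in each data half.
first_half = {1, 2, 3, 4}
second_half = {3, 4, 5, 6}

print(dice_overlap(first_half, second_half))  # partial overlap
print(dice_overlap(first_half, first_half))   # identical clusters
print(dice_overlap(first_half, {7, 8}))       # disjoint clusters
```

      The coefficient ranges from 0 (no shared vertices) to 1 (identical clusters), so same-finger pairs across data halves are expected to score higher than neighbouring or non-neighbouring finger pairs.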

      2) Provide a more thorough discussion - particularly with respect to the possible role of top-down processes (e.g., attention).

      a) The authors discuss a few potential signal sources that may contribute to the maintenance of (and ability to measure) the somatotopic maps; however, the overall interpretation seems a bit "motor efferent heavy". That is, it seems the authors favor an explanation that the activity patterns measured in S1 were elicited by efference copies from the motor system and that occasional corollary discharges or attempted motor movements play a role in their maintenance over time. The authors consider other explanations, noting - for example - the potential role of attention in preserving the somatotopic representations given that attention has been shown to be able to activate S1 hand representations. The mention of this was, however, rather brief - and I believe the issue deserves a bit more of a balanced consideration.

      When the authors consider the possible role of attention in maintaining the somatotopic representations (lines 329-333), they mention that observing others' fingers being touched or attending to others' finger movements may contribute. But there is no mention of attending to one's own fingers (which has been shown to elicit activity as cited). I realize that the patients lack sensorimotor function (and hence may find it difficult to "attend" to their fingers); however, they have all had prior experience with their fingers and therefore might still be able to attend to them (or at least the idea of their digits) such that activity is elicited. For example, it is not clear to me that it would be any more difficult for the patients to be asked to attend to their digits compared to being asked to attempt to move their digits. I would even suggest that attempting to move a digit (regardless of whether you can or not) requires that one attends to the digit before attempting to initiate the movement as well as throughout the attempted motor movement. Because of this, it seems possible that attention-related processes could be playing a role in or even driving the signals measured during the attempted movement task - as well as those involved in the ongoing maintenance of the maps after injury. I don't think this possibility can be dismissed given the data in hand, but perhaps the issue could be addressed by a bit more thorough of a discussion on the process of "attempting to move" a digit (even one that does not move) - and the various top-down processes that might be involved.

      We thank the reviewer for their consideration and insights into the potential mechanisms underlying our results. We have now elaborated further on the possibility that attention-related processes might have contributed to the reported effects, also in consideration of comment 3.4.

      Revised text Discussion:

      “Spared spinal cord tissue bridges can be found in most patients with a clinically incomplete injury, their width being predictive of electrophysiological information flow, recovery of sensorimotor function, and neuropathic pain (Huber et al., 2017; Pfyffer et al., 2021, 2019; Vallotton et al., 2019). However, in this study, spared midsagittal spinal tissue bridges at the lesion level, motor function, and sensory function did not seem necessary to maintain and activate a somatotopic hand representation in S1. We found a highly typical hand representation in two patients (S01 and S03) who did not have any spared spinal tissue bridges at the lesion level, a complete (S01) or near complete (S03) hand paralysis, and a complete (S01) or near complete loss (S03) of hand sensory function. Our predictive modelling results were in line with this notion and showed that these behavioural and structural spinal cord determinants were not predictive of hand representation typicality. Note however that our sample size was limited, and it is challenging to draw definite conclusions from non-significant predictive modelling results.”

      “How may these representations be preserved over time and activated through attempted movements in the absence of peripheral information? S1 is reciprocally connected with various brain areas, e.g., M1, lateral parietal cortex, posterior parietal area 5, secondary somatosensory cortex, and supplementary motor cortex (Delhaye et al., 2019). After loss of sensory inputs and paralysis through SCI, S1 representations may be activated and preserved through its interconnections with these areas. First, it is possible that cortico-cortical efference copies may keep a representation ‘alive’ through occasional corollary discharge (London and Miller, 2013). While motor and sensory signals no longer pass through the spinal cord in the absence of spinal tissue bridges, S1 and M1 remain intact. When a motor command is initiated (e.g., in the form of an attempted hand movement) an efference copy is thought to be sent to S1 in the form of corollary discharge. This corollary discharge resembles the expected somatosensory feedback activity pattern and may drive somatotopic S1 activity even in the absence of ascending afferent signals from the hand (Adams et al., 2013; London and Miller, 2013). It is possible that our patients occasionally performed attempted movements which would result in corollary discharge in S1. Second, it is likely that attempting individual finger movements poses high attentional demands on tetraplegic patients. Accordingly, attentional processes might have contributed to eliciting somatotopic S1 activity. Evidence for this account comes from studies showing that it is possible to activate somatotopic S1 hand representations through attending to individual fingers (Puckett et al., 2017) or through touch observation (Kuehn et al., 2018). Attending to fingers during our attempted finger movement task may have been sufficient to elicit somatotopic S1 activity through top-down processes in the tetraplegic patients who lacked hand motor and sensory function.
Furthermore, one might speculate that observing others’ or one’s own fingers being touched or directing attention to others’ hand movements or one’s own fingers may help preserve somatotopic representations. Third, it is possible that these somatotopic maps are relatively hardwired and while they deteriorate over time, they never fully disappear. Indeed, somatotopic mapping of a sensory deprived body part has been shown to be resilient after dystonia (Ejaz et al., 2016; though see Burman et al., (2009) and Taub et al., (1998)) and arm amputation (Bruurmijn et al., 2017; Kikkert et al., 2016; Wesselink et al., 2019). Fourth, it is possible that even though a patient is clinically assessed to be complete and is unable to perceive sensory stimuli on the deprived body part, there is still some ascending information flow that contributes to preserving somatotopy (Wrigley et al., 2018). A recent study found that although complete paraplegic SCI patients were unable to perceive a brushing stimulus on their toe, 48% of patients activated the location appropriate S1 area (Wrigley et al., 2018). However, the authors of this study defined the completeness of patients’ injuries via behavioural testing, while we additionally assessed the retained connections passing through the SCI directly via quantification of spared spinal tissue bridges through structural MRI. It is unlikely that spinal tissue carrying somatotopically organised information would be missed by our assessment (Huber et al., 2017; Pfyffer et al., 2019). Our experiment did not allow us to tease apart these potential processes and it is likely that various processes simultaneously influence the preservation of S1 somatotopy and elicited the observed somatotopic S1 activity.”

      Reviewer #2:

      The authors investigate SCI patients and characterize the topographic representation of the hand in sensorimotor cortex when asked to move their hand (which controls could do but patients could not). The authors compare some parameters of topographic map organization and conclude that they do not differ between patients and controls, whereas they find changes in the typicality of the maps that decrease with years since disease onset in patients. Whereas these initial analyses are interesting, they are not clearly related to a mechanistic model of the disorder and the underlying pathophysiology that is expected in the patients. Furthermore, additional analyses on more fine-grained map changes are needed to support the authors' claims. Finally, the major result of changed typicality in the patients is in my view not valid.

      • Concept 1. At present, there are no clear hypotheses about the (expected or hypothesized) mechanistic changes of the sensorimotor maps in the patients. The authors refer to "altered" maps and repeatedly say that "results are mixed" (3 times in the introduction).

      We thank the reviewer for highlighting to us that our introduction and hypotheses were unclear and/or incomplete to them. We have restructured our Introduction to better highlight competing hypotheses on how SCI may change S1 hand representations, the reasons for our analytical approach, and elaborate on our hypotheses.

      Revised text Introduction:

      “Research in non-human primate models of chronic and complete cervical SCI has shown that the S1 hand area becomes largely unresponsive to tactile hand stimulation after the injury (Jain et al., 2008; Kambi et al., 2014; Liao et al., 2021). The surviving finger-related activity became disorganised such that a few somatotopically appropriate sites but also other somatotopically nonmatched sites were activated (Liao et al., 2021). Seminal nonhuman primate research has further demonstrated that SCI leads to extensive cortical reorganisation in S1, such that tactile stimulation of cortically adjacent body parts (e.g., of the face) activated the deprived brain territory (e.g., of the hand; Halder et al., 2018; Jain et al., 2008; Kambi et al., 2014). Although the physiological hand representation appears to largely be altered following a chronic cervical SCI in non-human primates, the anatomical isomorphs of individual fingers are unchanged (Jain et al., 1998). This suggests that while a hand representation can no longer be activated through tactile stimulation after the loss of afferent spinal pathways, a latent and somatotopic hand representation could be preserved regardless of large-scale physiological reorganisation.

      A similar pattern of results has been reported for human SCI patients. Transcranial magnetic stimulation (TMS) studies induced current in localised areas of SCI patients’ M1 to induce a peripheral muscle response. They found that representations of more impaired muscles retract or are absent while representations of less impaired muscles shift and expand (Fassett et al., 2018; Freund et al., 2011a; Levy et al., 1990; Streletz et al., 1995; Topka et al., 1991; Urbin et al., 2019). Similarly, human fMRI studies have shown that cortically neighbouring body part representations can shift towards, though do not invade, the deprived M1 and S1 cortex (Freund et al., 2011b; Henderson et al., 2011; Jutzeler et al., 2015; Wrigley et al., 2018, 2009). Other human fMRI studies hint at the possibility of latent somatotopic hand representations following SCI by showing that attempted movements with the paralysed and sensory deprived body part can still evoke signals in the sensorimotor system (Cramer et al., 2005; Freund et al., 2011b; Kokotilo et al., 2009; Solstrand Dahlberg et al., 2018). This attempted ‘net’ movement activity was, however, shown to substantially differ from healthy controls: Activity levels have been shown to be increased (Freund et al., 2011b; Kokotilo et al., 2009; Solstrand Dahlberg et al., 2018) or decreased (Hotz-Boendermaker et al., 2008), volumes of activation have been shown to be reduced (Cramer et al., 2005; Hotz-Boendermaker et al., 2008), activation was found in somatotopically nonmatched cortical sites (Freund et al., 2011b), and activation was poorly modulated when patients switched from attempted to imagined movements (Cramer et al., 2005). These observations have therefore mostly been attributed to abnormal and/or disorganised processing induced by the SCI.
It remains possible though that, despite certain aspects of sensorimotor activity being altered after SCI, somatotopically typical representations of the paralysed and sensory deprived body parts can be preserved (e.g., finger somatotopy of affected hand). Such preserved representations have the potential to be exploited in a functionally meaningful manner (e.g., via neuroprosthetics).

      Case studies using intracortical stimulation in the S1 hand area to elicit finger sensations in SCI patients hint at such preserved somatotopic representations (Fifer et al., 2020; Flesher et al., 2016), with one exception (Armenta Salas et al., 2018). Negative results were suggested to be due to a loss of hand somatotopy and/or reorganisation in S1 of the implanted SCI patient or due to potential misplacement of the implant (Armenta Salas et al., 2018). Whether fine-grained somatotopy is generally preserved in the tetraplegic patient population remains unknown. It is also unclear what clinical, behavioural, and structural spinal cord determinants may influence such representations to be maintained. Here we used functional MRI (fMRI) and a visually cued (attempted) finger movement task in tetraplegic patients to examine whether hand somatotopy is preserved following a disconnection between the brain and the periphery. We instructed patients to perform the fMRI tasks with their most impaired upper limb and matched controls’ tested hands to patients’ tested hands. If a patient was unable to make overt finger movements due to their injury, then we carefully instructed them to make attempted (i.e., not imagined) finger movements. To see whether patients’ maps exhibited characteristics of somatotopy, we visualised finger selectivity in S1 using a travelling wave approach. To investigate whether fine-grained hand somatotopy was preserved and could be activated in S1 following SCI, we assessed inter-finger representational distance patterns using representational similarity analysis (RSA). These inter-finger distance patterns are thought to be shaped by daily life experience such that fingers used more frequently together in daily life have lower representational distances (Ejaz et al., 2015).
RSA-based inter-finger distance patterns have been shown to depict the invariant representational structure of fingers in S1 and M1 better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). Over the past years RSA has therefore regularly been used to investigate somatotopy of finger representations both in healthy (e.g., Akselrod et al., 2017; Ariani et al., 2020; Ejaz et al., 2015; Gooijers et al., 2021; Kieliba et al., 2021; Kolasinski et al., 2016; Liu et al., 2021; Sanders et al., 2019) and patient populations (e.g., Dempsey-Jones et al., 2019; Ejaz et al., 2016; Kikkert et al., 2016; Wesselink et al., 2019). We closely followed procedures that have previously been used to map preserved and typical somatotopic finger selectivity and inter-finger representational distance patterns of amputees’ missing hands in S1 using volitional phantom finger movements (Kikkert et al., 2016; Wesselink et al., 2019). However, in amputees, these movements generally recruit the residual arm muscles that used to control the missing limb via intact connections between the brain and spinal cord. Whether similar preserved somatotopic mapping can be observed in SCI patients with diminished or no connections between the brain and the periphery is unclear. If finger somatotopy is preserved in tetraplegic patients, then we should find typical inter-finger representational distance patterns in the S1 hand area of these patients. By measuring a group of fourteen chronic tetraplegic patients with varying amounts of spared spinal cord tissue at the lesion level (quantified by means of midsagittal tissue bridges based on sagittal T2w scans), we uniquely assessed whether preserved connections between the brain and periphery are necessary to preserve fine somatotopic mapping in S1 (Huber et al., 2017; Pfyffer et al., 2019). 
If spared connections between the periphery and the brain are not necessary for preserving hand somatotopy, then we would find typical inter-finger representational distance patterns even in patients without spared spinal tissue bridges. We also investigated what clinical and behavioural determinants may contribute to preserving S1 hand somatotopy after chronic SCI. If spared sensorimotor hand function is not necessary for preserving hand somatotopy, then we would find typical inter-finger representational distance patterns even in patients who suffer from full sensory loss and paralysis of the hand(s).”

      They do not in detail report which results actually have been reported before, which is a major problem, because those prior results should have motivated the analyses the authors conducted. For instance, two of the cited studies found that in SCI patients, only ONE FINGER shifted towards the malfunctioning area (i.e., the small finger) whereas all other fingers were the same. However, the authors do NOT perform single finger analyses but always average their results ACROSS fingers. This is even true in spite of some patients indeed showing MISSING FINGERS as is clearly evident in the figure, and in spite of the clearly reduced distance of the thumb in the patients as is also visible in another figure. Nothing of this is seen in the results, because the ANOVA and analyses never have the factor of "finger". Instead, the authors always average the analyses across finger. The conclusion that the maps do not differ is therefore not justified at present. This severely reduces any conclusions that can be drawn from the data at present.

      We apologise for the lack of clarity. We have now added additional detail regarding studies showing altered sensorimotor processing following SCI. We also clarified that we based our analysis steps on previous studies investigating hand somatotopy following deafferentation (i.e., following arm amputation; Kikkert et al., 2016; Wesselink et al., 2019) and somatotopic reorganisation. RSA-based inter-finger distance patterns have been shown to depict the invariant representational structure of fingers in S1 and M1 better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). Over the past years RSA has therefore regularly been used to investigate somatotopy of finger representations both in healthy (e.g., Akselrod et al., 2017; Ariani et al., 2020; Ejaz et al., 2015; Gooijers et al., 2021; Kieliba et al., 2021; Kolasinski et al., 2016; Liu et al., 2021; Sanders et al., 2019) and patient populations (e.g., Dempsey-Jones et al., 2019; Ejaz et al., 2016; Kikkert et al., 2016; Wesselink et al., 2019). It is believed to be the most appropriate measure to reliably detect subtle changes in somatotopy. We adjusted the text in our revised Introduction section to better highlight this.

      Please note that we do not average across fingers in our RSA typicality procedure. Instead, RSA considers how the (attempted) movement with one finger changes the activity pattern across the whole hand representation. Note that somatotopic reorganisation will change the inter-finger distance measured with this method as previously shown (Kieliba et al., 2021; Kolasinski et al., 2016; Wesselink et al., 2019).
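To make the point about finger pairs concrete: with five fingers, an RSA-style analysis yields one distance per finger pair, ten in total, rather than a single value averaged across fingers. The sketch below is only an illustration under stated assumptions. It uses plain Euclidean distance on made-up activity patterns, which may differ from the distance measure used in the study, and every name and number in it is hypothetical.

```python
from itertools import combinations
import math

FINGERS = ["thumb", "index", "middle", "ring", "little"]


def inter_finger_distances(patterns):
    """Return {(finger_a, finger_b): distance} for every finger pair.

    `patterns` maps each finger to its activity pattern (one value per voxel).
    Plain Euclidean distance is an illustrative stand-in, not the study's metric.
    """
    distances = {}
    for a, b in combinations(FINGERS, 2):
        distances[(a, b)] = math.dist(patterns[a], patterns[b])
    return distances


# Hypothetical toy patterns over four voxels (not real data).
toy_patterns = {
    "thumb":  [1.0, 0.2, 0.0, 0.0],
    "index":  [0.6, 0.8, 0.1, 0.0],
    "middle": [0.1, 0.7, 0.6, 0.1],
    "ring":   [0.0, 0.2, 0.8, 0.5],
    "little": [0.0, 0.0, 0.4, 1.0],
}

rdm = inter_finger_distances(toy_patterns)
for pair, distance in rdm.items():
    print(pair, round(distance, 3))
```

In this toy layout, neighbouring fingers share more of their pattern than distant fingers, so the thumb-index distance comes out smaller than the thumb-little distance. Each of the ten pairs keeps its own value, which is what allows a finger pair factor in a later group comparison.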

      Still, as per the reviewer’s suggestion, we conducted a robust mixed ANOVA on the RSA distance measures with a within-subjects factor for finger pair (10 levels) and a between-subjects factor for group (2 levels: controls and SCI patients). We did not find a significant group effect (F(1,21.66) = 1.50, p = 0.23). There was a significant difference in distance between finger pairs (F(9,15.38) = 27.22, p < 0.001), but this was not significantly different between groups (i.e., no significant finger pair by group interaction; F(9,15.38) = 1.05, p = 0.45). When testing for group differences per finger pair, the BF only revealed inconclusive evidence (BF > 0.37 and < 1.11; note that we could not run a Bayesian ANOVA due to normality violations). We have added this analysis to the revised manuscript.

      Lastly, we would like to highlight that our argument is that the finger maps can be preserved in the absence of sensory and motor function, but over time they deteriorate and become less somatotopic. As such, we do not aim to state that they are unchanged overall – but rather that they can be unchanged even despite loss of sensory and motor function. We have clarified this in our abstract and manuscript to avoid confusion.

      Revised abstract:

      “Previous studies showed reorganised and/or altered activity in the primary sensorimotor cortex after a spinal cord injury (SCI), suggested to reflect abnormal processing. However, little is known about whether somatotopically-specific representations can be preserved despite alterations in net activity. In this observational study we used functional MRI and an (attempted) finger movement task in tetraplegic patients to characterise the somatotopic hand layout in primary somatosensory cortex. We further used structural MRI to assess spared spinal tissue bridges. We found that somatotopic hand representations can be preserved in the absence of sensory and motor hand functioning and spared spinal tissue bridges. Such preserved hand somatotopy could be exploited by rehabilitation approaches that aim to establish new hand-brain functional connections after SCI (e.g., neuroprosthetics). However, over years since SCI the hand representation somatotopy deteriorated, suggesting that somatotopic hand representations are more easily targeted within the first years after SCI.”

      Revised text Methods:

      “Second, we tested whether the inter-finger distances were different between controls and patients using a robust mixed ANOVA with a within-participants factor for finger pair (10 levels) and a between-participants factor for group (2 levels: controls and patients).”

      Revised text Results:

      “We then tested whether the inter-finger distances were different across finger pairs between controls and SCI patients using a robust mixed ANOVA with a within-participants factor for finger pair (10 levels) and a between-participants factor for group (2 levels: controls and patients). We did not find a significant difference in inter-finger distances between patients and controls (F(1,21.66) = 1.50, p = 0.23). The inter-finger distances were significantly different across finger pairs, as would be expected based on somatotopic mapping (F(9,15.38) = 27.22, p < 0.001). This pattern of inter-finger distances was not significantly different between groups (i.e., no significant finger pair by group interaction; F(9,15.38) = 1.05, p = 0.45). When testing for group differences per finger pair, the BF only revealed inconclusive evidence (BF > 0.37 and < 1.11; note that we could not run a Bayesian ANOVA due to normality violations).”

      Revised text Discussion:

      “In this study we investigated whether hand somatotopy is preserved and can be activated through attempted movements following tetraplegia. We tested a heterogenous group of SCI patients to examine what clinical, behavioural, and structural spinal cord determinants contribute to preserving S1 somatotopy. Our results revealed that detailed hand somatotopy can be preserved following tetraplegia, even in the absence of sensory and motor function and a lack of spared spinal tissue bridges. However, over time since SCI these finger maps deteriorated such that the hand somatotopy became less typical.”

      • Concept 2: This also relates to the fact that the most prominent and consistent finding of prior studies was to show changes in map AMPLITUDE in the maps of patients. It is not clear to me how amplitude was measured here, because the text says "average BOLD activity". What should be reported are standard measures of signal amplitude both across the map area and for individual fingers.

      We apologise for the lack of clarity. “Average BOLD activity” represented the average z-standardised activity within the S1 hand ROI. To comply with the reviewer’s comment, we adjusted this to the percent signal change underneath the S1 hand ROI and report this instead in our revised manuscript and in revised Figure 3A and revised Figure 3-Figure supplement 1. Note that results were unchanged.

      As per the reviewer’s suggestion, we further extracted the activity levels for individual fingers under finger-specific ROIs. To create finger-specific ROIs, probability finger maps were created based on the travelling wave data of the control group, thresholded at 25% (i.e., meaning that at least 5 out of 18 control participants needed to significantly activate a vertex for this vertex to be included in the ROI), and binarised. We then used the separately acquired blocked design data to extract the corresponding finger movement activity levels underlying these finger-specific ROIs per participant. Per ROI, we then compared the activity level between groups. After correction for multiple comparisons, there was no significant difference between groups for the thumb (U = 93, p = 0.37), index (t(30) = -0.003, p = 0.99), middle (t(30) = 1.11, p = 0.35), ring (t(30) = 2.02, p = 0.13), or little finger (t(30) = 2.14, p = 0.20). We have added this analysis to Appendix 1.
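The thresholding step described above can be sketched as follows. This is a minimal illustration, assuming each control participant contributes a per-vertex True/False significance map for a given finger; the function names and toy data are hypothetical and do not reproduce the actual surface-based pipeline.

```python
import math


def finger_roi(significant, threshold=0.25):
    """Binarise a group probability map for one finger.

    `significant[p][v]` is True when control participant p significantly
    activated vertex v. A vertex enters the ROI when the proportion of
    participants activating it reaches `threshold`.
    """
    n_participants = len(significant)
    n_vertices = len(significant[0])
    roi = []
    for v in range(n_vertices):
        count = sum(1 for p in range(n_participants) if significant[p][v])
        roi.append(count / n_participants >= threshold)
    return roi


def min_participants(n_participants, threshold=0.25):
    """Smallest number of participants that satisfies the threshold."""
    return math.ceil(n_participants * threshold)


# Hypothetical toy data: 18 controls, 3 vertices.
# Vertex 0 is activated by 5 controls, vertex 1 by 4, vertex 2 by all 18.
toy_maps = [[p < 5, p < 4, True] for p in range(18)]

print(min_participants(18))  # participants needed at a 25% threshold
print(finger_roi(toy_maps))
```

With 18 controls, 25% corresponds to 4.5 participants, so a vertex needs at least 5 controls to be included, which matches the figure stated in the text.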

      Note that lower or higher BOLD amplitude levels do not influence our typicality scores per se. Indeed, typical inter-finger representational patterns have been shown to persist even in ipsilateral M1 that exhibited a negative BOLD response during finger movements (Berlot et al., 2019). As long as the typical inter-finger relationships are preserved, brain areas that have low amplitudes of activity can have a typical somatotopic representation.

      Revised text in Methods:

      "The percent signal change for overall task-related activity was then extracted for voxels underlying this S1 hand ROI per participant. A similar analysis was used to investigate overall task-related activity in an M1 hand ROI (see Figure 3-Figure supplement 1). We further compared activity levels in finger-specific ROIs in S1 between groups and conducted a geodesic distance analysis to assess whether the finger representations of the SCI patients were aligned differently and/or shifted compared to the control participants (see Appendix 1)."

      Revised text in Results:

      “Task-related activity was quantified by extracting the percent signal change for finger movement (across all fingers) versus baseline within the contralateral S1 hand ROI (see Figure 3A). Overall, all patients were able to engage their S1 hand area by moving individual fingers (t(13)=7.46, p < 0.001; BF10=4.28e+3), as did controls (t(17)=9.92, p < 0.001; BF10=7.40e+5). Furthermore, patients’ task-related activity was not significantly different from controls (t(30)=-0.82, p=0.42; BF10=0.44), with the BF showing anecdotal evidence in favour of the null hypothesis.”

      Revised Appendix 1:

      “Percent signal change in finger-specific clusters

      To assess whether finger movement activity levels were different between patients and controls, we created finger-specific ROIs and extracted the activity level of the corresponding finger movement for each participant. To create the finger-specific ROIs, the probability finger surface maps that were created from the travelling wave data of the control group (see main manuscript) were thresholded at 25% (i.e., at least 5 out of 18 control participants needed to significantly activate a vertex for this vertex to be included in the ROI), and binarised. We then used the separately acquired blocked design data to extract the finger movement activity levels underlying these finger-specific ROIs. We first flipped the contrast images resulting from each participant’s fixed effects analysis (i.e., that was run to average across the 4 blocked design runs) along the x-axis for the left-hand tested participants. Each participant’s contrast maps were then resampled to the Freesurfer 2D average atlas and the averaged z-standardised activity level was extracted for each finger movement vs rest contrast underlying the finger-specific ROIs. We compared the activity levels for each finger movement in the corresponding finger ROI (i.e., thumb movement activity in the thumb ROI, index finger movement activity in the index finger ROI, etc.) between groups. After correction for multiple comparisons, there was no significant difference between groups for the thumb (U = 93, p = 0.37), index (t(30) = -0.003, p = 0.99), middle (t(30) = 1.11, p = 0.35), ring (t(30) = 2.02, p = 0.13), or little finger (t(30) = 2.14, p = 0.20).”

      Appendix 1-Figure 1: Finger-specific activity levels in finger-specific regions of interest. A) Finger-specific ROIs were based on the control group’s binarised 25% probability travelling wave finger selectivity maps. B) Finger movement activity levels in the corresponding finger-specific ROIs. There were no significant differences in activity levels between the SCI patient and control groups. Controls are projected in grey; SCI patients are projected in orange. Error bars show the standard error of the mean. White arrows indicate the central sulcus. A = anterior; P = posterior.

      • Concept 3: The authors present a hypothesis on the underlying mechanisms of SCI that does not seem to reflect prior data. The argument is that changes in map alignment relate to maladaptive changes and pain. However, the literature that the authors cite does not support this claim. In fact, Freund 2011 promotes the importance of map amplitude but not alignment, whereas other studies either show no relation of activation to pain, or they even show that map shift relates to LESS pain, i.e., the reverse argument than what the authors say. My impression is that the model that the authors present is mainly a model that is used for phantom pain but not for SCI. Taking this into consideration, the findings the authors present are not surprising anymore, because in fact none of these studies claimed that the affected area should be absent in SCI patients; these papers only say that the other body parts change in location or amplitude, which is something the authors did not measure. It is important to make this clear in the text.

      As the reviewer states, the literature is debated regarding the relationship between reorganisation and pain in SCI patients. We did not highlight this clearly enough. To improve clarity and focus our message we have therefore removed the sentence regarding reorganisation and pain from the Introduction of our revised manuscript. Also taking comment 2.1 and 2.2 into consideration, we have restructured our Introduction.

      We respectfully disagree with the reviewer that our results are not novel or surprising. Whether the full fine-grained hand somatotopy is preserved following a complete motor and sensory loss through tetraplegia has not been considered before. Furthermore, to our knowledge, there is no paper that has inspected the full somatotopic layout in a heterogenous sample of SCI patients and shown that over time since injury, hand somatotopy deteriorates. We indeed cannot make claims regarding the reorganisation in S1 with regards to neighbouring cortical areas activating the hand area, as we have now clarified further in the revised Discussion. We now also clarify in our Discussion that our result does not exclude the possibility of reorganisation occurring simultaneously and that this is a topic for further investigation. As described in the Discussion, it is very possible that reorganisation and preserved somatotopy could co-occur.

      Revised text Discussion:

      “We did not probe body parts other than the hand and could therefore not investigate whether any remapping of other (neighbouring and/or intact) body part representations towards or into the deprived S1 hand cortex may have taken place. Whether reorganisation and preservation of the original function can simultaneously take place within the same cortical area therefore remains a topic for further investigation. It is possible that reorganisation and preservation of the original function could co-occur within cortical areas. Indeed, non-human primate studies demonstrated that remapping observed in S1 actually reflects reorganisation in subcortical areas of the somatosensory pathway, principally the brainstem (Chand and Jain, 2015; Kambi et al., 2014). As such, the deprived S1 area receives reorganised somatosensory inputs upon tactile stimulation of neighbouring intact body parts. This would simultaneously allow the original S1 representation of the deprived body part to be preserved, as observed in our results when we directly probed the deprived S1 hand area through attempted finger movements.”

      • Concept 4: There is yet another more general point on the concept and related hypotheses: Why do the authors assume that immediately after SCI the finger map should disappear? This seems to me the more unlikely hypothesis compared to what the data seem to suggest: preservation and deterioration over time. In my view, there is no biological model that would suggest that a finger map suddenly disappears after input loss. How should this deterioration be mediated? By cellular loss? As already stated above, the finding is therefore much less surprising than the authors argue.

      We did not expect that finger maps would disappear, especially given the case studies using S1 intracortical stimulation studies in SCI patients and the result of preserved somatotopy of the missing hand in amputees. We are not sure which part of the manuscript might have caused this misunderstanding.

      With regards to the reviewer’s comment that there are no models to suggest that finger maps would disappear: there is competing research on this as we now explain in our revised Introduction. Non-human primate research has shown that the S1 hand area becomes largely unresponsive to tactile hand stimulation after an SCI (Jain et al., 2008; Kambi et al., 2014; Liao et al., 2021). The surviving finger-related activity was shown to be disorganised such that a few somatotopically appropriate sites but also other somatotopically nonmatched sites were activated (Liao et al., 2021). These finger areas in S1 became responsive to touch on the face. Furthermore, TMS studies that induce current in localised areas of M1 to induce a peripheral muscle response in SCI patients have shown that representations of more impaired muscles retract or are absent (Fassett et al., 2018; Freund et al., 2011a; Levy et al., 1990; Streletz et al., 1995; Topka et al., 1991; Urbin et al., 2019). We do not believe that this indicates that the S1 hand somatotopy is lost, but rather that tactile inputs and motor outputs no longer pass the level of injury. Indeed, non-human primate work showing immutable myelin borders between finger representations in S1 post SCI suggests that a latent hand representation may be preserved. Further hints for such preserved somatotopy come from fMRI studies showing net sensorimotor activity during attempted movements with the paralysed body part, intracortical stimulation studies in SCI patients, and preserved somatotopic maps of the missing hand in amputees. We have restructured our Introduction accordingly, also taking into consideration comments 2.1, 2.2, and 2.4.

      • Methods & Results. The authors refer to an analysis that they call "typicality" where they say that they assess how "typical" a finger map is. Given this is not a standard measure, I was wondering how the authors decided what a "typical" finger map is. In fact, there are a few papers published on this issue where the average location of each finger in a large number of subjects is detailed. Rather than referring to this literature, the authors use another dataset from another study of themselves that was conducted on n=8 individuals and using 7T MRI (note that in the present study, 3T MRI was used) to define what "typical" is. This approach is not valid. First, this "typical" dataset is not validated for being typical (i.e., it is not compared with standard atlases on hand and finger location), second, it was assessed using a different MRI field strength, third, it had too few subjects to say that this should be a typical dataset, fourth, the group differed from the patients in terms of age and gender (i.e., non-matched group), and fifth, the authors even say that the design was different ("was defined similarly", i.e., not the same). This approach is therefore in my view not valid, particularly given the authors measured age- and gender-matched controls that should be used to compare the maps with the patients. This is a critical point because changes in typicality is the main result of the paper.

      We respectfully disagree with the reviewer that the typicality measure is not standard, invalid, and inaccurate. RSA-based inter-finger overlap patterns have been shown to depict the invariant representational structure of fingers better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). RSA-based inter-finger representation measures have been shown to have more within-subject stability (both within the same session and between sessions that were 6 months apart) and less inter-subject variability (Ejaz et al., 2015) than these other measures of somatotopy. RSA-based measures are furthermore not prone to some of the problems of measurements of finger selectivity (e.g., dependence on map thresholds). Indeed, over the past years RSA has become the gold standard to investigate somatotopy of finger representations both in healthy (e.g., Akselrod et al., 2017; Ariani et al., 2020; Ejaz et al., 2015; Gooijers et al., 2021; Kieliba et al., 2021; Kolasinski et al., 2016; Liu et al., 2021; Sanders et al., 2019) and patient populations (e.g. Dempsey-Jones et al., 2019; Ejaz et al., 2016; Kikkert et al., 2016; Wesselink et al., 2019). Moreover, various papers have been published in eLife and elsewhere that used the same RSA-based typicality criteria to assess plasticity in finger representations (Dempsey-Jones et al., 2019; Ejaz et al., 2015; Kieliba et al., 2021; Wesselink et al., 2019). We now highlight this in the revised Introduction.

      The canonical RDM used in our study has previously been used as a canonical RDM in a 3T study exploring finger somatotopy in amputees (Wesselink et al., 2019) and was made available to us (note that we did not collect this data ourselves). We aimed to use similar measures as in Wesselink et al (2019) and therefore felt it was most appropriate to use the same canonical RDM. One of the strengths of RSA is it can be used to quantitatively relate brain activity measures obtained using different modalities, across different species, brain areas, brain and behavioural measures etc. (Kriegeskorte et al., 2008). As such, the fact that this canonical RDM was constructed based on data collected using 7T fMRI using a digit tapping task should not influence our results. We however agree with the reviewer it is good to demonstrate that our results would not change when using a canonical RDM based on the average RDM of our age-, sex- and handedness matched control group. We therefore recalculated the typicality of all participants using the controls’ average RDM as the canonical RDM. We found a strong and highly significant correlation in typicality scores calculated using the canonical RDM from the independent dataset and the controls’ average RDM (see figure below). This was true for both the patient (rs = 0.92, p < 0.001; red dots) and control groups (rs = 0.78, p < 0.001; grey dots).

      We then repeated all analyses using these newly calculated typicality scores. As expected, we found the same results as when using a canonical RDM based on the independent dataset (see below for details). This analysis has been added to the revised Appendix 1 and is referred to in the main manuscript.
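
      As an illustration of the typicality score discussed in this response, the following is a minimal sketch, not the authors' analysis code. It assumes a symmetric 5 x 5 inter-finger RDM per participant and computes the Spearman correlation between its 10 off-diagonal distances and those of a canonical RDM. The function names and the toy RDMs are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def upper_triangle(rdm):
    """Off-diagonal finger-pair distances (10 values for 5 fingers)."""
    return [rdm[i][j] for i, j in combinations(range(len(rdm)), 2)]


def typicality(participant_rdm, canonical_rdm):
    """Spearman correlation between participant and canonical distances."""
    x = average_ranks(upper_triangle(participant_rdm))
    y = average_ranks(upper_triangle(canonical_rdm))
    return pearson(x, y)


# Toy canonical RDM (hypothetical): distance grows with finger separation.
canonical = [[abs(i - j) for j in range(5)] for i in range(5)]
# A participant whose distances are a monotonic transform of the canonical ones.
participant = [[abs(i - j) ** 2 for j in range(5)] for i in range(5)]
score = typicality(participant, canonical)
```

      Because the score is rank-based, swapping the canonical RDM for another one with a similar ordering of finger-pair distances (for example, a control-group average) would be expected to give closely related scores, which is the check the authors report.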

      Revised text Introduction:

      “To investigate whether fine-grained hand somatotopy was preserved and could be activated in S1 following SCI, we assessed inter-finger representational distance patterns using representational similarity analysis (RSA). These inter-finger distance patterns are thought to be shaped by daily life experience such that fingers used more frequently together in daily life have lower representational distances (Ejaz et al., 2015). RSA-based inter-finger distance patterns have been shown to depict the invariant representational structure of fingers in S1 and M1 better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). Over the past years RSA has therefore regularly been used to investigate somatotopy of finger representations both in healthy (e.g., Akselrod et al., 2017; Ariani et al., 2020; Ejaz et al., 2015; Gooijers et al., 2021; Kieliba et al., 2021; Kolasinski et al., 2016; Liu et al., 2021; Sanders et al., 2019) and patient populations (e.g., Dempsey-Jones et al., 2019; Ejaz et al., 2016; Kikkert et al., 2016; Wesselink et al., 2019). We closely followed procedures that have previously been used to map preserved and typical somatotopic finger selectivity and inter-finger representational distance patterns of amputees’ missing hands in S1 using volitional phantom finger movements (Kikkert et al., 2016; Wesselink et al., 2019).”

      Revised text Results:

      “This canonical RDM was based on 7T finger movement fMRI data in an independently acquired cohort of healthy controls (n = 8). The S1 hand ROI used to calculate this canonical RDM was defined similarly as in the current study (see Wesselink and Maimon-Mor, (2017b) for details). Note that results were unchanged when calculating typicality scores using a canonical RDM based on the averaged RDM of the age-, sex-, and handedness-matched control group tested in this study (see Appendix 1).”

      Revised text Methods:

      “While the traditional traveling wave approach is powerful to uncover the somatotopic finger arrangement, a fuller description of hand representation can be obtained by taking into account the entire fine-grained activity pattern of all fingers. RSA-based inter-finger overlap patterns have been shown to depict the invariant representational structure of fingers better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). RSA-based measures are furthermore not prone to some of the problems of measurements of finger selectivity (e.g., dependence on map thresholds).”

      “Third, we estimated the somatotopic typicality (or normality) of each participant’s RDM by calculating a Spearman correlation with a canonical RDM. We followed previously described procedures for calculating the typicality score (Dempsey-Jones et al., 2019; Ejaz et al., 2015; Kieliba et al., 2021; Wesselink et al., 2019). The canonical RDM was based on 7T finger movement fMRI data in an independently acquired cohort of healthy controls (n = 8). The S1 hand ROI used to calculate this canonical RDM was defined similarly as in the current study (see Wesselink and Maimon-Mor, (2017b) for details). Note that results were unchanged when calculating typicality scores using a canonical RDM based on the averaged RDM of the sex-, handedness-, and age-matched control group tested in this study (see Appendix 1).”

      Revised text Appendix 1:

      “Typicality analysis using a canonical RDM based on the controls’ average RDM

      To ensure that our typicality results did not change when using a canonical inter-finger RDM based on the age-, sex-, and handedness matched subjects tested in this study, we recalculated the typicality scores of all participants using the averaged inter-finger RDM of our control sample as the canonical RDM. We found a strong and highly significant correlation between the typicality scores calculated using the canonical inter-finger RDM from the independent dataset (reported in the main manuscript) and the typicality scores calculated using our controls’ average RDM. This was true for both the SCI patient (rs = 0.92, p < 0.001) and control groups (rs = 0.78, p < 0.001).

      We then repeated all typicality analyses reported in the main manuscript. As expected, using the typicality scores calculated using our controls’ average RDM we found the same results as when using the canonical inter-finger RDM from the independent dataset: There was a significant difference in typicality between SCI patients, healthy controls, and congenital one-handers (H(2)=27.61, p < 0.001). We further found significantly higher typicality in controls compared to congenital one-handers (U=0, p < 0.001; BF10=76.11). Importantly, the typicality scores of the SCI patients were significantly higher than the congenital one-handers (U=2, p < 0.001; BF10=50.98), but not significantly different from the controls (U=94, p=0.24; BF10=0.55). Number of years since SCI significantly correlated with hand representation typicality (rs=-0.54, p=0.05) and patients with more retained GRASSP motor function of the tested upper limb had more typical hand representations in S1 (rs=0.58, p=0.03). There was no significant correlation between S1 hand representation typicality and GRASSP sensory function of the tested upper limb, spared midsagittal spinal tissue bridges at the lesion level, or cross-sectional spinal cord area (rs=0.40, p=0.15, rs=0.50, p=0.10, and rs=0.48, p=0.08, respectively). An exploratory stepwise linear regression analysis revealed that years since SCI significantly predicted hand representation typicality in S1 with R2=0.33 (F(1,10)=4.98, p=0.05). Motor function, sensory function, spared midsagittal spinal tissue bridges at the lesion level, and spinal cord area did not significantly add to the prediction (t=1.31, p=0.22, t=1.62, p=0.14, t=1.70, p=0.12, and t=1.09, p=0.30, respectively).”

      • Methods & Results: The authors make a few unproven claims, such as saying "generally, the position, order of finger preference, and extent of the hand maps were qualitatively similar between patients and control". There are no data to support these claims.

      As indicated in this sentence, this claim was based on a qualitative inspection of the finger maps in Figure 2, and we indeed do not support it with quantitative analysis. We have therefore removed this sentence from the revised manuscript and instead say, as per the suggestion of reviewer 1, that overall, there were aspects of somatotopic finger selectivity in the SCI patients’ hand maps.

      Revised text Results:

      “Overall, we found aspects of somatotopic finger selectivity in the maps of SCI patients’ hands, in which neighbouring clusters showed selectivity for neighbouring fingers in contralateral S1, similar to those observed in eighteen age-, sex-, and handedness-matched healthy controls (see Figure 2A&B). A characteristic hand map shows a gradient of finger preference, progressing from the thumb (red, laterally) to the little finger (pink, medially). Notably, a characteristic hand map was even found in a patient who suffered complete paralysis and sensory deprivation of the hands (Figure 2, patient map 1; patient S01). Despite most maps (Figure 2, except patient map 3) displaying aspects of characteristic finger selectivity, some finger representations were not visible in the thresholded patient and control maps.”

      • Methods & Results: The authors argue that the map architecture is topographic as soon as the dissimilarity between two different fingers is above 0. First, what I am really wondering about is why the authors do not provide the exact dissimilarity values in the text but only give the stats for the difference to 0 (t-value, p-value, Bayes factor). Were the dissimilarity values perhaps very low? The values should be reported. Also, when this argument that maps are topographic as long as the value of two different fingers is above 0 should hold, then the authors have to show that the value for mapping the SAME finger is indeed 0. Otherwise, this argument is not convincing.

      We would like to clarify that a representation is not per se topographic when the RSA dissimilarity is > 0. The dissimilarity value provided by RSA indicates the extent to which a pair of conditions is distinguished – it can be viewed as encapsulating the information content carried by the region (Kriegeskorte et al., 2008). Due to cross-validation across runs, the expected distance value would be zero (but can go below 0) if two conditions’ activity patterns are not statistically different from each other, and larger than zero if there is differentiation between the conditions (fingers’ activity patterns in the S1 hand area in our case; Kriegeskorte et al., 2008; Nili et al., 2014). The diagonal of the RDM reflects comparisons between the same fingers, i.e., distances between the exact same activity pattern in the same run, and is thus 0 by definition (Kriegeskorte et al., 2008; Nili et al., 2014). This was also the case in our individual participant RDMs. Since this is not a meaningful value (a distance between 2 identical activity patterns will always be 0) we chose not to report this. We have clarified the meaning of the separability measure in the revised Methods section.

      To investigate whether a representation is somatotopic, we have to take into account the full fine-grained inter-finger distance pattern. The full fine-grained inter-finger distance pattern is related to everyday use of our hand and has been shown to depict the invariant representational structure of fingers better than the size, shape, and exact location of the areas activated by finger movements (Ejaz et al., 2015). To determine whether a participant’s inter-finger distance pattern is somatotopic one should associate it to a canonical RDM – which is done in the typicality analysis (see also our response to comment 2.6).

      What can be done to demonstrate the validity of an ROI is to run RSA on a control ROI where one would not expect to find activity that is distinguishable between finger conditions. Rather than comparing the separability measure against 0, one can then compare the separability of the ROI that is expected to contain this information to that of the control ROI. We created a cerebral spinal fluid (CSF) ROI, repeated our RSA analysis in this ROI, and then compared the separability of the CSF and S1 hand area ROIs. As expected, there was a significant difference between separability (or representation strength) in the S1 hand area and CSF ROIs for both controls (W=171, p < 0.001; BF10=4059) and patients (W=105, p < 0.001; BF10=279). This analysis has been added to the revised manuscript.

      Individual participant separability values (i.e., distances averaged across fingers) are visualised in Figure 3D. Following the reviewer’s suggestion, we have included individual participant inter-finger distance plots for both the controls and SCI patients as Figure 3-figure supplement 2 and Figure 3-figure supplement 3, respectively. The inter-finger distances for each finger pair and subject can be extracted from this. We feel this is more readily readable and interpretable than a table containing the 10 inter-finger distance scores for all 32 participants. These values have instead been made available online, together with our other data, on https://osf.io/e8u95/.
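
      The point that a cross-validated distance has an expected value of 0 when two conditions are indistinguishable, and can fall below 0, can be illustrated with a minimal sketch. It assumes a cross-validated squared Euclidean distance computed across pairs of different runs and omits the noise normalisation that is typically applied in practice. The function names and simulation settings are hypothetical and do not reproduce the authors' pipeline.

```python
import random


def crossval_distance(runs_a, runs_b):
    """Cross-validated squared Euclidean distance between two conditions.

    runs_a, runs_b: one voxel pattern (list of floats) per run.
    The pattern difference from one run is multiplied with the difference
    from a different run, so independent noise averages out to 0.
    """
    diffs = [[a - b for a, b in zip(pa, pb)] for pa, pb in zip(runs_a, runs_b)]
    n_runs, n_vox = len(diffs), len(diffs[0])
    total, count = 0.0, 0
    for m in range(n_runs):
        for n in range(n_runs):
            if m != n:  # never pair a run with itself
                total += sum(x * y for x, y in zip(diffs[m], diffs[n])) / n_vox
                count += 1
    return total / count


def mean_null_distance(n_sims=200, n_runs=4, n_vox=50, seed=0):
    """Average distance when both conditions are pure noise (no true difference)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        runs_a = [[rng.gauss(0, 1) for _ in range(n_vox)] for _ in range(n_runs)]
        runs_b = [[rng.gauss(0, 1) for _ in range(n_vox)] for _ in range(n_runs)]
        total += crossval_distance(runs_a, runs_b)
    return total / n_sims


# A consistent pattern difference across runs gives a positive distance.
consistent = crossval_distance([[1, 2], [1, 2]], [[0, 0], [0, 0]])
# Differences that flip sign between runs give a negative distance.
inconsistent = crossval_distance([[1, -1], [-1, 1]], [[0, 0], [0, 0]])
# With no true difference, the average over simulations is close to 0.
null_mean = mean_null_distance()
```

      A control ROI comparison then asks whether the distances in the ROI of interest exceed those in a region where the null case is expected to hold.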

      Revised text Methods:

      “If there is no information in the ROI that can statistically distinguish between the finger conditions, then due to cross-validation the expected distance measure would be 0. If there is differentiation between the finger conditions, the separability would be larger than 0 (Nili et al., 2014). Note that this does not directly indicate that this region contains topographic information, but rather that this ROI contains information that can distinguish between the finger conditions. To further ensure that our S1 hand ROI was activated distinctly for different fingers, we created a cerebral spinal fluid (CSF) ROI that would not contain finger specific information. We then repeated our RSA analysis in this ROI and statistically compared the separability of the CSF and S1 hand area ROIs.”

      Revised text Results:

      “We found that inter-finger separability in the S1 hand area was greater than 0 for patients (t(13) = 9.83, p < 0.001; BF10 = 6.77e+4) and controls (t(17) = 11.70, p < 0.001; BF10 = 6.92e+6), indicating that the S1 hand area in both groups contained information about individuated finger representations. Furthermore, for both controls (W = 171, p < 0.001; BF10 = 4059) and patients (W = 105, p < 0.001; BF10 = 279) there was significantly greater separability (or representation strength) in the S1 hand area than in a control cerebral spinal fluid ROI that would not be expected to contain finger specific information. We did not find a significant group difference in inter-finger separability of the S1 hand area (t(30) = 1.52, p = 0.14; BF10 = 0.81), with the BF showing anecdotal evidence in favour of the null hypothesis.”

      • Discussion. The authors argue that spared midsagittal spinal tissue bridges are not necessary because they were not predictive of hand representation typicality. First, the measure of typicality is questionable and should not be used to make general claims about the importance of structural differences. Second, given there were only n=14 patients included, one may question generally whether predictive modelling can be done with these data. This statement should therefore be removed.

      We would like to clarify that, like the reviewer, we do not believe that spared midsagittal spinal tissue bridges are unimportant. Indeed, a large body of our own research focuses on the importance of spared spinal tissue bridges in recovery of sensorimotor function and pain (Huber et al., 2017; Pfyffer et al., 2021, 2019; Vallotton et al., 2019). We have added a clarification sentence regarding the importance of tissue bridges with regards to recovery of function. We agree with the reviewer that given our limited sample size, it is difficult to make conclusive claims based on non-significant predictive modelling and correlational results. In the revised manuscript we therefore focus this statement (i.e., that sensory and motor hand function and tissue bridges are not necessary to preserve hand somatotopy) on our finding that two patients without spared tissue bridges at the lesion level and with complete or near complete loss of sensory and motor hand function had a highly typical hand representation. We present our predictive modelling results as being in line with this notion and added a word of caution that it is challenging to draw definite conclusions from non-significant predictive modelling and correlation results in such a limited sample size.

      With regards to the reviewer’s concern about the validity of the typicality measure – please see our detailed response to comment 2.6.

      Revised text Discussion:

      “Spared spinal cord tissue bridges can be found in most patients with a clinically incomplete injury, their width being predictive of electrophysiological information flow, recovery of sensorimotor function, and neuropathic pain (Huber et al., 2017; Pfyffer et al., 2021, 2019; Vallotton et al., 2019). However, in this study, spared midsagittal spinal tissue bridges at the lesion level and sensorimotor hand function did not seem necessary to maintain and activate a somatotopic hand representation in S1. We found a highly typical hand representation in two patients (S01 and S03) who did not have any spared spinal tissue bridges at the lesion level, a complete (S01) or near complete (S03) hand paralysis, and a complete (S01) or near complete loss (S03) of hand sensory function. Our predictive modelling results were in line with this notion and showed that these behavioural and structural spinal cord determinants were not predictive of hand representation typicality. Note however that our sample size was limited, and it is challenging to draw definite conclusions from non-significant predictive modelling results.”

      • Discussion. The authors say that hand representation is "preserved" in SCI patients. Perhaps it is better to be precise and to say that they are active during movement planning.

      We thank the reviewer for their suggestion and revised the Discussion accordingly.

      Revised text Discussion:

      "In this study we investigated whether hand somatotopy is preserved and can be activated through attempted movements following tetraplegia."

      "How may these representations be preserved over time and activated through attempted movements in the absence of peripheral information?"

      "Together, our findings indicate that in the first years after a tetraplegia, the somatotopic S1 hand representation is preserved and can be activated through attempted movements even in the absence of retained sensory function, motor function, and spared spinal tissue bridges."

      Reviewer #3:

      The demonstration that cortex associated with an amputated limb can be activated by other body parts after amputation has been interpreted as evidence that the deafferented cortex "reorganizes" and assumes a new function. However, other studies suggest that the somatotopic organization of somatosensory cortex in amputees is relatively spared, even when probed long after amputation. One possibility is that the stability is due to residual peripheral input. In this study, Kikkert et al. examine the somatotopic organization of somatosensory cortex in patients whose spinal cord injury has led to tetraplegia. They find that the somatotopic organization of the hand representation of somatosensory cortex is relatively spared in these patients. Surprisingly, the amount of spared sensorimotor function is a poor predictor of the stability of the patients' hand somatotopy. Nonetheless, the hand representation deteriorates over decades after the injury. These findings contribute to a developing story on how sensory representations are formed and maintained and provide a counterpoint to extreme interpretations of the "reorganization" hypothesis mentioned above. Furthermore, the stability of body maps in somatosensory cortex after spinal cord injury has implications for the development of brain-machine interfaces.

      I have only minor comments:

      1) Given the controversy in the field, the use of the phrase "take over the deprived territory" (line 45) is muddled. Perhaps a more nuanced exposition of this phenomenon is in order?

      We agree a more nuanced expression would be more appropriate. We have changed this sentence accordingly in the revised manuscript.

      Revised text Introduction:

      “Seminal research in nonhuman primate models of SCI has shown that this leads to extensive cortical reorganisation, such that tactile stimulation of cortically adjacent body parts (e.g. of the face) activated the deprived brain territory (e.g. of the hand; Halder et al., 2018; Jain et al., 2008; Kambi et al., 2014).”

      2) The statement that "results are mixed" regarding intracortical microstimulation of S1 is dubious. In only one case has the hand representation been mislocalized, out of many cases (several at CalTech, 3 at the University of Pittsburgh, one at Case Western, one at Hopkins/APL, and one at UChicago). Perhaps rephrase to "with one exception?"

      We agree that this sentence may give a wrong outlook on the literature and have changed the text per the reviewer’s suggestion.

      Revised text Introduction:

      “Case studies using intracortical stimulation in the S1 hand area to elicit finger sensations in SCI patients hint at such preserved somatotopic representations (Fifer et al., 2020; Flesher et al., 2016), with one exception (Armenta Salas et al., 2018).”

      3) The phrase "tetraplegic spinal cord injury" seems awkward.

      Thank you for highlighting this to us. We have corrected these instances in our revised manuscript to “tetraplegia”.

      4) The stability of the representation is attributed to efference copy from M1. While this is a fine speculation, somatosensory cortex is part of a circuit and is interconnected with many other brain areas, M1 being one. Perhaps the stability is maintained due to the position of somatosensory cortex within this circuit, and not solely by its relationship with M1? There seems to be an overemphasis of this hypothesis at the exclusion of others.

      Thank you for this comment. We agree we overemphasized the efference copy theory. We have adjusted this and now provide a more balanced description of potential circuits and interconnections that could maintain somatotopic representations after tetraplegia.

    1. Author Response

      Reviewer #1 (Public Review):

      It is well established that valuation and value-based decision-making is context-dependent. This manuscript presents the results of six behavioral experiments specifically designed to disentangle two prominent functional forms of value normalization during reward learning: divisive normalization and range normalization. The behavioral and modeling results are clear and convincing, showing that key features of choice behavior in the current setting are incompatible with divisive normalization but are well predicted by a non-linear transformation of range-normalized values.

      Overall, this is an excellent study with important implications for reinforcement learning and decision-making research. The manuscript could be strengthened by examining individual variability in value normalization, as outlined below.

      We thank the Reviewer for the positive appreciation of our work and for the very relevant suggestions. Please find our point-by-point answer below.

      There is a lot of individual variation in the choice data that may potentially be explained by individual differences in normalization strategies. It would be important to examine whether there are any subgroups of subjects whose behavior is better explained by a divisive vs. range normalization process. Alternatively, it may be possible to compute an index that captures how much a given subject displays behavior compatible with divisive vs. range normalization. Seeing the distribution of such an index could provide insights into individual differences in normalization strategies.

      Thank you for pointing this out; it is indeed true that there is some variability. To address this, and in line with the Reviewer’s suggestion, we extracted model attributions per participant on the individual out-of-sample log-likelihood, using the VBA_toolbox in Matlab (Daunizeau et al., 2014). In experiment 1 (presented in the main text), we found that the RANGE model accounted for 79% of the participants, while the DIVISIVE model accounted for 12%. The relative difference was even higher when including the RANGEω model in the model space: the RANGE and RANGEω models accounted for a total of 85% of the participants, while the DIVISIVE model accounted for only 5%.

      In experiment 2 (presented in the supplementary materials), the results were comparable (see Figure 3-figure supplement 3: 73% vs 10%, 83% vs 2%).

      To provide further insights into the behavioral signatures behind inter-individual differences, we plotted the transfer choice rates for each group of participants (best explained by the RANGE, DIVISIVE, or UNBIASED models), and the results are similar to our model predictions from Figure 1C:

      Author Response Image 1. Behavioral data in the transfer phase, split over participants best explained by the RANGE (left), DIVISIVE (middle) or UNBIASED (right) model in experiment 1 (A) and experiment 2 (B) (versions a, b and c were pooled together).

      To keep things concise, we did not include this last figure in the revised manuscript, but it will be available for the interested readers in the Rebuttal letter.

      One possibility currently not considered by the authors is that both forms of value normalization are at work at the same time. It would be interesting to see the results from a hybrid model.

      R1.2 Thank you for the suggestion. We fitted and simulated a hybrid model as a weighted sum between both forms of normalization:

      First, the HYBRID model quantitatively wins over the DIVISIVE model (oosLLHYB vs oosLLDIV : t(149)=10.19, p<.0001, d=0.41) but not over the RANGE model, which produced a marginally higher log-likelihood (oosLLHYB vs oosLLRAN : t(149)=-1.82, p=.07, d=-0.008). Second, model simulations also suggest that the model would predict a very similar (if not worse) behavior compared to the RANGE model (see figure below). This is supported by the distribution of the weight parameter over our participants: it appears that, consistent with the model attributions presented above, most participants are best explained by a range-normalization rule (weight > 0.5, 87% of the participants, see figure below). Together, these results favor the RANGE model over the DIVISIVE model in our task.
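
      The weighted-sum idea behind the HYBRID model can be illustrated with a minimal sketch. It is written under stated assumptions and is not the authors' model code: range normalization rescales an outcome by the minimum and maximum of its context, divisive normalization divides it by the sum of the context's option values (the exact denominator is an assumption here), and a weight above 0.5 favours the range rule, as in the response above. The function names and example contexts are hypothetical.

```python
def range_normalize(outcome, context):
    """Rescale an outcome to [0, 1] using the context's min and max."""
    low, high = min(context), max(context)
    if high == low:  # degenerate context: no range to normalise by
        return 0.0
    return (outcome - low) / (high - low)


def divisive_normalize(outcome, context):
    """Divide an outcome by the summed value of all options in the context."""
    return outcome / sum(context)


def hybrid_normalize(outcome, context, weight):
    """Weighted sum of both rules; weight = 1 is pure range, 0 is pure divisive."""
    return (weight * range_normalize(outcome, context)
            + (1 - weight) * divisive_normalize(outcome, context))


# Adding a mid-value option leaves the range-normalized value of the best
# option unchanged but lowers its divisively normalized value.
two_options = [50, 86]
three_options = [50, 68, 86]
range_two = range_normalize(86, two_options)
range_three = range_normalize(86, three_options)
divisive_two = divisive_normalize(86, two_options)
divisive_three = divisive_normalize(86, three_options)
mixed = hybrid_normalize(86, three_options, weight=0.8)
```

      In this toy case the two rules diverge only in how they react to the number of options, which is the kind of contrast the transfer-phase comparisons rely on.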

      Out of curiosity, we also implemented a hybrid model as a weighted sum between absolute (UNBIASED model) and relative (RANGE model) valuations:

      Model fitting, simulations and comparisons slightly favored this hybrid model over the UNBIASED model (oosLLHYB vs oosLLUNB: t(149)=2.63, p=.0094, d=0.15), but also drastically favored the range normalization account (oosLLHYB vs oosLLRAN : t(149)=-3.80, p=.00021, d=-0.40, see Author Response Image 2).

      Author Response Image 2. Model simulations in the transfer phase for the RANGE model (left) and the HYBRID model (middle) defined as a weighted sum between divisive and range forms of normalization (top) and between unbiased (no normalization) and range normalization (bottom). The HYBRID model features an additional weight parameter, whose distribution favors the range normalization rule (right).

      To keep things concise, we did not include this last figure in the revised manuscript, but it will be available for the interested readers in the Rebuttal letter.

      Reviewer #2 (Public Review):

      This paper studies how relative values are encoded in a learning task, and how they are subsequently used to make a decision. This is a topic that integrates multiple disciplines (psych, neuro, economics) and has generated significant interest. The experimental setting is based on previous work from this research team that has advanced the field's understanding of value coding in learning tasks. These experiments are well-designed to distinguish some predictions of different accounts for value encoding. However there is an additional treatment that would provide an additional (strong) test of these theories: RN would make an equivalent set of predictions if the range were equivalently adjusted downward instead (for example by adding a "68" option to "50" and "86", and then comparing to WB and WT). The predictions of DN would differ however because adding a low-value alternative to the normalization would not change it much. Would the behaviour of subjects be symmetric for equivalent ranges, as RN predicts? If so this would be a compelling result, because symmetry is a very strong theoretical assumption in this setting.

      We thank the Reviewer for the overall positive appraisal concerning our work, but also for the stimulating and constructive remarks that we have addressed below. At this stage, we just wanted to mention that we also agree with the Reviewer that a design where we add a "68" option to "50" and "86" would also represent an important test of our hypotheses. This is why we had, in fact, run this experiment. Unfortunately, its results were somewhat buried in the Supplementary Materials of our original submission and not correctly highlighted in the main text. We modified the manuscript in order to make them more visible:

      Behavioral results in three experiments (N=50 each) featuring a slightly different design, where we added a mid value option (NT68) between NT50 and NT87 converge to the same broad conclusion: the behavioral pattern in the transfer phase is largely incompatible with that predicted by outcome divisive normalization during the learning phase (Figure 2-figure supplement 2).

      Reviewer #3 (Public Review):

      Bavard & Palminteri extend their research program by devising a task that enables them to disassociate two types of normalisation: range normalisation (by which outcomes are normalised by the min and max of the options) and divisive normalisation (in which outcomes are normalised by the average of the options in one's context). By providing 4 different training contexts in which the range of outcomes and number of options vary, they successfully show using 'ex ante' simulations that different learning approaches during training (unbiased, divisive, range) should lead to different patterns of choice in a subsequent probe phase during which all options from the training are paired with one another generating novel choice pairings. These patterns are somewhat subtle but are elegantly unpacked. They then fit participants' training choices to different learning models and test how well these models predict probe phase choices. They find evidence - both in terms of quantitative (i.e. comparing out-of-sample log-likelihood scores) and qualitative (comparing the pattern of choices observed to the pattern that would be observed under each model) fit - for the range model. This fit is further improved by adding a power parameter which suggests that alongside being relativised via range normalisation, outcomes were also transformed non-linearly.

      I thought this approach to address their research question was really successful and the methods and results were strong, credible, and robust (owing to the number of experiments conducted, the design used and combination of approaches used). I do not think the paper has any major weaknesses. The paper is very clear and well-written which aids interpretability.

      This is an important topic for understanding, predicting, and improving behaviour in a range of domains potentially. The findings will be of interest to researchers in interdisciplinary fields such as neuroeconomics and behavioural economics as well as reinforcement learning and cognitive psychology.

      We thank Prof. Garrett for his positive evaluation and supportive attitude.

    1. Author Response

      Reviewer #1 (Public Review):

      In this paper, Fernandes et al. take advantage of synthetic constructs to test how Bicoid (Bcd) activates its downstream target Hunchback (Hb). They explore synthetic constructs containing only Bcd, Bcd and Hb, and Bcd and Zelda binding sites. They use these to develop theoretical models for how Bcd drives Hb in the early embryo. They show that Hb sites alone are insufficient to drive further Hb expression.

      The paper's first half focuses on how well the synthetic constructs replicate the in vivo expression of hb. This approach is generally convincing, and the results are interesting. Consistent with previous work, they show that Bcd alone is sufficient to drive an expression profile that is similar to wild‐type, but the addition of Hb and Zelda is needed to generate precise and rapid formation of the boundaries. The experimental results are supported by modelling. The model does a nice job of encapsulating the key conclusions and clearly adds value to the analysis.

      In the second part of the paper, the authors use their synthetic approach to look at how the Hb boundary alters depending on Bcd dosage. This part asks whether the observed Bcd gradient is the same as the activity gradient of Bcd (i.e. the "active" part of Bcd is not a priori the same as the protein gradient). This is a very interesting problem and it is good that the authors have tried to tackle it. However, the strength of their conclusions needs to be substantially tempered as they rely on an overestimation of the Bcd gradient decay length.

      Comments:

      ‐ My major concern regards the conclusions for the final section on the activity gradient. In the Introduction it is stated: "[the Bcd gradient has] an exponential AP gradient with a decay length of L ~ 20% egg‐length (EL)". While this was the initial estimate (Houchmandzadeh et al., Nature 2002), later measurements by the Gregor lab (see Supplementary Material of Liu et al., PNAS 2013) found that "The mean length constant was reduced to 16.5 ± 0.7%EL after corrections for EGFP maturation". The original measurements by Houchmandzadeh et al. had issues with background control, that also led to the longer measured decay length. In later work, Durrieu et al., Mol Sys Biol 2018, found a similar scale for the decay length to Liu et al. Looking at Figure 5, a value of 16.5%EL for the decay length is fully consistent with the activity and protein gradients for Bcd being similar. In short, the strength of the conclusions clearly does not match the known gradient and should be substantially toned down.

      The reviewer is right: several studies aiming to quantitatively measure the Bicoid protein gradient ended up with quite different decay lengths.

      A summary of the various decay lengths measured, and the method used for these measurements is given below:

      As indicated, these measurements are quite variable among the different studies and the differences can potentially be attributed to different methods of detection (antibody staining on fixed samples vs fluorescent measurements on live sample) or to the type of protein detected (endogenous Bicoid vs fluorescently tagged).

      We agree with the reviewer that, given these discrepancies, the exact value of the Bcd protein gradient decay length is not known and that we only have measurements that put it between 16 and 25 % EL (see the Table above). Therefore, we agree that we should tone down the difference between the protein and activity gradients and focus on the measurements of the effective activity gradient decay length allowed by our synthetic reporters. This allows us to revisit the measurement of the Hill coefficient of the transcription step‐like response, which is based on the decay length of the Bcd protein gradient, assumed in previously published work to be 20% EL (Gregor et al., Cell, 2007a; Estrada et al., 2016; Tran et al., PLoS CB, 2018). Importantly, the new Hill coefficient allows us to set the Bcd system within the limits of an equilibrium model.
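      To make explicit why the inferred Hill coefficient depends on the assumed decay length, the relation can be sketched as below. This is a textbook derivation under two assumptions that are not taken from the manuscript (a purely exponential gradient and a Hill-type response); the symbols C_0, K, H, lambda and x_b are introduced here for illustration only and may differ from the authors' fitting procedure.

```latex
% Assumptions: exponential gradient with decay length \lambda, Hill-type response.
C(x) = C_0\, e^{-x/\lambda}, \qquad
P(x) = \frac{C(x)^{H}}{C(x)^{H} + K^{H}}
     = \frac{1}{1 + e^{H (x - x_b)/\lambda}}, \qquad
x_b = \lambda \ln\!\left(\frac{C_0}{K}\right)

% Slope of the expression pattern at the boundary position x_b:
\left.\frac{dP}{dx}\right|_{x_b} = -\frac{H}{4\lambda}
\quad\Longrightarrow\quad
H = 4\,\lambda\,\left|\frac{dP}{dx}\right|_{x_b}

% For a fixed measured slope, H scales linearly with the assumed decay length:
\frac{H_{\lambda = 16.5\%\,\mathrm{EL}}}{H_{\lambda = 20\%\,\mathrm{EL}}}
  = \frac{16.5}{20} = 0.825
```

      In this simplified picture, the same measured boundary steepness translates into a Hill coefficient that is about 17.5% lower when a decay length of 16.5% EL is assumed instead of 20% EL.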

      As mentioned by the reviewer, it is possible that the decay length of the protein gradient measured using antibody staining (Houchmandzadeh et al., Nature, 2002) was not correct due to background controls. Such measurements were also performed in Xu et al. (2015) and agree with the original measurements (Houchmandzadeh et al., Nature 2002). As indicated in the table above, all the other measurements of the Bcd protein gradient decay length were done using fluorescently tagged Bcd proteins, and we cannot exclude the possibility that the wt and tagged proteins have different decay lengths due to potentially different diffusion coefficients or half‐lives. Before drawing any conclusion on the exact value of the endogenous Bcd protein gradient decay length, it is essential to measure it again in conditions that correct for the background issues of immuno‐staining, as was done in Liu et al., PNAS, 2013 for the Bcd‐eGFP protein. In that study, the authors only measured the decay length of the Bcd fusion protein using immuno‐staining for the Bcd protein; unfortunately, they did not re‐measure the decay length of the endogenous Bcd protein gradient using immuno‐staining and the same procedure for background control. Therefore, they do not firmly exclude the possibility that the endogenous and tagged Bcd proteins have different decay lengths.

      We thank the reviewer for his comment, which helped us to clarify the message. In addition, as there is clearly an issue with the measurements of the Bcd protein gradient, we added a section in the SI (Section E) and a Table (Table S4) describing the various decay lengths measured for the Bcd or the Bcd‐fluorescently tagged protein gradients in previous studies. In the discussion, together with the possibility that the protein and activity gradients differ (as we originally proposed and believe is still a valid possibility), we also discuss the alternative possibility proposed by the reviewer, which is that the protein and activity gradients have the same decay length but that the decay length of the Bcd protein gradient was potentially not correctly evaluated.

      ‐ All of the experiments are performed in a background with the hb gene present. Does this impact on the readout, as the synthetic lines are essentially competing with the wild‐type genes? What controls were done to account for this?

      We agree with the reviewer that this concern might be particularly relevant at the hb boundary where a nucleus has been shown to only contain ~ 700 Bicoid molecules (Gregor et al., Cell, 2007b). However, ~1000 Bicoid binding regions have been identified by ChIP seq experiments in nc14 embryos (Hannon et al., Elife, 2017) and given that several Bcd binding sites are generally clustered together in a Bcd region, the number of Bcd binding sites in the fly genome is likely larger than 1000. It is much greater than the number of Bicoid binding sites in our synthetic reporters. Therefore, we think that it is unlikely that adding the synthetic reporters (which in the case of B12 only represents at most 1/100 of the Bcd binding sites in the genome) will severely alter the competition for Bcd binding between the other Bcd binding sites in the genome. Additionally, the insertion of a BAC spanning the endogenous hb locus with all its Bcd‐dependent enhancers did not affect (as far as we can tell) the regulation of the wildtype gene (Lucas, Tran et al., 2018).

      We have added a sentence concerning this point in the main text (lines 108 to 111).

      ‐ Further, the activity of the synthetic reporters depends on the location of insertion. Erceg et al. PLoS Genetics 2014 showed that the same synthetic enhancer can have a different readout depending on its genomic location. I'm aware that the authors use a landing site that appears to replicate similar hb kinetics, but did they try random insertion or other landing sites? In short, how robust are their results to the specific local genome site? This should have been tested, especially given the boldly written conclusions from the work.

      This concern has been tested and is addressed in Fig. S1, where we compare two random insertions of the hb‐P2 transgene (on chromosomes II and III; Lucas, Tran et al., 2018) with the insertion at the VK33 landing site that was used for the whole study. As shown in Fig. S1, the dynamics of transcription (kymographs) are very similar. In the main text, the reference to Fig. S1 is found in the Materials and Methods section (bottom of the first paragraph concerning the Drosophila stocks, line 518).

      ‐ Related to the above, it's also not obvious that readout is linear ‐ i.e. as more binding sites are added, there could be cooperativity between binding domains. This may have been accounted for in the model but it is not clear to me how.

      The reviewer is totally correct. It is clear from our data that the readout is not linear: going from B6 to B9 (a 1.5 X increase in the number of BS) leads to a 4.5 X greater activation rate, which argues against independent activation of transcription by individually bound Bcd TFs. In contrast, adding 3 more sites has almost no impact when comparing B9 to B12 (even though this corresponds to a 1.33 X increase in the number of BS). This issue has been rephrased in the main text (lines 200 to 203) and further developed for the modeling aspects in the SI section C and Figure S3. It is also discussed in the second paragraph of the discussion (lines 380 to 383).
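      The arithmetic behind this non-linearity argument can be sketched as follows. The site counts and the 4.5 X figure are taken from the response above; the "apparent exponent" is only an illustrative summary statistic introduced here (assuming a power-law relation between activation rate and site number), not a quantity reported by the authors.

```python
import math


def site_fold(n_from, n_to):
    """Fold increase in the number of binding sites between two reporters."""
    return n_to / n_from


def apparent_exponent(n_from, n_to, rate_fold):
    """Exponent k such that rate_fold == (n_to / n_from) ** k.

    k == 1 would correspond to independent (linear) contributions of each site.
    """
    return math.log(rate_fold) / math.log(site_fold(n_from, n_to))


print(site_fold(6, 9))                # B6 -> B9: 1.5
print(round(site_fold(9, 12), 2))     # B9 -> B12: 1.33
print(round(apparent_exponent(6, 9, 4.5), 2))  # well above 1, i.e. synergy
```

      An exponent well above 1 for B6 to B9 is what one would expect if the bound factors act synergistically rather than independently.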

      ‐ It would be good in the Introduction/Discussion to give a broader perspective on the advantages and disadvantages of the synthetic approach to study gene regulation. The intro only discusses Tran et al. Yet, there is a strong history of using this approach, which has also helped to reveal some of the approach's shortcomings. E.g. Gertz et al. Nature 2009 and Sharon et al. Nature Biotechnology 2012. Again, I may have missed it, but from my reading I cannot see any critical analysis of the pros/cons of the synthetic approach in development. This is necessary to give readers a clearer context.

      One sentence was added in the introduction concerning this point (lines 79 to 82).

      A short review concerning the synthetic approach in development has also been added at the beginning of the discussion (lines 347 to 359).

      Reviewer #2 (Public Review):

      It is known that Bicoid increases in concentration across the syncytial division cycles, the gradient length scale for Bicoid does not change, and hunchback also increases in concentration during the syncytial cycles but the sharp boundary of the hunchback gradient is constantly seen despite the change in concentration of Bicoid. This manuscript shows that by increasing the Bicoid concentration or by adding Zelda binding sites, the expression of hunchback can be recapitulated to that of a previously studied promoter for hunchback.

      I have the following comments to understand the implications of the study in the context of increasing concentrations of Bicoid during the syncytial division cycles:

      ‐ Bicoid itself is also increasing over the syncytial division cycles, how does this change in concentration of Bicoid affect the activation of the hunchback promoter given the cooperative binding of Bicoid and Bicoid and Zelda as documented by the study?

      We thank the reviewer for this remark about the dynamics of the Bcd gradient, which we may have taken for granted. A seminal work on the dynamics of the Bcd gradient using fluorescent‐tagged Bcd (Gregor et al, Cell, 2007a) has shown that the gradient of Bcd nuclear concentration (this nuclear concentration is the one that matters for transcription) remains stable over nuclear cycles, despite a global increase in the amount of Bcd in the embryo. This can be explained by the fact that Bcd molecules are imported into the nuclei and that the number of nuclei doubles at every cycle, such that both processes compensate each other. Thus, we assumed that the gradient of Bcd nuclear concentration was stable over nc11 to nc13.

      We have clarified this assumption in the model section in the manuscript (lines 165‐168).

      Supporting our assumption, when looking at the transcription dynamics regulated by Bcd, in Lucas et al, PLoS Gen, 2018, we observed very reproducible expression pattern dynamics of the hb‐P2 reporter at each cycle from nc11 to nc13. Such reproducibility in the pattern dynamics was also observed in the current work for the hb‐P2, B6, B9, B12 and H6B6 reporters (Fig. S6A). Also, in Lucas et al, PLoS Gen, 2018, the shift in the established boundary positions of the hb‐P2 reporter between nc11 and nc13 is ~2%EL (approximately one nucleus length, ~10μm) and is thus marginal.

      In addition, as mentioned in the text (lines 105 to 107), we only focused our analysis on nc13 data which are statistically stronger given the higher number of nuclei analyzed. Thus, any change of Bcd nuclear concentration that would happen over nuclear cycles will not matter.

      Concerning Zelda: Zelda’s transcriptional activity when measured on a reporter with only 6 Zld binding sites changes drastically over the nuclear cycles, with strong activity at nc11 and much weaker activity at nc13 (Fig S4A). This indicates that the changes in expression pattern dynamics of Z2B6 from nc11 to nc13 are caused predominantly by decreasing Zelda activity: the effect of Zld on the Z2B6 promoter is very strong during nc11 and nc12. It is also very strong at the beginning of nc13 (even though the Z6 reporter is almost silent) and becomes a bit weaker in the second part of nc13 (Fig S4B‐D).

      ‐ Does the change in concentration of Bicoid across the nuclear cycles shift the gradient similar to the change in numbers of Bicoid binding sites?

      In both Lucas et al, PLoS Gen, 2018 and in this work (Fig. 1, Fig. 3 and Fig. S6A), we found that the positions of the expression boundary are very reproducible and stable in time for hb‐P2, B6, B9, B12, H6B6 during the interphase of nc12 to 13. For hb‐P2, the averaged shift of the established boundary position in nc11, 12 and 13 is within 2 %EL. This averaged shift between the cycles is of similar magnitude to the difference caused by embryo‐to‐embryo variability within nc13 (~2 %EL) (Gregor et al, Cell, 2007b, Lucas et al, PloS Gen, 2018). This shift is much smaller than the difference between the expression boundary positions of B6 and B9 (~ 8 % EL) and between B6 and Z2B6 (~17.5 %EL) in nc13.

      For these reasons, we conclude that the differences between the expression patterns of B6, B9 and Z2B6 are caused predominantly by changing the TF binding site configurations of the reporters, rather than by variability in the Bcd gradient.

      The assumption of gradient stability has been clarified in the previous answer and in the manuscript (lines 165‐168).

      ‐ The intensity is a little higher for B9 and B12 at the anterior in 2B? Is this statistically different? Is this likely to change the amount of Bicoid expression at the locus and lead to more robust activation?

      We performed statistical tests to distinguish the spot intensities at the anterior pole for every pair of reporters in Fig. 2B (hb‐P2, B6, B9 and B12). All p‐values from pair‐wise KS tests are greater than 0.067, suggesting that the spot intensities at the anterior pole are not distinguishable between these reporters.
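      A pair-wise two-sample KS comparison of this kind can be sketched as below. This is a minimal standard-library illustration, not the authors' analysis code: the intensity values are made up, the function names are hypothetical, and the p-value uses the usual asymptotic Kolmogorov series, which is only an approximation for small samples.

```python
import bisect
import math
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    if d == 0:
        return 1.0
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Made-up anterior spot intensities for illustration only.
intensities = {
    "hb-P2": [10.1, 9.8, 10.4, 10.0, 9.9],
    "B6": [9.7, 10.2, 10.3, 9.6, 10.0],
    "B9": [10.5, 10.1, 9.9, 10.6, 10.2],
}
for r1, r2 in combinations(intensities, 2):
    d = ks_statistic(intensities[r1], intensities[r2])
    p = ks_pvalue_asymptotic(d, len(intensities[r1]), len(intensities[r2]))
    print(r1, r2, round(d, 3), round(p, 3))
```

      In practice a library routine such as `scipy.stats.ks_2samp` would normally be preferred, and testing several pairs raises the usual multiple-comparison considerations.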

      We have clarified this in the manuscript (line 157).

      ‐Is the fraction of active loci not changing across the syncytial cycles when the concentration of Bicoid also changes, and is this consistent with the synthetic promoters?

      To measure the reproducibility of the expression pattern dynamics in different nuclear cycles, we compared the boundary position of the fraction of active loci pattern as a function of time for all hbP2 and synthetic reporters (Fig. S6A). In this figure panel, for all reporters except Z2B6, the curves in nc12 and nc13 largely overlap, suggesting high reproducibility in the pattern dynamics between cycles and consequently low sensitivity to the subtle variation in the Bcd nuclear concentration gradient between the cycles.

      For Z2B6, we attributed the difference in pattern dynamics between nc12 and nc13 to the changes in Zelda activity, as validated independently with a synthetic reporter with only 6 Zld binding sites (Fig. S4A).

      ‐How do the numbers of Hb BS change the expression of Hb? H6B6 has 6 Hb BS whereas the Hb‐P2 has 1? Are more controls needed to compare these 2 contexts?

      As our goal was to determine to which mechanistic step of our model each TF (Bcd, Hb, Zld) contributed, we added BS numbers that are much higher than in the hb‐P2 promoter. The added number of Hb BS remains very low when compared to the total number of Hb binding sites in the entire genome (Kaplan et al, PLOS Gen, 2011); therefore, it is very unlikely to affect the endogenous expression of Hb protein.

      We clarified this in the manuscript (lines 211 to 212).

      Does Zelda concentration change across the syncytial division cycles? How does the change in concentration in the natural context affect the promoter activation of Hb?

      Zelda concentration is stable over the nuclear cycles, as observed with the fluorescently‐tagged Zld protein (Dufourt et al., Nat Com, 2018). However, Zelda’s transcriptional activity when measured on a reporter with only 6 Zld binding sites changes drastically over the nuclear cycles, with strong activity at nc11 and much weaker activity at nc13 (Fig S4A, this work).

      The impact of this change in Zld activity can be observed with the Z2B6 promoter, with the expression boundary moving from the posterior region toward the anterior region over the nuclear cycles (Fig. S4B‐D). However, we don’t detect any changes in the expression pattern dynamics of hb‐P2 over the nuclear cycles (Fig. S6A and in Lucas et al., PLoS Gen, 2018).

      We have clarified this in lines 250‐251 of the main manuscript.

      ‐Changing the dose of Bicoid shifts the boundary of hunchback expression. It would be nice to model or test this in the context of varying doses of zelda or even reason this with respect to varying doses of zelda across the syncytial division cycles.

      We thank the reviewer for this insight. Concerning Zelda, we did not perform any experiment reducing the amount of Zelda in the embryo in this study. However, in a previous study (Lucas et al., PLoS Genetics, 2018), we observed that the boundary of hb was shifted towards the anterior when decreasing the amount of Zelda, consistent with the fact that the dose of Zelda is critical to set the boundary position and the threshold of Bcd concentration required for activation. However, as Zelda is distributed homogeneously along the AP axis, it cannot per se bring positional information to the system.

      Reviewer #3 (Public Review):

      I think the framing could be improved to better reflect the contribution of the work. From the abstract, for example, it's unclear to me what the authors think is the most meaningful conclusion. Is it the observations about the finer details of TF regulation (bursting dynamics), the fact that Bcd is probably the sole source of "positional information" for hb‐p2, that Bcd exists in active/inactive form, or the fact that an equilibrium model probably suffices to explain what we observe? The first sentence itself seems to suggest this paper will discuss "dynamic positional information", in which case it's somewhat misleading to say this kind of work is "largely unexplored"; Johannes Jaeger in particular has been a strong proponent of this view since at least 2004. On that note some particularly relevant recent papers in the Drosophila early embryo include:

      1) Jaeger and Verd (2020) Curr Topics Dev Biol

      2) Verd et al. (2017) PLoS Comp Biol

      3) Huang, Amourda, et al. and Saunders (2017) eLife

      4) Yang, Zhu, et al. (2020) eLife [see also the second half of Perkins (2021) PLoS Comp Biol for further discussion of that model]

      ‐Some reviews from James Briscoe also discuss this perspective.

      We agree with the reviewer that the phrasing of the abstract was not clear enough to emphasize the contribution of the work and we are also sorry if it suggested that the dynamic positional information is largely unexplored because this was not at all our intention.

      We rephrased the abstract aiming to better highlight the most meaningful conclusions.

      ‐I would also recommend modifying the title to reflect the biology found in the new results.

      We modified the title to better reflect the new results: “Synthetic reconstruction of the hunchback promoter specifies the role of Bicoid, Zelda and Hunchback in the dynamics of its transcription”

      ‐A major point that the authors should address is the design of the synthetic constructs. From table S1, the sites are often very closely linked (4‐7 base pairs). From the footprint of these proteins, we know they can cover DNA across this size (see, https://pubmed.ncbi.nlm.nih.gov/8620846/). As such, there may be direct competition/steric hindrance (see https://pubmed.ncbi.nlm.nih.gov/28052257/). What impact does this have on their interpretations? Note also that the native enhancer has spaced sites with variable identities.

      We completely agree with the reviewer's comment, in the sense that we named our reporters according to the number (N) of Bcd binding site sequences that they contain, even though we cannot prove definitively that they can effectively be bound simultaneously by N Bcd molecules. It is thus possible that B9 is not a B9 but an effective B6 (i.e. B9 can only be bound simultaneously by 6 molecules) if, for instance, the binding of a Bcd molecule to one site would prevent the binding of another Bcd molecule to a nearby site (as proposed by the reviewer in the case of direct competition or steric hindrance).

      Even though we cannot exclude this possibility, we think that our use of B6, B9 and B12, in reference to the 6 Bcd BS of the hb‐P2 promoter, is relevant for several reasons: i) some of the Bcd BS in the hb‐P2 promoter are also very close to each other (see Table S1); ii) the synthetic constructs were designed by multimerizing a series of 3 strong Bcd binding sites with a spacing similar to that found for the closest sites in the hb‐P2 promoter (as shown in Figure 1A and Table S1); iii) in vitro footprinting experiments have shown that the Bicoid protein binds more efficiently to sites of the hb‐P2 promoter that are close to each other, and this has even been interpreted as binding cooperativity (Ma et al., 1996); iv) even though these experiments were not performed with full‐length proteins, two molecules of the paired homeodomain (from the same family of DNA binding domains as Bcd) are able to bind simultaneously to two binding sites separated by only 2 base pairs. This binding to very close sites is even cooperative, whereas when the two sites are 5 base pairs or more apart, simultaneous binding to the two sites occurs without cooperativity (Wilson et al., 1993).

      Conversely, as it is very difficult to demonstrate that 9 Bcd molecules can effectively bind to our B9 promoter, it is very difficult to know exactly how many binding sites for Bcd the hb‐P2 contains, and a large debate concerning not only the number but also the identity of the Bcd sites in the hb promoter is still ongoing (Park et al., 2019; Ling et al., 2019).

      As we cannot exclude the possibility that B9 is an effective B6, it remains possible that B9 and hb‐P2 (which is supposed to contain only 6 sites) have the same number of effective Bcd binding sites, and this could explain why the two reporters have very similar transcription dynamics and features.

      Regarding other interpretations in the manuscript, we identified two other aspects that will be affected if our synthetic reporters have fewer effective sites than the number of sites they carry. The first one concerns the synergy: the 1.5‐fold increase in the number of sites from B6 to B9 might be over‐estimated, but this would only strengthen the synergistic effect given the 4.5‐fold difference in activity between the two reporters (Fig. S3). The second one concerns the discussion on the Hill coefficient and the decay length, where the effective number of binding sites (N) is required to determine the limit of concentration sensing (Fig. 5). This would be particularly important for the hb‐P2 promoter.

      Except for these specific points, we don’t think that the possibility that the reporters do not contain exactly as many effective binding sites as proposed has a huge impact on our interpretations and the general message conveyed in this manuscript. Most importantly, it is very clear that our B6 and B9 reporters differ only by three Bcd binding sites and yet have very distinct expression dynamics: while B9 recapitulates almost all transcription features of hb‐P2, B6 is far from achieving it. Similarly, H6B6 and Z2B6 have very different transcription features from B6, and these differences have been key for understanding the mechanistic functions of the three TFs we studied.

      These points have been added to the discussion (lines 400 to 414).

    1. Author Response

      Reviewer #1 (Public Review):

      Overall, the science is sound and interesting, and the results are clearly presented. However, the paper falls in-between describing a novel method and studying biology. As a consequence, it is a bit difficult to grasp the general flow, central story and focus point. The study does uncover several interesting phenomena, but none are really studied in much detail and the novel biological insight is therefore a bit limited and lost in the abundance of observations. Several interesting novel interactions are uncovered, in particular for the SPS sensor and GAPDH paralogs, but these are not followed up on in much detail. The same can be said for the more general observations, eg the fact that different types of mutations (missense vs nonsense) in different types of genes (essential vs non-essential, housekeeping vs. stress-regulated...) cause different effects.

      This is not to say that the paper has no merit - far from it even. But, in its current form, it is a bit chaotic. Maybe there is simply too much in the paper? To me, it would already help if the authors would explicitly state that the paper is a "methods" paper that describes a novel technique for studying the effects of mutations on protein abundance, and then goes on to demonstrate the possibilities of the technology by giving a few examples of the phenomena that can be studied. The discussion section ends in this way, but it may be helpful if this was moved to the end of the introduction.

      We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      Schubert et al. describe a new pooled screening strategy that combines protein abundance measurements of 11 proteins determined via FACS with genome-wide mutagenesis of stop codons and missense mutations (achieved via a base editor) in yeast. The method allows the identification of genetic perturbations that affect steady state protein levels (vs transcript abundance), and in this way defines regulators of protein abundance. The authors find that perturbation of essential genes more often alters protein abundance than perturbation of nonessential genes, and that proteins with core cellular functions more often decrease in abundance in response to genetic perturbations than stress proteins. Genes whose knockouts affected the level of several of the 11 proteins were enriched in protein biosynthetic processes, while genes whose knockouts affected specific proteins were enriched for functions in transcriptional regulation. The authors also leverage the dataset to confirm known and identify new regulatory relationships, such as a link between the SPS amino acid sensor and the stress response gene Yhb1 or between Ras/PKA signalling and GAPDH isoenzymes Tdh1, 2, and 3. In addition, the paper contains a section on benchmarking of the base editor in yeast, where it has not been used before.

      Strengths and weaknesses of the paper

      The authors establish the BE3 base editor as a screening tool in S. cerevisiae and very thoroughly benchmark its functionality for single edits and in different screening formats (fitness and FACS screening). This will be very beneficial for the yeast community.

      The strategy established here allows measuring the effect of genetic perturbations on protein abundances in highly complex libraries. This complements capabilities for measuring effects of genetic perturbations on transcript levels, which is important as for some proteins mRNA and protein levels do not correlate well. The ability to measure proteins directly therefore promises to close an important gap in determining all their regulatory inputs. The strategy is furthermore broadly applicable beyond the current study. All experimental procedures are very well described and plasmids and scripts are openly shared, maximizing utility for the community.

      There is a good balance between global analyses aimed at characterizing properties of the regulatory network and more detailed analyses of interesting new regulatory relationships. Some of the key conclusions are further supported by additional experimental evidence, which includes re-making specific mutations and confirming their effects on protein levels by mass spectrometry.

      The conclusions of the paper are mostly well supported, but I am missing some analyses on reproducibility and potential confounders and some of the data analysis steps should be clarified.

      The paper starts on the premise that measuring protein levels will identify regulators and regulatory principles that would not be found by measuring transcripts, but since the findings are not discussed in light of studies looking at mRNA levels it is unclear how the current study extends knowledge regarding the regulatory inputs of each protein.

      See response to Comment #10.

      Specific comments regarding data analysis, reproducibility, confounders

      1) The authors use the number of unique barcodes per guide RNA rather than barcode counts to determine fold-changes. For reliable fold changes the number of unique barcodes per gRNA should then ideally be in the 100s for each guide, is that the case? It would also be important to show the distribution of the number of barcodes per gRNA and their abundances determined from read counts. I could imagine that if the distribution of barcodes per gRNA or the abundance of these barcodes is highly skewed (particularly if there are many barcodes with only few reads) that could lead to spurious differences in unique barcode number between the high and low fluorescence pool. I imagine some skew is present as is normal in pooled library experiments. The fold-changes in the control pools could show whether spurious differences are a problem, but it is not clear to me if and how these controls are used in the protein screen.

      Because of the large number of screens performed in this study (11 proteins, with 8 replicates for each) we had to trade off sequencing depth and power against cell sorting time and sequencing cost, resulting in lower read and barcode numbers than what might be ideally aimed for. As described further in the response to Comment #5, we added a new figure to the manuscript that shows that the correlation of fold-changes between replicates is high (Figure 3–S1A). The second figure below shows that the correlation between the number of unique barcodes and the number of reads per gRNA is highly significant (p < 2.2e-16).
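
The barcode-based fold-change discussed in this comment can be illustrated with a minimal sketch. It assumes that each sequencing read yields a (gRNA, plasmid barcode) pair, that fold-changes compare the high and low fluorescence pools, and that each pool is normalized by its total number of unique barcodes. The function names, the pseudocount, and the toy reads are hypothetical and are not taken from the authors' scripts.

```python
from collections import defaultdict
from math import log2


def unique_barcodes_per_grna(reads):
    """Count distinct barcodes per gRNA.

    reads: iterable of (grna_id, barcode) pairs, one pair per sequencing read.
    Repeated reads of the same barcode are counted once.
    """
    seen = defaultdict(set)
    for grna_id, barcode in reads:
        seen[grna_id].add(barcode)
    return {grna_id: len(barcodes) for grna_id, barcodes in seen.items()}


def log2_fold_changes(high_reads, low_reads, pseudocount=1.0):
    """log2(high / low) of unique-barcode counts, normalized per pool.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gRNA is absent from one pool.
    """
    high = unique_barcodes_per_grna(high_reads)
    low = unique_barcodes_per_grna(low_reads)
    high_total = sum(high.values())
    low_total = sum(low.values())
    result = {}
    for grna_id in set(high) | set(low):
        h = (high.get(grna_id, 0) + pseudocount) / high_total
        l = (low.get(grna_id, 0) + pseudocount) / low_total
        result[grna_id] = log2(h / l)
    return result


# Toy reads (invented for illustration only).
high_reads = [("g1", "a"), ("g1", "a"), ("g1", "b"), ("g1", "c"), ("g2", "x")]
low_reads = [("g1", "a"), ("g2", "x"), ("g2", "y"), ("g2", "z")]
lfc = log2_fold_changes(high_reads, low_reads)
```

Because duplicate reads of one barcode collapse to a single count, this measure is insensitive to read-depth skew within a barcode, but it becomes noisy when a gRNA is represented by only a few barcodes, which is the concern raised by the reviewer.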

      2) I like the idea of using an additional barcode (plasmid barcode) to distinguish between different cells with the same gRNA - this would directly allow to assess variability and serve as a sort of replicate within replicate. However, this information is not leveraged in the analysis. It would be nice to see an analysis of how well the different plasmid barcodes tagging the same gRNA agree (for fitness and protein abundance), to show how reproducible and reliable the findings are.

      We agree with the reviewer that this would be nice to do in principle, but our sequencing depth for the sorted cell populations was not high enough to compare the same barcode across the low/unsorted/high samples. See also our response to Comment #5 for the replicate analyses.

      3) From Fig 1 and previous research on base editors it is clear that mutation outcomes are often heterogeneous for the same gRNA and comprise a substantial fraction of wild-type alleles, alleles where only part of the Cs in the target window or where Cs outside the target window are edited, and non C-to-T edits. How does this reflect on the variability of phenotypic measurements, given that any barcode represents a genetically heterogeneous population of cells rather than a specific genotype? This would be important information for anyone planning to use the base editor in future.

      We agree with the reviewer that the heterogeneity of editing outcomes is an important point to keep in mind when working with base editors. In genetic screens, like the ones described here, often the individual edit is less important, and the overall effects of the base editor are specific/localized enough to obtain insights into the effects of mutations in the area where the gRNA targets the genome. For example, in our test screens for Canavanine resistance and fitness effects, in which we used gRNAs predicted to introduce stop codons into the CAN1 gene and into essential genes, respectively, we see the expected loss-of-function effect for a majority of the gRNAs (canavanine screen: expected effect for 67% of all gRNAs introducing stop codons into CAN1; fitness screen: expected effect for 59% of all gRNAs introducing stop codons into essential genes) (Figure 2). In the canavanine screen, we also see that gRNAs predicted to introduce missense mutations at highly conserved residues are more likely to lead to a loss-of-function effect than gRNAs predicted to introduce missense mutations at less conserved residues, further highlighting the differentiated results that can be obtained with the base editor despite the heterogeneity in editing outcomes overall. We would certainly advise anyone to confirm by sequencing the base edits in individual mutants whenever a precise mutation is desired, as we did in this study when following up on selected findings with individual mutants.

      4) How common are additional mutations in the genome of these cells and could they confound the measured effects? I can think of several sources of additional mutations, such as off-target editing, edits outside the target window, or when 2 gRNA plasmids are present in the same cell (both target windows obtain edits). Could some of these events explain the discrepancy in phenotype for two gRNAs that should make the same mutation (Fig S4)? Even though BE3 has been described in mammalian cells, an off-target analysis would be desirable as there can be substantial differences in off-target behavior between cell types and organisms.

      Generally, we are not very concerned about random off-target activity of the base editor because we would not expect this to cause a consistent signal that would be picked up in our screen as a significant effect of a particular gRNA. Reproducible off-target editing with a specific gRNA at a site other than the intended target site would be problematic, though. We limited the chance of this happening by not using gRNAs that may target similar sequences to the intended target site in the genome. Specifically, we excluded gRNAs that have more than one target in the genome when the 12 nucleotides in the seed region (directly upstream of the PAM site) are considered (DiCarlo et al., Nucleic Acids Research, 2013).
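
The seed-based uniqueness filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' actual pipeline: it assumes an NGG PAM, takes the seed to be the 12 PAM-proximal nucleotides of the protospacer, and scans both strands of a genome given as a plain string. All function names and sequences are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_seed_targets(seed, genome):
    """Count sites where `seed` is directly followed by an NGG PAM.

    Both strands are scanned; a lookahead is used so that overlapping
    sites are also counted.
    """
    pattern = re.compile("(?=" + re.escape(seed) + "[ACGT]GG)")
    total = 0
    for strand in (genome, reverse_complement(genome)):
        total += sum(1 for _ in pattern.finditer(strand))
    return total


def has_unique_seed(grna, genome, seed_len=12):
    """True if the PAM-proximal seed of `grna` has exactly one target."""
    seed = grna[-seed_len:]  # 3' end of the protospacer, next to the PAM
    return count_seed_targets(seed, genome) == 1


# Toy sequences (invented for illustration only).
grna = "ACGTACGTAAGGCTTACCGA"
genome_one_site = "TTTT" + grna + "TGG" + "TTTTTTTT"
genome_two_sites = genome_one_site + "CCCCCCCC" + grna[-12:] + "AGG" + "TT"
```

In this sketch a gRNA is kept only when `has_unique_seed` returns `True`; the second toy genome contains a second seed match next to a PAM, so the same gRNA would be excluded there.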

      We do observe some off-target editing right outside the target window, but generally at much lower frequency than the on-target editing in the target window (Figure 1B and Figure 1–S2). Since for most of our analyses we grouped perturbations per gene, such off-target edits should not affect our findings. In addition, we validated key findings with independent experiments. For our study, we used the Base Editor v3 (Komor et al., Nature, 2016); more recently, additional base editors have been developed that show improved accuracy and efficiency, and we would recommend these base editors when starting a new study (see, e.g., Anzalone et al., Nature Biotechnology, 2020).

      We are not concerned about cases in which one cell gets two gRNAs, since the chance that the same two gRNAs end up in one cell repeatedly is low, and such events would therefore not result in a significant signal in our screens.

      We don’t think that off-target mutations can explain the discrepancy between pairs of gRNAs that should introduce the same mutation (Figure 3–S1). The effects of the two gRNAs are actually well correlated, but, often, one of the two gRNAs doesn’t pass our significance cut-off or simply doesn’t edit efficiently (i.e., most discrepancies arise from false negatives rather than false positives). We may therefore miss the effects of some mutations, but we are unlikely to draw erroneous conclusions from significant signals.

      5) In the protein screen normalization uses the total unique barcode counts. Does this efficiently correct for differences from sequencing (rather than total read counts or other methods)? It would be nice to see some replicate plots for the analysis of the fitness as well as the protein screen to be able to judge that.

      We made a new figure that shows a replicate comparison for the protein screen (see below; in the manuscript it is Figure 3–S1A) and commented on it in the manuscript. For this analysis, the eight replicates for each protein were split into two groups of four replicates each and analyzed the same way as the eight replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16). The second figure shows that the total number of reads and the total number of unique barcodes are well correlated.

      For the fitness screen, we used read counts rather than barcode counts for the analysis since read counts better reflect the dropout of cells due to reduced fitness. The figure below shows a replicate comparison for the fitness screen. For this analysis, the four replicates were split into two groups of two replicates each and analyzed the same way as the four replicates. The correlation between the two groups of replicates is highly significant (p < 2.2e-16).
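
A split-half replicate comparison of this kind can be sketched as below. The sketch assumes that each gRNA has one log2 fold-change per replicate and that each half is summarized by its mean; the authors state only that each half was analyzed the same way as the full set, so the averaging step, the function names, and the toy values are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def split_half_correlation(lfc_by_grna):
    """Correlate per-gRNA means of the first and second half of replicates.

    lfc_by_grna: dict mapping gRNA id -> list of per-replicate log2
    fold-changes (same replicate order for every gRNA).
    """
    first_half, second_half = [], []
    for replicates in lfc_by_grna.values():
        half = len(replicates) // 2
        first_half.append(sum(replicates[:half]) / half)
        second_half.append(sum(replicates[half:]) / (len(replicates) - half))
    return pearson_r(first_half, second_half)


# Toy data (invented): eight replicates per gRNA, perfectly reproducible.
consistent = {"g1": [1.0] * 8, "g2": [0.0] * 8, "g3": [-1.0] * 8}
# Toy data (invented): the two halves disagree in sign.
discordant = {"g1": [1, 1, -1, -1], "g2": [-1, -1, 1, 1], "g3": [0, 0, 0, 0]}
```

A correlation near 1 between the two halves indicates that the per-gRNA effects are reproducible across replicates, whereas values near 0 or below would point to noise dominating the signal.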

      6) In the main text the authors mention very high agreement between gRNAs introducing the same mutation but this is only based on 20 or so gRNA pairs; for many more pairs that introduce the same mutation only one reaches significance, and the correlation in their effects is lower (Fig S4). It would be better to reflect this in the text directly rather than exclusively in the supplementary information.

      We clarified this in the manuscript main text: “For 78 of these gRNA pairs, at least one gRNA had a significant effect (FDR < 0.05) on at least one of the eleven proteins; their effects were highly correlated (Pearson’s R2 = 0.43, p < 2.2E-16) (Figure 3–S1B). For the 20 gRNA pairs for which both gRNAs had a significant effect, the correlation was even higher (Pearson’s R2 = 0.819, p = 8.8e-13) (Figure 3–S1C). These findings show that the significant gRNA effects that we identify have a low false positive rate, but they also suggest that many real gRNA effects are not detected in the screen due to limitations in statistical power.”

      7) When the different gRNAs for a targeted gene are combined, instead of using an averaged measure of their effects the authors use the largest fold-change. This seems not ideal to me as it is sensitive to outliers (experimental error or background mutations present in that strain).

      We agree that the method we used is more sensitive to outliers than averaging per gene. However, because many gRNAs have no effect either because they are not editing efficiently or because the edit doesn’t have a phenotypic consequence, an averaging method across all gRNAs targeting the same gene would be too conservative and not properly capture the effect of a perturbation of that gene.
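
The trade-off between the two per-gene summaries can be made concrete with a small sketch. It assumes that each gRNA contributes one signed log2 fold-change and that "largest fold-change" means largest in absolute value; the gene names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_per_gene(grna_effects):
    """Summarize gRNA effects per gene in two ways.

    grna_effects: iterable of (gene, log2_fold_change) tuples.
    Returns a dict: gene -> {"strongest": ..., "mean": ...}, where
    "strongest" is the fold-change with the largest absolute value.
    """
    by_gene = defaultdict(list)
    for gene, lfc in grna_effects:
        by_gene[gene].append(lfc)
    return {
        gene: {"strongest": max(values, key=abs), "mean": mean(values)}
        for gene, values in by_gene.items()
    }


# Toy effects (invented): two gRNAs for GENE_A do not edit efficiently.
effects = [("GENE_A", 0.0), ("GENE_A", 0.1), ("GENE_A", -2.0), ("GENE_B", 0.5)]
summary = summarize_per_gene(effects)
```

For the hypothetical `GENE_A`, the strongest-effect summary reports -2.0 while the mean is diluted to about -0.63 by the two inactive gRNAs. This shows both why averaging is conservative and why the strongest-effect summary is more exposed to a single outlier.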

      8) Phenotyping is performed directly after editing, when the base editor is still present in the cells and could still interact with target sites. I could imagine this could lead to reduced levels of the proteins targeted for mutagenesis as it could act like a CRISPRi transcriptional roadblock. Could this enhance some of the effects or alter them in case of some missense mutations?

      To reduce potential “CRISPRi-like” effects of the base editor on gene expression, we placed the base editor under a galactose-inducible promoter. For both the fitness and protein screens we grew the cultures in media without galactose for another 24 hours (fitness screen) or 8-9 hours (protein screens) before sampling. In the latter case, this recovery time corresponded to more than three cell divisions, after which we assume base editor levels to have strongly decreased, and therefore to no longer interfere with transcription. This is also supported by our ability to detect discordant effects of gRNAs targeting the same gene (e.g., the two mutations leading to loss-of-function and gain-of-function of RAS2), which would otherwise be overshadowed by a CRISPRi effect.

      9) I feel that the main text does not reflect the actual editing efficiency very well (the main numbers I noticed were 95% C to T conversion and 89% of these occurring in a specific window). More informative for interpreting the results would be to know what fraction of the alleles show an edit (vs wild-type) and how many show the 'complete' edit (as the authors assume 100% of the genotypes generated by a gRNA to be conversion of all Cs to Ts in the target window). It would be important to state in the main text how variable this is for different gRNAs and what the typical purity of editing outcomes is.

      We now show the editing efficiency and purity in a new figure (Figure 1B), and discuss it in the main text as follows: “We found that the target window and mutagenesis pattern are very similar to those described in human cells: 95% of edits are C-to-T transitions, and 89% of these occurred in a five-nucleotide window 13 to 17 base pairs upstream of the PAM sequence (Figure 1A; Figure 1–S2) (Komor et al., 2016). Editing efficiency was variable across the eight gRNAs and ranged from 4% to 64% if considering only cases where all Cs in the window are edited; percentages are higher if incomplete edits are considered, too (Figure 1B).”

      Comments regarding findings

      10) It would be nice to see a comparison of the results to the effects of ~1500 yeast gene knockouts on cellular transcriptomes (https://doi.org/10.1016/j.cell.2014.02.054). This would show where the current study extends established knowledge regarding the regulatory inputs of each protein and highlight the importance of directly measuring protein levels. This would be particularly interesting for proteins whose abundance cannot be predicted well from mRNA abundance.

      We agree with the reviewer that it would be very interesting to compare the effect of perturbations on mRNA vs protein levels. We have compared our protein-level data to mRNA-level data from Kemmeren and colleagues (Kemmeren et al., Cell 2014), and we find very good agreement between the effects of gene perturbations on mRNA and protein levels when considering only genes with q < 0.05 and Log2FC > 0.5 in both studies (Pearson’s R = 0.79, p < 5.3e-15).

      Gene perturbations with effects detected only on mRNA but not protein levels are enriched in genes with a role in “chromatin organization” (FDR = 0.01; as a background for the analysis, only the 1098 genes covered in both studies were considered). This suggests that perturbations of genes involved in chromatin organization tend to affect mRNA levels but are then buffered and do not lead to altered protein levels. There was no enrichment of functional annotations among gene perturbations with effects on protein levels but not mRNA levels.

      We did not include these results in the manuscript because there are some limitations to the conclusions that can be drawn from these comparisons, including that our study has a relatively high number of false negatives, and that the genes perturbed in the Kemmeren et al. study were selected to play a role in gene regulation, meaning that differences in mRNA-vs-protein effects of perturbations are limited to this function, and other gene functions cannot be assessed.

      11) The finding that genes that affect only one or two proteins are enriched for roles in transcriptional regulation could be a consequence of 'only' looking at 10 proteins rather than a globally valid conclusion. Particularly as the 10 proteins were selected for diverse functions that are subject to distinct regulatory cascades. ('only' because I appreciate this was a lot of work.)

      We agree with this, and we think it is clear in the abstract and the main text of the manuscript that here we studied 11 proteins. We made this point also more explicit in the discussion, so that it is clear for readers that the findings are based on the 11 proteins and may not extrapolate to the entire yeast proteome.

      Reviewer #3 (Public Review):

      This manuscript presents two main contributions. First, the authors modified a CRISPR base editing system for use in an important model organism: budding yeast. Second, they demonstrate the utility of this system by using it to conduct an extremely high-throughput study of the effects of mutation on protein abundance. This study confirms known protein regulatory relationships and detects several important new ones. It also reveals trends in the type of mutations that influence protein abundances. Overall, the findings are of high significance and the method appears to be extremely useful. I found the conclusions to be justified by the data.

      One potential weakness is that some of the methods are not described in the main body of the paper, so the reader has to really dive into the methods section to understand particular aspects of the study, for example, how the fitness competition was conducted.

      We expanded the first section for better readability.

      Another potential weakness is the comparison of this study (of protein abundances) to previous studies (of transcript abundances) was a little cursory, and left some open questions. For example, is it remarkable that the mutations affecting protein abundance are predominantly in genes involved in translation rather than transcription, or is this an expected result of a study focusing on protein levels?

      We thank the reviewer for pointing out that this paragraph requires more explanation. We expanded it as follows: “Of these 29 genes, 21 (72%) have roles in protein translation—more specifically, in ribosome biogenesis and tRNA metabolism (FDR < 8.0e-4, Figure 5C). In contrast, perturbations that affect the abundance of only one or two of the eleven proteins mostly occur in genes with roles in transcription (e.g., GO:0006351, FDR < 1.3e-5). Protein biosynthesis entails both transcription and translation, and these results suggest that perturbations of translational machinery alter protein abundance broadly, while perturbations of transcriptional machinery can tune the abundance of individual proteins. Thus, genes with post-transcriptional functions are more likely to appear as hubs in protein regulatory networks, whereas genes with transcriptional functions are likely to show fewer connections.”

      Overall, the strengths of this study far outweigh these weaknesses. This manuscript represents a very large amount of work and demonstrates important new insights into protein regulatory networks.

    1. Author Response

      Reviewer #2 (Public Review):

      The authors seek to determine how various species combine their effects on the growth of a species of interest when part of the same community.

      To this end, the authors carry out an impressive experiment containing what I believe must be one of the largest pairwise + third-order co-culture experiments done to date, using a high-throughput co-culture system they had co-developed in previous work. The unprecedented nature of this data is a major strength of the paper. The authors also discover that species combine their effect through "dominance", i.e. the strongest effect masks the others. This is important as it calls into question the common assumption of additivity that is implicit in the choice of using Lotka-Volterra models.

      A stronger claim (i.e. in the abstract) is that the joint effect of multiple species on the growth of another can be derived from the effects of individual species. Unless I am misunderstanding something, this statement may have to be qualified a little, as the authors show that a model based on pairwise dominance (i.e. the strongest pairwise) does a somewhat better job (lower RMSD, though granted, not by much, 0.57 vs 0.63) than a model based on single species dominance. That is, the effect of the strongest pair predicts the effect of a trio better than the effect of the strongest single species does.

      This issue makes one wonder whether, had the authors included higher-order combinations of species (i.e. five-member consortia or higher), the strongest-effect trio would have predicted better than the strongest-effect pair, which in turn is better predictor than the strongest-effect species. This is important, as it would help one determine to what extent the strongest-effect model would work in more diverse communities, such as those one typically finds in nature. Indeed, the authors find that the predictive ability of the strongest effect species is much stronger for pairs than it is for trios (RMSD of 0.28 vs 0.63). Does the predictive ability of the single species model decline faster and faster as diversity grows beyond 4-member consortia?

      Thank you for raising this important point. It is true that in our study we see that single species predict pairs better than trios, and that pairs predict trios better than single species. As we did not perform experiments on more diverse communities (n>4), we are not sure if or how these rules will scale up. We explicitly address these caveats in our revised discussion.

      Reviewer #3 (Public Review):

      A problem in synthetic ecology is that one can't brute-force complex community design because combinatorics make it basically impossible to screen all possible communities from a bank of possible species. Therefore, we need a way to predict phenomena in complex communities from phenomena in simple communities. This paper aims to improve this predictive ability by comparing a few different simple models applied to a large dataset obtained with the use of the author's "kchip" microfluidics device. The main question they ask is whether the effect of two species on a focal species is predicted from the mean, the sum, or the max of the effect of each single "affecting" species on the focal species. They find that the max effect is often the best predictor, in the sense of minimizing the difference between predicted effect and measured effect. They also measure single-species trait data for their library of strains, including resource niche and antibiotic resistance, and then find that Pearson correlations between distance calculations generated from these metrics and the effect of added species are weak and unpredictive. This work is largely well-done, timely and likely to be of high interest to the field, as predicting ecosystem traits from species traits is a major research aim.
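
The three candidate models named in this summary (mean, sum, and max of the single-species effects) can be written down in a few lines. The sketch assumes that each effect is a signed number and that "max" refers to the effect with the largest absolute value; the function names and the toy observations are hypothetical and are not data from the study.

```python
from math import sqrt


def additive(effect_a, effect_b):
    """Sum of the two single-species effects."""
    return effect_a + effect_b


def mean_effect(effect_a, effect_b):
    """Average of the two single-species effects."""
    return (effect_a + effect_b) / 2


def strongest(effect_a, effect_b):
    """The single-species effect with the larger absolute value."""
    return effect_a if abs(effect_a) >= abs(effect_b) else effect_b


MODELS = {"additive": additive, "mean": mean_effect, "strongest": strongest}


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    n = len(predicted)
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def score_models(observations):
    """Return model name -> RMSE.

    observations: list of (effect_a, effect_b, measured_pair_effect).
    """
    measured = [obs[2] for obs in observations]
    return {
        name: rmse([model(a, b) for a, b, _ in observations], measured)
        for name, model in MODELS.items()
    }


# Toy observations (invented) in which the stronger effect dominates.
observations = [(-1.0, -0.5, -1.0), (-0.2, -0.8, -0.8), (0.3, 0.1, 0.3)]
scores = score_models(observations)
best = min(scores, key=scores.get)
```

Scoring the models per focal species, rather than only across the pooled data, is what distinguishes the averaged result from the species-specific comparison raised in the criticism that follows.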

      My main criticism is that the main take-home from the paper (fig 3B)-that the strongest effect is the best predictor-is oversold. While it is true that, averaged over their six focal species, the "strongest effect" was the best overall predictor, when one looks at the species-specific data (S9), we see that it is not the best predictor for 1/3 of their focal species, and this fraction grows to 1/2 if one considers a difference in nRMSE of 0.01 to be negligible.

      As suggested, we have softened our language regarding the take-home message. This matter is addressed in detail above in response to 'Essential Revisions'. Briefly, we see that the strongest model works best when both single species have qualitatively similar effects, but is slightly less accurate when effects are mixed. We also see overall less accurate predictions for positive effects. In light of these findings, we propose that, where the strongest model is not the most accurate for a focal species, this is due to the interaction types affecting that species and is not specific to the focal species itself.

      We made substantial changes to the manuscript, including the first paragraph of the discussion which more accurately describes these findings and emphasizes the relevant caveats:

      "By measuring thousands of simplified microbial communities, we quantified the effects of single species, pairs, and trios on multiple focal species. The most accurate model, overall and specifically when both single species effects were negative, was the strongest effect model. This is in stark contrast to models often used in antibiotic compound combinations, despite most effects being negative, where additivity is often the default model (Bollenbach 2015). The additive model performed well for mixed effects (i.e. one negative and one positive), but only slightly better than the strongest model, and poorly when both species had effects of the same sign. When both single species’ effects were positive, the strongest model was also the best, though the difference was less pronounced and all models performed worse for these interactions. This may be due to the small effect size seen with positive effects, as when we limited negative and mixed effects to a similar range of effects strength, their accuracy dropped to similar values (Figure 3–Figure supplement 5). We posit that the difference in accuracy across species is affected mainly by the effect type dominating different focal species' interactions, rather than by inherent species traits (Figure 3–Figure supplement 6)." (Lines 288-304)

      The same criticism applies to the result from figure 2-that pairs of affecting species have more negative effects than single species. Considered across all focal species this is true (though minor in effect size, Fig 2A). But there is only a significant effect within two individual species. Again, this points to the effects being focal-species-specific, and perhaps not as generalizable as is currently being claimed.

      Upon more rigorous analysis, and with regard to changes in the dataset after filtering, we see that the more accurate statement is that effects become stronger, not necessarily more negative (in line with the accuracy of the strongest model). The overall trend is towards more negative interactions, due to the majority of interactions being negative, but as stated this is not true for each individual focal. As such the following sentence in the manuscript has been changed:

      "The median effect on each focal was more negative by 0.28 on average, though the difference was not significant in all cases; additionally, focals with mostly positive single species interactions showed a small increase in median effect (Fig. 2D)" (Lines 151-154)

      As well as the title of this section: "Joint effects of species pairs tend to be stronger than those of individual affecting species" (Lines 127-128)

      Another thing that points to a focal-species-specific response is Fig 2D, which shows the distributions of responses of each focal species to pairs. Two of these distributions are unimodal, one appears bimodal, and three appear tri-modal. This suggests to me that the focal species respond in categorically different ways to species addition.

      We believe this distribution of pair effects is related to the distribution of single species effects, and not to the way in which different focal species respond to the addition of second species. Though this may be difficult to see from the swarm plots shown in the paper, below is a split violin plot that emphasizes this point.

      Fig R1: Distribution of single species and pair effects. Distribution of the effect of single and pairs of affecting species for each focal species individually. Dashed lines represent the median, while dotted lines the interquartile range.

      These differences occur even though the focal bacteria are all from the same family. This suggests to me that the generalizability may be even less when a more phylogenetically dispersed set of focal species are used.

      We have added the following sentence to the discussion explicitly emphasizing the phylogenetic limitations of our study:

      "Lastly, it is important to note that our focal species are all from the same order (Enterobacterales), which may also limit the purview of our findings." (Lines 364-366)

      Considering these points together, I argue that the conclusion should be shifted from "strongest effect is the best" to "in 3 of our focal species, strongest effect was the best, but this was not universal, and with only 6 focal species, we can't know if it will always be the best across a set of focal species".

      As mentioned above, we have softened our language regarding the take-home message in response to these evaluations.

      My second main criticism is that it is hard to understand exactly how the trait data were used to predict effects. It seems like it was just pearson correlation coefficients between interspecies niche distances (or antibiotic distances) and the effect. I'm not very surprised these correlations were unpredictive, because the underlying measurements don't seem to be relevant to the environment tested. What if, rather than using niche data across 20 nutrients, only the growth data on glucose (the carbon source in the experiments) was used? I understand that in a field experiment, for example, one might not know what resources are available, and so measuring niche across 20 resources may be the best thing to do. Here though it seems imperative to test using the most relevant data.

      It is true that much of the profiling data is not directly related to the experimental conditions (different carbon sources and antibiotics), but in addition to these we do use measurements from experiments carried out in the same environment as the interaction assays (i.e. growth rate and carrying capacity when growing on glucose), which also showed poor correlation with the effects on focals. Additionally, we believe that these profiles contain relevant information regarding metabolic similarity between species (similar to metabolic models often constructed computationally). To improve clarity, we added the following sentence to the figure legend of Figure 3–Figure supplement 1:

      "The growth rate, and maximum OD shown in panel A were measured only in M9 glucose, similar to conditions used in the interaction assays." (Lines 591-592)

      Additionally and relatedly, it would be valuable to show the scatterplots leading to the conclusion that trait data were uninformative. Pearson's r only works on an assumption of linearity. But there could be strong relationships between the trait data and effect that are monotonic but not linear, or even that are non-monotonic yet still strong (e.g. U-shaped). For the first case, I recommend switching to Spearman's rho over Pearson's r, because it only assumes monotonicity, not linearity. If there are observable relationships that are not monotonic, a different test should be used.

      Per your suggestion, we have changed the measurement of correlation in this analysis from Pearson's r, to Spearman's rho. As we observed similar, and still mostly weak correlations, we did not investigate these relationships further. See Figure 3–Figure supplement 1.
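
The difference between the two correlation measures can be shown with a short, self-contained sketch. Spearman's rho is computed here as Pearson's r on average ranks; the function names and the toy vectors are hypothetical and unrelated to the trait data in the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Monotonic but non-linear relationship (toy values).
x = [1, 2, 3, 4, 5]
y_cubic = [v ** 3 for v in x]
# Non-monotonic, U-shaped relationship (toy values).
x_centered = [-2, -1, 0, 1, 2]
y_u_shaped = [v ** 2 for v in x_centered]
```

For the cubic toy data, Spearman's rho is 1 while Pearson's r is below 1, which is the monotonic case the reviewer describes. For the U-shaped toy data both measures are 0 despite a strong dependence, so neither statistic would detect that kind of relationship without inspecting the scatterplots.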

      Additionally, we generated heat maps including scatterplots mapping the data leading to these correlations. We found no notable dependency in these plots, and visually they were quite crowded and difficult to interpret. As this is not the central point of our study, we ultimately decided against adding this information to the plots.

      In general, I think the analyses using the trait data were too simplistic to conclude that the trait data are not predictive.

      We agree that more sophisticated analyses may help connect species traits to their effects on focal species. In fact, other members of our research group have recently used machine learning to accomplish similar predictions (https://doi.org/10.1101/2022.08.02.502471). As such, we have changed the wording to reflect that this correlation is difficult to find using simple analyses:

      "These results indicate that it may be challenging to connect the effects of single and pairs of species on a focal strain to a specific trait of the involved strains, using simple analysis." (Lines 157-159)

    1. Author Response

      Reviewer #1 (Public Review):

      Slusarczyk et al present a very well written manuscript focused on understanding the mechanisms underlying aging of erythrophagocytic macrophages in the spleen (RPM) and its relationship to iron loading with age. The manuscript is diffuse with a broad swath of data elements. Importantly, the manuscript demonstrates that RPM erythrophagocytic capacity is diminished with age, restored in iron restricted diet fed aged mice. In addition, the mechanism for declining RPM erythrophagocytic capacity appears to be ferroptosis-mediated, insensitive to heme as it is to iron, and occur independently of ROS generation. These are compelling findings. However, some of the data relies on conjecture for conclusion and a clear causal association is not clear. The main conclusion of the manuscript points to the accumulation of unavailable insoluble forms of iron as both causing and resulting from decreased RPM erythrophagocytic capacity.

      We are proposing that intracellular iron accumulation progresses first and leads to global proteotoxic damage and increased lipid peroxidation. This eventually triggers the death of a fraction of aging RPMs, thus promoting the formation of extracellular iron-rich protein aggregates. More explanation can be found below. Besides, iron loading suppresses the erythrophagocytic activity of RPMs, hence further contributing to their functional impairment during aging.

      In addition, the finding that IR diet leads to increased TF saturation in aged mice is surprising.

      We believe that this observation implies better mobilization of splenic iron stores, and corroborates our conclusion that mice that age on an iron-reduced diet benefit from higher iron bioavailability, although these differences are relatively mild. More explanation can be found in our replies to Reviewer #2.

      Furthermore, whether the finding in RPMs is intrinsic or related to RBC-related changes with aging is not addressed.

      We now addressed this issue and we characterized in more detail both iron and ROS levels in RBCs.

      Finally, these findings in a single strain and only female mice are intriguing but warrant tempered conclusions.

      We tempered the conclusions and provided a basic characterization of the RPM aging phenotype in Balb/c female mice.

      Major points:

      1) The main concern is that there is no clear explanation of why iron increases during aging although the authors appear to be saying that iron accumulation is both the cause of and a consequence of decreased RPM erythrophagocytic capacity. This requires more clarification of the main hypothesis on Page 4, line 17-18.

      We thank the reviewer for this comment. It was previously reported that iron accumulates substantially in the spleen during aging, especially in female mice (Altamura et al., 2014). Since RPMs are those cells that process most of the iron in the spleen, we aimed to explore what is the relationship between iron accumulation and RPM functions during aging. This investigation led us to uncover that indeed iron accumulation is both the cause and the consequence of RPM dysfunction. Specifically, we propose that intracellular iron loading of RPMs precedes extracellular deposition of iron in a form of protein-rich aggregates, driven by RPMs damage. To support this, we now show that the proteome of RPMs overlaps with those proteins that are present in the age-triggered aggregates (Fig. 3F). Furthermore, corroborating our model, we now demonstrate that transient iron loading of RPMs via iron-dextran injection (new Fig. 3G) leads to the formation of protein-rich aggregates, closely resembling those present in aged spleens (new Fig. 3H). This implies that high iron content in RPMs is indeed a major driving factor that leads to aggregation of their proteome and cell damage. Importantly, we now supported this model with studies using iRPMs. We demonstrated that iron loading and blockage of ferroportin by synthetic mini-hepcidin (PR73)(Stefanova et al., 2018) cause protein aggregation in iRPMs and lead to their decreased viability only in cells that were exposed to heat shock, a well-established trigger of proteotoxicity (new Fig. 5K and L). We propose that these two factors, namely age-triggered decrease in protein homeostasis and exposure to excessive iron levels, act in concert and render RPMs particularly sensitive to damage during aging (see also Discussion, p. 16).

      In parallel, our data imply that the increased iron content in aged RPMs drives their decreased erythrophagocytic activity, as we now better documented by more extensive in vitro experiments in iRPMs (new Fig 6E-H). We cannot exclude that some of the senescent splenic RBCs that are retained in the red pulp and evade erythrophagocytosis due to RPM defects in aging, may also contribute to the formation of the aggregates. This is supported by the fact that mice that lack RPMs as well exhibit iron loading in the spleen (Kohyama et al., 2009; Okreglicka et al., 2021), and that the proteome of aggregates overlaps to some extent with the proteome of erythrocytes (new Fig. 3F).

      We believe that during aging intracellular iron accumulation is chiefly driven by ferroportin downregulation, as also suggested by Reviewer #3. We now show that ferroportin drops significantly already in mice aged 4 and 5 months (new Fig. 4H), preceding most of the other impairments. This drop coincides with the increase in hepcidin expression, but whether this is the sole reason for ferroportin suppression during early aging would require further investigation, which is outside the scope of the present manuscript.

      In sum, to address this comment, we now modified the fragment of the introduction that refers to our hypothesis and major findings to be more clear (p. 4), we improved our manuscript by providing new data mentioned above and we added more explanation in the corresponding sections of the Results and Discussion.

      2) It is unclear if RPMs are in limited supply. Based on the introduction (page 4, line 13-15), they have limited self-renewal capacity and blood monocytes only partially replenished. Fig 4D suggests that there is a decrease in RPMs from aged mice. The %RPM from CD45+ compartment suggests that there may just be relatively more neutrophils or fewer monocytes recruited. There is not enough clarity on the meaning of this data point.

      Thank you for this comment. We fully agree that %RPMs of CD45+ splenocytes, although well-accepted in literature (Kohyama et al., 2009; Okreglicka et al., 2021), is only a relative number. Hence, we now included additional data and explanations regarding the loss of RPMs during aging.

      It was reported that the proportion of RPMs derived from bone marrow monocytes increases mildly but progressively during aging (Liu et al., 2019). This implies that due to the loss of the total RPM population, as illustrated by our data, the cells of embryonic origin are likely even more affected. We could confirm this assumption by re-analysis of the data from Liu et al. that we now included in the manuscript as Fig. 5E. These data clearly show that the representation of embryonically-derived RPMs drops more drastically than the percent of total RPMs, whereas the replenishment rate from monocytes is not affected significantly during aging. Consistent with this, we have not observed any robust change in the population of monocytes (F4/80-low, CD11b-high) or pre-RPMs (F4/80-high, CD11b-high) in the spleen at the age of 10 months (Figure 5-figure supplement 2A and B). We also have detected a mild decrease, not an increase, in the number of granulocytes (new Figure 5-figure supplement 2C). Furthermore, we measured in situ apoptosis marker and found a clear sign of apoptosis in the aged spleen (especially in the red pulp area), a phenotype that is less pronounced in mice on an IR diet (new Fig. 5O). This is consistent with the observation that apoptosis markers can be elevated in tissues upon ferroptosis induction (Friedmann Angeli et al., 2014) and that the proteotoxic stress in aged RPMs, which we now emphasized better in our manuscript, may also lead to apoptosis (Brancolini & Iuliano, 2020). Taken together, we strongly believe that the functional defect of embryonically-derived RPMs chiefly contributes to their shortage during aging.

      3) Anemia of aging is complex and poorly understood mechanistically. In general, it is considered similar to anemia of chronic inflammation with increased Epo, mild drop in Hb, and erythroid expansion, similar to ineffective erythropoiesis / low Epo responsiveness. It is not surprising that IR diet did not impact this mild anemia. However, was the MCV or MCH altered in aged and IR aged mice?

      We now included the data for hematocrit, RBC counts, MCV, and MCH in Figure 1-figure supplement 5. Hematocrit shows a similar tendency as hemoglobin levels, but the values for RBC counts, MCV, and MCH seem not to be altered. We also show now that the erythropoietic activity in the bone marrow is not affected in aged versus young mice. Taken together, the anemic phenotype in female C57BL/6J mice at this age is very mild, which we emphasized in the main text, and is likely affected by other factors than serum iron levels (p. 6).

      4) Page 6, line 23 onward: the conclusion is that KC compensate for the decreased function of RPM in the spleen, based on the expansion of KC fraction in the liver. Is there evidence that KCs are engaged in more erythrophagocytosis in aged mice? Furthermore, iron accumulation in the liver with age does not demonstrate specifically enhanced erythrophagocytosis of KC. Please clarify why liver iron accumulation would not be simply a consequence of increased parenchymal iron similar to increased splenic iron with age, independent of erythrophagocytic activity in resident macrophages in either organ.

      Thanks for these questions. For the quantification of the erythrophagocytosis rate in KCs, we show, as for the RPMs (Fig. 1K), the % of PKH67-positive macrophages following transfusion of PKH67-stained stressed RBCs (Fig. 1M). The data imply a mild (not statistically significant) drop (of approx. 30%) in EP activity. We believe that it is overridden by a more pronounced (on average, 2-fold) increase in the representation of KCs (Fig. 1N). The mechanisms of iron accumulation in the spleen and the liver are very different. In the liver, we observed iron deposition in the parenchymal cells (not non-parenchymal, new Fig. 1P) that we are currently characterizing in more detail in a parallel manuscript. Our data demonstrate a drop in transferrin saturation in aged mice. Hence, it is highly unlikely that aging would be hallmarked by the presence of circulating non-transferrin-bound iron that would be sequestered by hepatocytes, as shown previously (Jenkitkasemwong et al., 2015). Thus, the iron released locally by KCs is the most likely contributor to progressive hepatocytic iron loading during aging. The mechanism of iron delivery to hepatocytes from erythrophagocytosing KCs was demonstrated by Theurl et al. (2016), and we propose that it may be operational, although on a much more prolonged time scale, during aging. We now discuss this part in more detail in the Results section (p. 7).

      5) Unclear whether the effect on RPMs is intrinsic or extrinsic. Would be helpful to evaluate aged iRPMs using young RBC vs. young iRPMs using old RBCs.

      We are skeptical that generating iRPMs from aged mice would be helpful – these cells are a specific type of primary macrophage culture, derived from bone marrow monocytes with MCSF1, and exposed additionally to heme and IL-33 for 4 days. We do not expect that bone marrow monocytes are heavily affected by aging, or that they would recapitulate aspects of aged RPMs from the spleen, especially after 8 days of in vitro culture. However, to address the concerns of the reviewer, we now provide additional data regarding RBC fitness. Consistent with the life-span experiment (Fig. 2A), we show that oxidative stress is increased only in splenic, but not circulating, RBCs (new Fig. 2C, replacing the old Fig. 2B and C). In addition, we show no signs of age-triggered iron loading in RBCs, either in the spleen (new Fig. 2F) or in the circulation (new Fig. 2B). Hence, we do not envision a possibility that RPMs become iron-loaded during aging as a result of erythrophagocytosis of iron-loaded RBCs. In support of this, we have also observed that during aging RPMs’ FPN levels drop first, the erythrophagocytosis rate decreases afterward, and lastly, RBCs start to exhibit significantly increased oxidative stress (presented now in new Fig. 4H, J and K).

      6) Discussion of aggregates in the spleen of aged mice (Fig 2G-2K and Fig 3) is very descriptive and non-specific. For example, if the iron-rich aggregates are hemosiderin, a hemosiderin-specific stain would be helpful. This data specifically is correlatory and difficult to extract value from.

      Thanks for these comments. To the best of our knowledge Prussian blue Perls’ staining (Fig. 2J) is considered a hemosiderin staining. Our investigations aimed to better understand the nature and the origin of splenic iron deposits that to some extent are referred to as hemosiderin. Most importantly, as mentioned in our reply R1 Ad. 1. to assign causality to our data, we now demonstrated that iron accumulation in RPMs in response to iron-dextran (Fig. 3G) increases lipid peroxidation (Fig. 5F), tends to provoke RPMs depletion (Fig. 5G) and triggers the formation of protein-rich aggregates (new Fig. 3H). Of note, we assume that the loss of embryonically-derived RPMs in this model may be masked by simultaneous replenishment of the niche from monocytes, a phenomenon that may be addressed by future studies using Ms4a3-driven reporter mice (as shown for aged mice in our new Fig. 5E).

      7) The aging phenotype in RPMs appears to be initiated sometime after 2 months of age. However, there is some reversal of the phenotype with increasing age, e.g. Fig 4B with decreased lipid peroxidation in 9 month old relative to 6 month old RPMs. What does this mean? Why is there a partial spontaneous normalization?

      Thanks for this comment and questions. Indeed, the degree of lipid peroxidation exhibits some kinetics, suggestive of partial normalization. Of note, such a tendency is not evident for other aging phenotypes of RPMs, hence, we did not emphasize this in the original manuscript. However, in a revised version of the manuscript, we now present the re-analysis of the published data which implies that the number of embryonically-derived RPMs drops substantially between mice at 20 weeks and 36 weeks (new Fig. 5E). We think that the higher proportion of monocyte-derived RPMs in total RPM population later in aging (9 months) might be responsible for the partial alleviation of lipid peroxidation. We now discussed this possibility in the Results sections (p. 12).

      8) Does the aging phenotype in RPMs respond to ferristatin? It appears that NAC, which is a glutathione generator and can reverse ferroptosis, does not reverse the decreased RPM erythrophagocytic capacity observed with age yet the authors still propose that ferroptosis is involved. A response to ferristatin is a standard and acceptable approach to evaluating ferroptosis.

      We fully agree with the Reviewer that using ferristatin or Liproxstatin-1 would be very helpful to fully characterize a mechanism of RPMs depletion in mice. However, previous in vivo studies involving Liproxstatin-1 administration required daily injections of this ferroptosis inhibitor (Friedmann Angeli et al., 2014). This would be hardly feasible during aging. Regarding the experiments involving iron-dextran injection, using Liproxstatin-1 would require additional permission from the ethical committee which takes time to be processed and received. However, to address this question we now provide data from iRPMs cell cultures (new Fig.5 K-L). In essence, our results imply that both proteotoxic stress and iron overload act in concert to trigger cytotoxicity in RPM in vitro model. Interestingly, this phenomenon does not depend solely on the increased lipid peroxidation, but when we neutralize the latter with Liproxstatin-1, the cytotoxic effect is diminished (please, see also Results on p. 13 and Discussion p. 15/16).

      9) The possible central role for HO-1 in the pathophysiology of decreased RPM erythrophagocytic capacity with age is interesting. However, it is not clear how the authors arrived at this hypothesis and would be useful to evaluate in the least whether RBCs in young vs. aged mice have more hemoglobin as these changes may be primary drivers of how much HO-1 is needed during erythrophagocytosis.

      Thanks for this comment. We became interested in HO-1 levels based on the RNA sequencing data, which detected lower Hmox-1 expression in aged RPMs (Figure 3-figure supplement 1). We now show that the content of hemoglobin is not significantly altered in aged RBCs (MCH parameter, Figure 1-figure supplement 5E), hence we do not think that this is the major driver for Hmox-1 downregulation. Likewise, the levels of the Bach1 message, a gene encoding the Hmox-1 transcriptional repressor, are not significantly altered according to RNAseq data. Hence, the reason for the transcriptional downregulation of Hmox-1 is not clear. Of note, HO-1 protein levels in the total spleen are higher in aged versus young mice, and we also detected a clear appearance of its nuclear truncated and enzymatically-inactive form (see the figure below; we opt not to include this in the manuscript for better clarity). The appearance of truncated HO-1 seems to be partially rescued by the IR diet. It is well established that the nuclear form of HO-1 emerges via proteolytic cleavage and migrates to the nucleus under conditions of oxidative stress (Mascaro et al., 2021). This additionally confirms that the aging spleen is hallmarked by an increased burden of ROS. Moreover, we also detected HO-1 as one of the components of the protein iron-rich aggregates. Thus, we propose that the low levels of the cytoplasmic enzymatically active form of HO-1 in RPMs (that we preferentially detect with our intracellular staining and flow cytometry) may be underlain by its nuclear translocation and sequestration in protein aggregates that evade antibody binding [this is also supported by our observation that the protein aggregates, despite the high content of ferritin (as indicated by MS analysis), are negative for L-ferritin staining]. Of note, we also cannot exclude that other cell types in the aging spleen (e.g. lymphocytes) express higher levels of HO-1 in response to splenic oxidative stress.

      Fig. Total splenic levels of HO-1 in young, aged IR and aged mice.

      Reviewer #2 (Public Review):

      Slusarczyk et al. investigate the functional impairment of red pulp macrophages (RPMs) during aging. When red blood cells (RBCs) become senescent, they are recycled by RPMs via erythrophagocytosis (EP). This leads to an increase in intracellular heme and iron both of which are cytotoxic. The authors hypothesize that the continuous processing of iron by RPMs could alter their functions in an age-dependent manner. The authors used a wide variety of models: in vivo model using female mice with standard (200ppm) and restricted (25ppm) iron diet, ex vivo model using EP with splenocytes, and in vitro model with EP using iRPMs. The authors found iron accumulation in organs but markers for serum iron deficiency. They show that during aging, RPMs have a higher labile iron pool (LIP), decreased lysosomal activity with a concomitant reduction in EP. Furthermore, aging RPMs undergo ferroptosis resulting in a non-bioavailable iron deposition as intra and extracellular aggregates. Aged mice fed with an iron restricted diet restore most of the iron-recycling capacity of RPMs even though the mild-anemia remains unchanged.

      Overall, I find the manuscript to be of significant potential interest. But there are important discrepancies that need to be first resolved. The proposed model is that during aging both EP and HO-1 expression decreases in RPMs but iron and ferroportin levels are elevated. In their model, the authors show intracellular iron-rich proteinaceous aggregates. But if HO-1 levels decrease, intracellular heme levels should increase. If Fpn levels increase, intracellular iron levels should decrease. How does LIP stay high in RPMs under these conditions? I find these to be major conflicting questions in the model.

      We thank the Reviewer for her/his valuable feedback. As we mentioned in our replies, we can only assume that a small misunderstanding in the interpretation of the presented data underlies this comment. We show that ferroportin levels in RPMs (Fig. 1F) are modulated in a manner that fully reflects the iron status of these cells (both labile and total iron levels, Figs. 1H and I). FPN levels drop in aged RPMs and are rescued when mice are maintained on a reduced iron diet. As pointed out by Reviewer #3, and explained in our replies, we believe that ferroportin levels are critical for the observed phenotypes in aging. We now describe our data more clearly to avoid any potential misinterpretation (p. 6).

      Reviewer #3 (Public Review):

      This is a comprehensive study of the effects of aging on the function of red pulp macrophages (RPM) involved in iron recycling from erythrocytes. The authors document that insoluble iron accumulates in the spleen, that RPM become functionally impaired, and that these effects can be ameliorated by an iron-restricted diet. The study is well written, carefully done, extensively documented, and its conclusions are well supported. It is a useful and important addition for at least three distinct fields: aging, iron and macrophage biology.

      The authors do not explain why an iron-restricted diet has such a strong beneficial effect on RPM aging. This is not at all obvious. I assume that the number of erythrocytes that are recycled in the spleen, and are by far the largest source of splenic iron, is not changed much by iron restriction. Is the iron retention time in macrophages changed by the diet, i.e. the recycled iron is retained for a short time when diet is iron-restricted (making hepcidin low and ferroportin high), and long time when iron is sufficient (making hepcidin high and ferroportin low)? Longer iron retention could increase damage and account for the effect. Possibly, macrophages may not empty completely of iron before having to ingest another senescent erythrocyte, and so gradually accumulate iron.

      We are very grateful to this Reviewer for emphasizing the importance of the iron export capacity of RPMs as a possible driver of the observed phenotypes. Indeed, as mentioned above, we now show in the revised version of the manuscript that ferroportin drops early during aging (revised Fig. 4). Importantly, we now also observed that iron loading and limitation of iron export from iRPMs via ferroportin aggravate the impact of heat shock (a well-accepted trigger of proteotoxicity) on both protein aggregation and cell viability (new Fig. 5K and L). Physiologically, recent findings show that aging promotes a global decrease in protein solubility [BioRxiv manuscript (Sui X. et al., 2022)], and it is very likely that the constant exposure of RPMs to high iron fluxes renders these specialized cells particularly sensitive to proteome instability. This could be further aggravated by a build-up of iron due to the drop of ferroportin early during aging, ultimately leading to the appearance of the protein aggregates as early as 5 months of age in C57BL/6J females. Based on the new data, we emphasized this model in the revised version of the manuscript (please see Discussion on p. 16).

    1. Author Response

      Reviewer #1 (Public Review):

      1) It would be helpful to include some sort of comparison in Fig. 4, e.g. the regressions shown in Fig 3, to indicate to what extent the ICCl data corresponds to the "control range" of frequency tuning.

      Figure 4 was modified to show the frequency range typically found in the ICCls. This range is based on results from Wagner et al., 2007, which extensively surveyed ICCls responses. This modification shows that our ICCls recordings in the ruff-removed owls cover the normal frequency hearing range of the owl.

      2) A central hypothesis of the study is that the frequency preference of the high-frequency neurons is lower in ruff-removed owls because of the lowered reliability caused by a lack of the ruff. Yet, while lower, the frequency range of many neurons in juvenile and ruff-removed owls seems sufficiently high to be still responsive at 7-8 kHz. I think it would be important to know to what extent neurons are still ITD sensitive at the "unreliable high frequencies" even if the CFs are lower since the "optimization" according to reliability depends not on the best frequency of each neuron per se, but whether neurons are less ITD sensitive at the higher, less reliable frequencies.

      The concern regarding the frequency range that elicits responsivity was largely addressed above. Specifically, Figure L1 showing frequency tuning of frontally tuned ICx neurons in ruff-removed owls indicates that while there is some variability of tuning across neurons, there is little responsivity above 6 kHz. In contrast, equivalent analysis in juvenile owls (Figure L3), shows there is much more responsiveness and variability across neurons to high and low frequencies. This evidence supports our hypothesis that the juvenile owl brain is still highly plastic, which facilitates learning during development. Although the underlying data was already reported in Figure 7 of our previously submitted manuscript, we can include Figures L1 and L2, potentially as supplemental figures, if considered useful by editors and reviewers. Nevertheless, this argumentation was further expanded in the revised text (Line 229).

      Figure L1. Frequency tuning of frontally-tuned ICx neurons in ruff-removed owls. Tuning curves are normalized by the max response. Thick black line indicates the average tuning curve. Dashed black line indicates basal response.

      Figure L2. ITD sensitivity across frequencies in ruff-removed owl. Two example neurons shown in a and b. ITD tuning for tones (colored) and broadband (black) plotted by firing rate (non-normalized). Solid colored lines indicate responses to frequencies that are within the neuron’s preferred frequency range (i.e. above the half-height, see Methods), dashed lines indicate frequencies outside of the neuron’s frequency range.

      Figure L3. Frequency tuning of frontally-tuned ICx neurons in juvenile owls. Tuning curves are normalized by the max response. Thick black line indicates the average tuning curve. Dashed black line indicates basal response.

      3) It would be interesting to have an estimate of the time scale of experience dependency that induces tuning changes. Do the authors have any data on this question? I appreciate the authors' notion that the quantifications in Fig 7 might indicate that juvenile owls are already "beginning to be shaped by ITD reliability" (line 323 in Discussion). How many days after hearing onset would this correspond to? Does this mean that a few days will already induce changes?

      While tracking changes induced by ruff-removal over development was outside the scope of this study, many other studies have assessed experience-dependent plasticity in the barn owl. The recordings in this study were performed approximately 20 days after hearing onset, suggesting that the juveniles had ample time to begin learning. These points were expanded upon in the discussion (Lines 254, 280-283).

      Reviewer #2 (Public Review):

      1) Why is IPD variability plotted instead of ITD variability (or indeed spatial reliability)? The relationship between these measures is likely to vary across frequency, which makes it difficult to compare ITD variability across frequency when IPDs are plotted. Normalizing data across frequencies also makes it difficult to compare different locations and acoustical conditions. For example, in Fig.1a and Fig.1b, the data shown for 3 kHz at ~160 degrees seems quantitatively and visually quite different, but the difference (in Fig.1c) appears to be negligible.

      Justification for using IPD variability as an estimate of ITD variability was added to the introduction (Lines 55-60), results (Line 100) and methods (Lines 371-374) sections of the manuscript. Because ITD detection is based on phase locking by auditory nerve and ITD detector neurons tuned to narrow frequency bands, the responses that ITD detector neurons forward to downstream midbrain regions are determined by IPD variability. Additionally, ITD is calculated by dividing IPD by frequency, which makes comparisons of ITD reliability across frequency mathematically uninformative.
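
      The frequency scaling mentioned above can be illustrated with a minimal sketch. It is not the analysis code used in the study; the function name `ipd_to_itd_us` and the example phase spread are hypothetical, and IPD is assumed to be expressed in radians, so the conversion is ITD = IPD / (2πf). The same phase variability maps onto very different time variability depending on frequency.

```python
import math

def ipd_to_itd_us(ipd_rad, freq_hz):
    """Convert an interaural phase difference (radians) at a given
    frequency (Hz) into an interaural time difference (microseconds)."""
    return ipd_rad / (2 * math.pi * freq_hz) * 1e6

# Hypothetical, identical phase spread applied at three frequencies.
ipd_spread = math.pi / 8
for freq in (1000, 4000, 8000):
    print(freq, "Hz ->", round(ipd_to_itd_us(ipd_spread, freq), 2), "us")
```

      Because the ITD equivalent of a fixed phase spread shrinks in proportion to 1/f, a raw comparison of ITD spread across frequency bands would mostly reflect this scaling rather than differences in cue reliability.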

      2) How well do the measures of ITD reliability used reflect real-world listening? For example, the model used to calculate ITD reliability appears to assume the same (flat) spectral profile for targets and distractors, which are presented simultaneously with the same temporal envelope, and a uniform spatial distribution of sounds across space. It is therefore unclear how robust the study's results are to violations of these assumptions.

      While we agree that our analysis cannot completely capture real-world listening for the barn owl, a general analysis using similar flat spectral profiles for targets and concurrent sounds provides a broad assessment of reliability of ITD cues. While a full recapitulation of real-world listening is beyond the scope of this study (i.e. recording natural scenes from the ear canals of wild barn owls), we included additional analyses of ITD reliability in Figure 1-figure supplement 1, described above.

      3) Does facial ruff removal produce an isolated effect on ITD variability or does it also produce changes in directional gain, and the relationship between spatial cues and sound location? Although the study considers this issue in some places (e.g. Fig.2, Fig.5), a clearer presentation of the acoustical effects of facial ruff removal and their implications (for all locations, not just those to the front), as well as an attempt to understand how these acoustical changes lead to the observed changes in ITD reliability, would greatly strengthen the study. In addition, Fig.1 shows average ITD reliability across owls, but it would be helpful to know how consistent these measures are across owls, given individual variability in Head-Related Transfer Functions (HRTFs). This potentially has implications for the electrophysiological experiments, if the HRTFs of those animals were not measured. One specific question that is potentially very relevant is whether the facial ruff attenuates sounds presented behind the animal and whether it does so in a frequency-dependent way. In addition, if facial ruff removal enables ILDs to be used for azimuth, then ITDs may also become less necessary at higher frequencies, even if their reliability remains unchanged.

      Additional analysis was conducted to characterize the changes in directional gain induced by ruff removal, and the results were added as a new figure (Fig 5). This analysis shows that changes in gain following ruff-removal are largely frequency-independent: there is a de-attenuation of peripherally and rearwardly located sounds, but the highest gain remains for high frequencies in frontal space. Although there is an additional increase in gain for high frequencies from rearward space, these changes would not explain the changes in frequency tuning we report. As mentioned in new additions to the manuscript, the changes at the most rearward-located auditory spatial locations are unlikely to have an effect on the auditory midbrain. No studies in the barn owl have found neurons in the ICx or optic tectum tuned to >120° (Knudsen, 1982; Knudsen, 1984; Cazettes et al., 2014). In addition, variability of IPD reliability across owls was analyzed and reported in the amended Figure 1, which shows very little change across owls. During this analysis, we realized that the file of one of the HRTFs obtained from von Campenhausen et al. 2006 was mislabeled, which explains slight differences in revised Fig 1b. Nevertheless, the added analysis of IPD reliability across owls indicates that the pattern in ITD reliability is stable across owls (Fig. 1d,e), which supports our decision to not record HRTFs from owls used in this study. Finally, we added text to the discussion clarifying that the use of ILD for azimuth would not provide the same resolution as ITD would (Lines 295-303). We also do not believe that the use of ILD for azimuth would make “ITDs… less necessary at higher frequencies”, given that the ICCls is still computing ITD at these high frequencies (Fig 4), and that ILDs also have higher resolution at higher frequencies, with and without the facial ruff (Olsen et al, 1989; Keller et al., 1998; von Campenhausen et al., 2006).

      1) It is unclear why some analyses (Fig.5, Fig.7) are focused on frontal locations and frontally-tuned neurons. It is also unclear why neurons with a best ITDs of 0 are described as frontally tuned since locations behind the animal produce an ITD of 0 also. Related to this, in Fig.1, facial ruff removal appears to reduce IPD variability at low frequencies for locations to the rear (~160 degrees), where the ITD is likely to be close to 0. Neurons with a best ITD of 0 might therefore be expected to adjust their frequency tuning in opposite directions depending on whether they are tuned to frontal or rearward locations.

      An extensive explanation was added to the methods detailing why we do not believe the neurons recorded in this study are tuned to the rear. Namely, studies mapping the barn owl’s ICx and optic tectum have not reported neurons tuned to locations >120°, with the number of neurons representing a given spatial location decreasing with eccentricity (Knudsen, 1982; Knudsen, 1984; Cazettes et al., 2014). While we agree that there does seem to be a change in ITD reliability at ~160° following ruff-removal, the result is largely similar to the change that occurs in frontal space (Fig 1b), which is consistent with the ruff-removed head functioning as a sphere. Thus, we wouldn’t expect rearwardly-tuned neurons, if they could be readily found, to adjust their frequency tuning to higher frequencies. Finally, we want to clarify that we focused our analyses on frontally-tuned neurons because frontal space is where we observed the largest change in ITD reliability. Text was added to the Discussion section to clarify this point (Lines 313-321).

      2) The study suggests that information about high-frequency ITDs is not passed on to the ICX if the ICX does not contain neurons that have a high best frequency. However, neurons might be sensitive to ITDs at frequencies other than the best frequency, particularly if their frequency tuning is broader. It is also unclear whether the best frequency of a neuron always corresponds to the frequency that provides the most reliable ITD information, which the study implicitly assumes.

      The concern about ITD sensitivity at non-preferred frequencies was addressed under the essential revision #3, as well as under Reviewer 1’s concerns.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript reports a systematic study of the cortical propagation patterns of human beta bursts (~13-35Hz) generated around simple finger movements (index and middle finger button presses).

      The authors deployed a sophisticated and original methodology to measure the anatomical and dynamical characteristics of the cortical propagation of these transient events. MEG data from another study (visual discrimination task) was repurposed for the present investigation. The data sample is small (8 participants). However, beta bursts were extracted over a +/- 2s time window about each button press, from single trials, yielding the detection and analysis of hundreds of such events of interest. The main finding consists of the demonstration that the cortical activity at the source of movement related beta bursts follows two main propagation patterns: one along an anteroposterior direction (predominantly originating from pre central motor regions), and the other along a medio-lateral (i.e., dorso lateral) direction (predominantly originating from post central sensory regions). Some differences are reported, post-hoc, in terms of amplitude/cortical spread/propagation velocity between pre and post-movement beta bursts. Several control tests are conducted to ascertain the veracity of those findings, accounting for expected variations of signal-to-noise ratio across participants and sessions, cortical mesh characteristics and signal leakage expected from MEG source imaging.

      One major perceived weakness is the purely descriptive nature of the reported findings: no meaningful difference was found between bursts traveling along the two different principal modes of propagation, and importantly, no relation with behavior (response time) was found. The same stands for pre vs. post motor bursts, except for the expected finding that post-motor bursts are more frequent and tend to be of greater amplitude (yielding the observation of a so-called beta rebound, on average across trials).

      Overall, and despite substantial methodological explorations and the description of two modes of propagation, the study falls short of advancing our understanding of the functional role of movement related beta bursts.

      For these reasons, the expected impact of the study on the field may be limited. The data is also relatively limited (simple button presses), in terms of behavioral features that could be related to the neurophysiological observations. One missed opportunity to explain the functional role of the distinct propagation patterns reported would have been, for instance, to measure the cortical "destination" of their respective trajectories.

      In response to this comment, we would like to highlight two important points.

      First, our work constitutes the first non-invasive human confirmation of invasive work in animals (Balasubramanian et al., 2020; Best et al., 2016; Roberts et al., 2019; Rubino et al., 2006; Rule et al., 2018; Takahashi et al., 2011, 2015) and patients (Takahashi et al., 2011). Thus, these results bridge recordings limited to the size of multielectrode arrays (roughly 0.16 cm²; Balasubramanian et al., 2020; Best et al., 2016; Rubino et al., 2006; Takahashi et al., 2011, 2015) and human EEG recordings spanning large areas of the cortex and several functionally distinct regions (Alexander et al., 2016; Stolk et al., 2019). The ability to access these neural signatures non-invasively is important for cross-species comparison. This further enables us to provide an in-depth analysis of the spatiotemporal diversity of human MEG signals and a detailed characterisation of the two propagation directions, which significantly extends previous reports. We note that their functional role also remains undetermined in these animal studies, but being able to identify these signals in humans can now provide a stepping stone for identifying their role.

      Second, and related, the reviewers are correct that we did not observe distinct propagation directions between pre- and post-movement bursts, nor a relationship with reaction time. However, such a null result would be relevant, in our view, towards understanding what the functional relevance of these signals, if any, might be. Recent work in macaques indicates that the spatiotemporal patterns of high-gamma activity carry kinematic information about the upcoming movement (Liang et al 2023). The functional role of beta may therefore be more complex and not relate to reaction times or kinematics in a straightforward manner. We believe this is a relevant observation, and in keeping with the continued efforts to identify how sensorimotor beta relates to behaviour. It is increasingly clear that spatiotemporal diversity in animal recordings and human E/MEG and intracranial recordings can constitute a substantial proportion of the measured dynamics. As such, our report is relevant in narrowing down what these signals may reflect.

      Together, we think that our work provides new insights into the multidimensional and propagating features of burst activity. This is important for the entire electrophysiology community, as it transforms how we commonly analyse and interpret these important brain signals. We anticipate that our work will guide and inspire future work on the mechanistic underpinnings of these dominant neural signals. We are confident that our article has the scope to reach out to the diverse readership of eLife.

      Reviewer #2 (Public Review):

      The authors devised novel and interesting experiments using high precision human MEG to demonstrate the propagation of beta oscillation events along two axes in the brain. Using careful analysis, they show different properties of beta events pre- and post movement, including changes in amplitude. Due to beta's prominent role in motor system dynamics, these changes are therefore linked to behavior and offer insights into the mechanisms leading to movement. The linking of wave-like phenomena and transient dynamics in the brain offers new insight into two paradigms about neural dynamics, offering new ways to think about each phenomenon on its own.

      Although there is a substantial, and recent, body of literature supporting the conclusions that beta and other neural oscillations are transient, care must be taken when analyzing the data and the resulting conclusions about beta properties in both time and space. For example, modifying the threshold at which beta events are detected could alter their reported properties and expression in space and time. The authors should therefore perform parameter sweeps on e.g. the thresholds for detection of oscillation bursts to determine whether their conclusions on beta properties and propagation hold. If this additional analysis does not change their story, it would lend confidence in the results/conclusions.

      We thank the reviewing team for this comment. As suggested, we evaluated the effect of different burst thresholds on the burst parameters.

      The threshold in the main analysis was determined empirically from the data, as in previous work (Little et al., 2019). Specifically, trial-wise power was correlated with the burst probability across a range of different threshold values (from median to median plus seven standard deviations (std), in steps of 0.25, see Figure 6-figure supplement 1). The threshold value that retained the highest correlation between trial-wise power and burst probability was used to binarize the data.
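
      The selection procedure described above can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB code. It assumes a hypothetical array `power` of shape (trials, time points), takes the median and standard deviation over all pooled samples, and defines burst probability as the fraction of samples in a trial that exceed the threshold. The function name and these definitions are assumptions made for illustration.

```python
import numpy as np


def select_burst_threshold(power, k_values=None):
    """Pick the threshold (median + k * std) that maximises the correlation
    between trial-wise mean power and trial-wise burst probability.

    power: hypothetical array of shape (n_trials, n_times).
    """
    if k_values is None:
        # Median to median + 7 std, in steps of 0.25 std.
        k_values = np.arange(0.0, 7.0 + 0.25, 0.25)

    med = np.median(power)
    std = np.std(power)
    trial_power = power.mean(axis=1)

    correlations = np.full(len(k_values), np.nan)
    for i, k in enumerate(k_values):
        # Assumed definition: fraction of suprathreshold samples per trial.
        burst_prob = (power > med + k * std).mean(axis=1)
        if burst_prob.std() > 0:  # correlation is undefined for constant input
            correlations[i] = np.corrcoef(trial_power, burst_prob)[0, 1]

    best = int(np.nanargmax(correlations))
    return k_values[best], med + k_values[best] * std, correlations
```

      Calling `select_burst_threshold(power)` returns the selected multiplier, the corresponding threshold, and the full correlation curve, which can be plotted against the multiplier to inspect how sharply the optimum is defined.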

      We repeated our original analysis using four additional thresholds, i.e., original threshold - 0.5 std, -0.25 std, +0.25 std, +0.5 std. As one would expect, burst threshold is negatively related to the number of bursts (i.e., higher thresholds yield fewer bursts, Figure R4a [top]), and positively related to burst amplitude (i.e., higher thresholds yield higher burst amplitudes, Figure R4a [bottom]).
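
      The direction of these two relationships can be illustrated with a small sweep. The Python sketch below makes simplifying assumptions: it counts suprathreshold samples rather than detected bursts, and it uses the mean of the suprathreshold samples as a stand-in for burst amplitude. The function name, the `power` array and the `base_k` multiplier are hypothetical.

```python
import numpy as np


def sweep_thresholds(power, base_k, offsets=(-0.5, -0.25, 0.0, 0.25, 0.5)):
    """Re-threshold the data at base_k + offset (in std units) and report,
    for each offset, the number and mean value of suprathreshold samples."""
    med = np.median(power)
    std = np.std(power)
    rows = []
    for off in offsets:
        thr = med + (base_k + off) * std
        mask = power > thr
        n_above = int(mask.sum())
        # Mean is undefined when nothing exceeds the threshold.
        mean_amp = float(power[mask].mean()) if n_above else float("nan")
        rows.append((off, thr, n_above, mean_amp))
    return rows
```

      For any data set, raising the threshold can only remove suprathreshold samples and can only raise their mean, which mirrors the negative and positive relationships reported for burst number and burst amplitude.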

      Similarly, the temporal duration of bursts and apparent spatial width are modulated by the burst threshold: lowering the threshold leads to longer temporal duration and larger apparent spatial width, while increasing the threshold leads to shorter temporal duration and smaller apparent spatial width (Figure R4b). Note that for the temporal and spectral burst characteristics, the difference to the original threshold can be numerically zero, i.e., changing the burst threshold did not lead to changes exceeding the temporal and spectral resolution of the applied time-frequency transformation (i.e., 200ms and 1Hz respectively).

      Importantly, across these threshold values, the propagation direction and propagation speed remain comparable.

      We now include this result as Figure 6-figure supplement 2 and refer to this analysis in the manuscript (page 28 line 717).

      “To explore the robustness of the results analyses were repeated using a range of thresholds (Figure 6-figure supplement 2).”

      Determining the generators of beta events at different locations is a tricky issue. The authors mentioned a single generator that is responsible for propagating beta along the two axes described. However, it is not clear through what mechanism the beta events could travel along the neural substrate without additional local generators along the way. Previous work on beta events examined how a sequence of synaptic inputs to supra and infragranular layers would contribute to a typical beta event waveform. Although it is possible other mechanisms exist, how might this work as the beta events propagate through space? Some further explanation/investigation on these issues is therefore warranted.

      Based on this and other comments (i.e., comments 7 and 8) we re-evaluated the use of the term ‘generator’ in this manuscript.

      While the term generator can be used across scales, from micro- to macroscale, for the purpose of the present paper, we believe one should differentiate at least two concepts: a) generator of beta bursts, and b) generator of travelling waves.

      We realised that in the previous version of the manuscript the term ‘generator’ was at times used without context. We removed the term where no longer necessary.

      Further, the previous version of the manuscript discussed putative generators of travelling waves (page 19f.) but not generators of beta bursts. We now address this as follows:

      “Studies using biophysical modelling have proposed that beta bursts are generated by a broad infragranular excitatory synaptic drive temporally aligned with a strong supragranular synaptic drive (Law et al., 2022; Neymotin et al., 2020; Sherman et al., 2016; Shin et al., 2017) whereby layer specific inhibition acts to stabilise beta bursts in the temporal domain (West et al., 2023). The supragranular drive is thought to originate in the thalamus (E. G. Jones, 1998, 2001; Mo & Sherman, 2019; Seedat et al., 2020), indicating thalamocortical mechanisms (page 22f).”

      Once the mechanisms have been better understood, a question arises of how much the results generalize to other oscillation frequencies and other brain areas. On the first question of other oscillation frequencies, the authors could easily test whether nearby frequency bands (alpha and low gamma) have similar properties. This would help to determine whether the observations/conclusions are unique to beta, or more generally applicable to transient bursts/waves in the brain. On the second issue of applicability to other brain areas, the authors could relate their work to transient bursts and waves recorded using ECoG and/or iEEG. Some recent work on traveling waves at the brain-wide level would be relevant for such comparisons.

      We appreciate the enthusiasm and the suggestions. To comment on the frequency specificity of the observed effects we conducted the same analysis focusing on the gamma frequency range (60-90 Hz). For computational reasons, we limited this analysis to one subject. Figure R1 shows the polar probability histogram for the beta frequency range (left) and the gamma frequency range (right). In contrast to the beta frequency range, no dominant directions were observed for the gamma range and von Mises functions did not converge. These preliminary results suggest some frequency specificity of the spatiotemporal pattern in sensorimotor beta activity. We believe this paves the way for future analysis mapping propagation direction across frequency and space.
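
      A polar probability histogram of the kind shown in Figure R1 can be computed as sketched below. This is a minimal Python illustration, not the authors' code: the function name, the input `directions_rad` (one propagation direction per burst, in radians) and the choice of 36 bins are assumptions, and the von Mises fitting step is not reproduced here.

```python
import numpy as np


def polar_probability_histogram(directions_rad, n_bins=36):
    """Bin propagation directions on the circle and normalise the counts so
    that the bin probabilities sum to 1."""
    directions_rad = np.asarray(directions_rad, dtype=float)
    if directions_rad.size == 0:
        raise ValueError("at least one direction is required")
    # Wrap all angles into [0, 2*pi) before binning.
    wrapped = np.mod(directions_rad, 2 * np.pi)
    counts, edges = np.histogram(wrapped, bins=n_bins, range=(0.0, 2 * np.pi))
    return counts / counts.sum(), edges
```

      A distribution with dominant propagation directions produces clear peaks in the returned probabilities, whereas a near-uniform distribution gives values close to `1 / n_bins` in every bin.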

      Here we did not investigate the spatial specificity of the effects, as the beta frequency range is dominant in sensorimotor areas. Investigating beta bursts in other cortical areas would have likely resulted in very few bursts. We discuss our results across spatial scales in the section: Distinct anatomical propagation axes of sensorimotor beta activity. However, please note that most of the previous literature operates on a different spatial scale (roughly 4mm; Balasubramanian et al., 2020; Best et al., 2016; Rubino et al., 2006; Rule et al., 2018; Takahashi et al., 2011, 2015) and different species (e.g., non-human primates). Non-invasive recordings in humans capture temporospatial patterns of a very different scale, i.e., often across the whole cortex (Alexander et al., 2016; Roberts et al., 2019). Comparing spatiotemporal patterns across different spatial scales is inherently difficult. Work investigating different spatial scales simultaneously, such as Sreekumar et al. 2020, is required to fully unpack the relationship between mesoscopic and macroscopic spatiotemporal patterns.

      Figure R1: Spatiotemporal organisation for the beta (β, 13-30Hz) and gamma (γ, 60-90Hz) frequency range for one exemplar subject. Same as Figure 4a, but for one exemplar subject.

      If the source code could be provided on github along with documentation and a standard "notebook" on use other researchers would benefit greatly.

      All analyses are performed using freely available tools in MATLAB. The code carrying out the analysis in this paper can be found here: [link provided upon acceptance]. The 3D burst analyses can be very computationally intensive even on a modern computer system. The analyses in this paper were computed on a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 and 32 GB of RAM. Details on the installation and setup of the dependencies can be found in the README.md file in the main study repository.

      This information has been added to the paper in the methods section on page 35.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript provides a comprehensive investigation of the effects of the genetic ablation of three different transcription factors (Srf, Mrtfa, and Mrtfb) in the inner ear hair cells. Based on the published data, the authors hypothesized that these transcription factors may be involved in the regulation of the genes essential for building the actin-rich structures at the apex of hair cells, the mechanosensory stereocilia and their mechanical support - the cuticular plate. Indeed, the authors found that two of these transcription factors (Srf and Mrtfb) are essential for the proper formation and/or maintenance of these structures in the auditory hair cells. Surprisingly, Srf- and Mrtfb-deficient hair cells exhibited somewhat similar abnormalities in the stereocilia and in the cuticular plates even though these transcription factors have very different effects on the hair cell transcriptome. Another interesting finding of this study is that the hair cell abnormalities in Srf-deficient mice could be rescued by AAV-mediated delivery of Cnn2, one of the downstream targets of Srf. However, despite a rather comprehensive assessment of the novel mouse models, the authors do not have yet any experimentally testable mechanistic model of how exactly Srf and Mrtfb contribute to the formation of actin cytoskeleton in the hair cells. The lack of any specific working model linking Srf and/or Mrtfb with stereocilia formation decreases the potential impact of this study.

      Major comments:

      Figures 1 & 3: The conclusion on abnormalities in the actin meshwork of the cuticular plate was based largely on the comparison of the intensities of phalloidin staining in separate samples from different groups. In general, any comparison of the intensity of fluorescence between different samples is unreliable, no matter how carefully one could try matching sample preparation and imaging conditions. In this case, two other techniques would be more convincing: 1) quantification of the volume of the cuticular plates from fluorescent images; and 2) direct examination of the cuticular plates by transmission electron microscopy (TEM).

      In fact, the manuscript provides no single TEM image of the F-actin abnormalities either in the cuticular plate or in the stereocilia, even though these abnormalities seem to be the major focus of the study. Overall, it is still unclear what exactly Srf or Mrtfb deficiencies do with F-actin in the hair cells.

      Yes, we agree. As suggested by the reviewer, to directly examine the defects in F-actin organization within the cuticular plate of mutant mice, we conducted Transmission Electron Microscopy (TEM) analyses. The results, as presented in the revised Figures 1 and 4 (panels F, G, and E, F, respectively), provide crucial insights into the structural changes in the cuticular plate. Meanwhile, the comparison of the volume of the phalloidin labeled cuticular plate after 3-D reconstruction using Imaris software was conducted and shown in Author response image 1. The results of the cuticular plate (CP) volume were consistent with the relative F-actin intensity change of the cuticular plate in the revised Figures 1B and 4B. For the TEM analysis of the stereocilia, we regret that due to time constraints, we were unable to collect TEM images of stereocilia with sufficient quality for a meaningful comparison. However, we believe that the data we have presented sufficiently addresses the primary concerns, and we appreciate the reviewers’ understanding of these limitations.

      Author response image 1.

      Figures 2 & 4 represent another example of how deceiving a simple comparison of the intensity of fluorescence between the genotypes can be. It is not clear whether the reduced immunofluorescence of the investigated molecules (ESPN1, EPS8, GNAI3, or FSCN2) results from their mis-localization or represents a simple consequence of the fact that a thinner stereocilium would always have a smaller signal of the protein of interest, even though the ratio of this protein to the number of actin filaments remains unchanged. According to my examination of the representative images of these figures, loss of Srf produces mis-localization of the investigated proteins and irregular labeling in different stereocilia of the same bundle, while loss of Mrtfb does not. Obviously, a simple quantification of the intensity of fluorescence conceals these important differences.

      Yes, we agree. In addition to the quantification of tip protein intensity, we have added a few more analyses in the revised Figure 3 and Figure 6, such as the percentage of row 1 tip stereocilia with tip protein staining and the percentage of IHCs with tip protein staining on row 2 tip. Using the results mentioned above, the differences in the expression level, the row-specific distribution and the irregular labeling of tip proteins between the control and the mutants can be analyzed more thoroughly.

      Reviewer #2 (Public Review):

      The analysis of bundle morphology using both confocal and SEM imaging is a strength of the paper and the authors have some nice images, especially with SEM. Still, the main weakness is that it is unclear how significant their findings are in terms of understanding bundle development; the mouse phenotypes are not distinct enough to make it clear that they serve different functions so the reader is left wondering what the main takeaway is.

      Based on the reviewer’s comments, in this revised manuscript, we put more emphasis on describing the effects of SRF and MRTFB on key tip proteins’ localization pattern during stereocilia development, represented by ESPN1, EPS8 and GNAI3, as well as the effects of SRF and MRTFB on the F-actin organization of cuticular plate using TEM. We have made substantial efforts to interpret the mechanistic underpinnings of the roles of SRF and MRTFB in hair cells. This is reflected in the revised Figures 1, 3, 4, 6, and 10, where we provide more comprehensive insights into the mechanisms at play.

      We interpret our data as indicating that SRF and MRTFB regulate the development and maintenance of the hair cell’s actin cytoskeleton in a complementary manner. Deletion of either gene thus results in somewhat similar phenotypes in hair cell morphology, despite the surprising lack of overlap of SRF and MRTFB downstream targets in the hair cell.

      In Figure 1 and 3, changes in bundle morphology clearly don't occur until after P5. Widening still occurs to some extent but lengthening does not and instead the stereocilia appear to shrink in length. EPS8 levels appear to be the most reduced of all the tip proteins (Srf mutants) so I wonder if these mutants are just similar to an EPS8 KO if the loss of EPS8 occurred postnatally (P0-P5).

      To address this question, we performed EPS8 staining on control and Srf cKO hair cells at P4 and P10. We found that the dramatic decrease of the row 1 tip signal for EPS8 began at P4 in Srf cKO IHCs. Although the major hair bundle phenotypes of Eps8 KO, including defective row 1 stereocilia lengthening and additional rows of short stereocilia, also appeared in Srf cKO IHCs, there are still some differences in bundle morphology between Eps8 KO and Srf cKO. First, both Eps8 KO OHCs and IHCs showed additional rows of short stereocilia, but we only observed additional rows of short stereocilia in Srf cKO IHCs. Second, in the study by Zampini et al. (2011), SEM and TEM images did not show an obvious reduction of row 2 stereocilia widening (P18-P35), while our analysis of SEM images confirmed that the width of row 2 IHC stereocilia was drastically reduced by 40% in Srf cKO (P15). Overall, we think that although Srf cKO hair bundles are somewhat similar to Eps8 KO, the Srf cKO hair bundle phenotype might be governed by multiple candidate genes acting cooperatively.

      Reference:

      Valeria Zampini, et al. Eps8 regulates hair bundle length and functional maturation of mammalian auditory hair cells. PLoS Biol. 2011 Apr;9(4): e1001048.

      A major shortcoming is that there are few details on how the image analyses were done. Were SEM images corrected for shrinkage? How was each of the immunocytochemistry quantitation (e.g., cuticular plates for phalloidin and tip staining for antibodies) done? There are multiple ways of doing this but there are few indications in the manuscript.

      We apologize for not describing the image analysis procedure clearly enough. As described in a study from Nicolas Grillet’s group (Miller et al., 2021), live and mildly-fixed IHC stereocilia have similar dimensions, while SEM preparation results in a hair bundle at a 2:3 scale compared to the live preparation. In our study, the hair cells selected for SEM imaging and measurements were located in the basal turn (30-32kHz), while the hair cells selected for fluorescence-based imaging and measurements were located in the middle turn (20-24kHz) or the basal turn (32-36kHz). Although our SEM imaging and fluorescence-based imaging of basal turn hair bundles were not from exactly the same area, the control hair bundles imaged by SEM showed row 1 stereocilia lengths 10%-20% shorter than those of the control hair bundles imaged by fluorescence (revised Figure 2 and Figure 5). Overall, our stereocilia dimension data show the shrinkage expected from SEM preparation.

      Recognizing the need for clarity, we have provided a detailed description of our image quantification and analysis procedures in the “Materials and Methods” section, specifically under “Immunocytochemistry.” This will aid readers in understanding our methodologies and ensure transparency in our approach.

      Reference:

      Katharine K Miller, et al. Dimensions of a Living Cochlear Hair Bundle. Front Cell Dev Biol. 2021 Nov 25:9:742529.

      The tip protein analysis in Figs 2 and 4 is nice but it would be nice for the authors to show the protein staining separately from the phalloidin so you could see how restricted to the tips it is (each in grayscale). This is especially true for the CNN2 labeling in Fig 7 as it does not look particularly tip specific in the x-y panels. It would be especially important to see the antibody staining in the reslices separate from phalloidin.

      Thank you for the suggestions. We have shown tip proteins staining in grayscale separately from the phalloidin in the revised Figure 3 and Figure 6. To clearly show the tip-specific localization of CNN2, we conducted CNN2 staining at different ages during hair bundle development and showed CNN2 labeling in grayscale and in reslices in revised Figure 9-figure supplement 1B.

      In Fig 6, why was the transcriptome analysis at P2 given that the phenotype in these mice occurs much later? While redoing the transcriptome analysis is probably not an option, an alternative would be to show more examples of EPS8/GNAI/CNN2 staining in the KO, but at younger ages closer to the time of PCR analysis, such as at P5. Pinpointing when the tip protein intensities start to decrease in the KOs would be useful rather than just showing one age (P10).

      We agree with the reviewer. To address this question, we have performed ESPN1, EPS8 and GNAI3 staining on control and mutant hair cells at P4, P10 and P15 (the revised Figures 3 and 6). According to the new results, the dramatic decreases of the row 1 tip signal for ESPN1 and EPS8 began at P4 in Srf cKO IHCs, which is consistent with the appearance of the mild reduction of row 1 stereocilia length in P5 Srf cKO IHCs. For Mrtfb cKO hair cells, an obvious reduction of the row 1 tip signal for ESPN1 was not observed until P10. However, a few genes related to cell adhesion and regulation of the actin cytoskeleton were significantly down-regulated in the P2 Mrtfb-deficient hair cell transcriptome. We think that MRTFB may not play a major role in the regulation of stereocilia development in hair cells, so the morphological defects of stereocilia appeared much later in the Mrtfb mutant than in the Srf mutant.

      While it is certainly interesting if it turns out CNN2 is indeed at tips in this phase, the experiments do not tell us that much about what role CNN2 may be playing. It is notable that in Fig 7E in the control+GFP panel, CNN2 does not appear to be at the tips. Those images are at P11 whereas the images in panel A are at P6 so perhaps CNN2 decreases after the widening phase. An important missing control is the Anc80L65-Cnn2 AAV in a wild-type cochlea.

      We agree with the reviewer. We have conducted more immunostaining experiments to confirm the expression pattern of CNN2 during the stereocilia development, from P0 to P11. The results were included in the revised Figure 9-figure supplement 1B. As the reviewer suggested, CNN2 expression pattern in control cochlea injected with Anc80L65-Cnn2 AAV has also been provided in revised Figure 9E.

    1. Author response:

      Reviewer #1 (Public Review):

      In this paper, Tompary & Davachi present work looking at how memories become integrated over time in the brain, and relating those mechanisms to responses on a priming task as a behavioral measure of memory linkage. They find that remotely but not recently formed memories are behaviorally linked and that this is associated with a change in the neural representation in mPFC. They also find that the same behavioral outcomes are associated with the increased coupling of the posterior hippocampus with category-sensitive parts of the neocortex (LOC) during a post-learning rest period-again only for remotely learned information. There was also correspondence in rest connectivity (posterior hippocampus-LOC) and representational change (mPFC) such that for remote memories specifically, the initial post-learning connectivity enhancement during rest related to longer-term mPFC representational change.

      This work has many strengths. The topic of this paper is very interesting, and the data provide a really nice package in terms of providing a mechanistic account of how memories become integrated over a delay. The paper is also exceptionally well-written and a pleasure to read. There are two studies, including one large behavioral study, and the findings replicate in the smaller fMRI sample. I do however have two fairly substantive concerns about the analytic approach, where more data will be required before we can know whether the interpretations are an appropriate reflection of the findings. These and other concerns are described below.

      Thank you for the positive comments! We are proud of this work, and we feel that the paper is greatly strengthened by the revisions we made in response to your feedback. Please see below for specific changes that we’ve made.

      1) One major concern relates to the lack of a pre-encoding baseline scan prior to recent learning.

      a) First, I think it would be helpful if the authors could clarify why there was no pre-learning rest scan dedicated to the recent condition. Was this simply a feasibility consideration, or were there theoretical reasons why this would be less "clean"? Including this information in the paper would be helpful for context. Apologies if I missed this detail in the paper.

      This is a great point and something that we struggled with when developing this experiment. We considered several factors when deciding whether to include a pre-learning baseline on day two. First, the day 2 scan session was longer than that of day 1 because it included the recognition priming and explicit memory tasks, and the addition of a baseline scan would have made the session longer than a typical scan session (about 2 hours in the scanner in total), and we were concerned that participant engagement would be difficult to sustain across a longer session. Second, we anticipated that the pre-learning scan would not have been a ‘clean’ measure of baseline processing, but rather would include signal related to post-learning processing of the day 1 sequences, as multi-variate reactivation of learned stimuli has been observed in rest scans collected 24 hours after learning (Schlichting & Preston, 2014). We have added these considerations to the Discussion (page 39, lines 1047-1070).

      b) Second, I was hoping the authors could speak to what they think is reflected in the post-encoding "recent" scan. Is it possible that these data could also reflect the processing of the remote memories? I think, though am not positive, that the authors may be alluding to this in the penultimate paragraph of the discussion (p. 33) when noting the LOC-mPFC connectivity findings. Could there be the reinstatement of the old memories due to being back in the same experimental context and so forth? I wonder the extent to which the authors think the data from this scan can be interpreted as strictly reflecting recent memories, particularly given it is relative to the pre-encoding baseline from before the remote memories, as well (and therefore in theory could reflect both the remote + recent). (I should also acknowledge that, if it is the case that the authors think there might be some remote memory processing during the recent learning session in general, a pre-learning rest scan might not have been "clean" either, in that it could have reflected some processing of the remote memories, i.e., perhaps a clean pre-learning scan for the recent learning session related to point 1a is simply not possible.)

      We propose that theoretically, the post-learning recent scan could indeed reflect a mixture of remote and recent sequences. This is one of the drawbacks of splitting encoding into two sessions rather than combining encoding into one session and splitting retrieval into an immediate and delayed session; any rest scans that are collected on Day 2 may have signal that relates to processing of the Day 1 remote sequences, which is why we decided against the pre-learning baseline for Day 2, as you had noted.

      You are correct that we alluded to this in our original submission when discussing the LOC-mPFC coupling result, and we have taken steps to discuss this more explicitly. In brief, we find greater LOC-mPFC connectivity only after recent learning relative to the pre-learning baseline, and cortical-cortical connectivity could be indicative of processing memories that already have undergone some consolidation (Takashima et al., 2009; Smith et al., 2010). From another vantage point, the mPFC representation of Day 1 learning may have led to increased connectivity with LOC on Day 2 due to Day 1 learning beginning to resemble consolidated prior knowledge (van Kesteren et al., 2010). While this effect is consistent with prior literature and theory, it's unclear why we would find evidence of processing of the remote memories and not the recent memories. Furthermore, the change in LOC-mPFC connectivity in this scan did not correlate with memory behaviors from either learning session, which could be because signal from this scan reflects a mix of processing of the two different learning sessions. With these ideas in mind, we have fleshed out the discussion of the post-encoding ‘recent’ scan in the Discussion (pages 38-39, lines 1039-1044).

      c) Third, I am thinking about how both of the above issues might relate to the authors' findings, and would love to see more added to the paper to address this point. Specifically, I assume there are fluctuations in baseline connectivity profile across days within a person, such that the pre-learning connectivity on day 1 might be different from on day 2. Given that, and the lack of a pre-learning connectivity measure on day 2, it would logically follow that the measure of connectivity change from pre- to post-learning is going to be cleaner for the remote memories. In other words, could the lack of connectivity change observed for the recent scan simply be due to the lack of a within-day baseline? Given that otherwise, the post-learning rest should be the same in that it is an immediate reflection of how connectivity changes as a function of learning (depending on whether the authors think that the "recent" scan is actually reflecting "recent + remote"), it seems odd that they both don't show the same corresponding increase in connectivity-which makes me think it may be a baseline difference. I am not sure if this is what the authors are implying when they talk about how day 1 is most similar to prior investigation on p. 20, but if so it might be helpful to state that directly.

      We agree that it is puzzling that hippocampal-LOC connectivity does not also increase after recent learning, as it does after remote learning. However, the fact that there is an increase from baseline rest to post-recent rest in mPFC-LOC connectivity suggests that the issue is not with the baseline, but rather that the post-recent learning scan reflects processing of the remote memories (although as a caveat, there is no relationship with priming).

      On what is now page 23, we were referring to the notion that the Day 1 procedure (baseline rest, learning, post-learning rest) is the most straightforward replication of past work that finds a relationship between hippocampal-cortical coupling and later memory. In contrast, the Day 2 learning and rest scan are less ‘clean’ of a replication in that they are taking place in the shadow of Day 1 learning. We have clarified this in the Results (page 23, lines 597-598).

      d) Fourth and very related to my point 1c, I wonder if the lack of correlations for the recent scan with behavior is interpretable, or if it might just be that this is a noisy measure due to imperfect baseline correction. Do the authors have any data or logic they might be able to provide that could speak to these points? One thing that comes to mind is seeing whether the raw post-learning connectivity values (separately for both recent and remote) show the same pattern as the different scores. However, the authors may come up with other clever ways to address this point. If not, it might be worth acknowledging this interpretive challenge in the Discussion.

      We thought of three different approaches that could help us to understand whether the lack of correlations between coupling and behavior in the recent scan was due to noise. First, we correlated recognition priming with raw hippocampal-LOC coupling separately for pre- and post-learning scans, as in Author response image 1:

      Author response image 1.

      Note that the post-learning chart depicts the relationship between post-remote coupling and remote priming and between post-recent coupling and recent priming (middle). Essentially, post-recent learning coupling did not relate to priming of recently learned sequences (middle; green) while there remains a trend for a relationship between post-remote coupling and priming for remotely learned sequences (middle; blue). However, the significant relationship between coupling and priming that we reported in the paper (right, blue) is driven both by the initial negative relationship that is observed in the pre-learning scan and the positive relationship in the post-remote learning scan. This highlights the importance of using a change score, as there may be spurious initial relationships between connectivity profiles and to-be-learned information that would then mask any learning- and consolidation-related changes.
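
      The logic of the change score can be illustrated with a minimal Python sketch. All values below are hypothetical and chosen only to mimic the pattern described above (a negative pre-learning relationship and a positive post-learning one); they are not data from the study, and `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical values for five participants (not data from the study).
priming = [1.0, 2.0, 3.0, 4.0, 5.0]        # priming effect, arbitrary units
pre_coupling = [0.5, 0.4, 0.3, 0.2, 0.1]   # baseline hippocampal-LOC coupling
post_coupling = [0.1, 0.3, 0.2, 0.4, 0.3]  # post-learning coupling

# Change score: post-learning minus pre-learning coupling, per participant.
change = [post - pre for post, pre in zip(post_coupling, pre_coupling)]

r_pre = pearson_r(pre_coupling, priming)
r_post = pearson_r(post_coupling, priming)
r_change = pearson_r(change, priming)
print(f"pre: {r_pre:.2f}, post: {r_post:.2f}, change: {r_change:.2f}")
```

      In this toy case the change score relates to priming more strongly than the raw post-learning coupling does, because it also removes the initial negative relationship.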

      We also reasoned that if comparisons between the post-recent learning scan and the baseline scan are noisier than those between the post-remote learning scan and the baseline scan, there may be differences in the variance of the change scores across participants, such that changes in coupling from baseline to post-recent rest may be more variable than changes from baseline to post-remote rest. We conducted an F-test to compare the variances of these two changes in hippocampal-LOC coupling and found no reliable difference (variance ratio: F(22, 22) = 0.811, p = .63).
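
      For readers who want to see how a variance-ratio F-test of this kind works, here is a minimal standard-library sketch. It assumes 23 participants per condition (inferred from the reported degrees of freedom) and a two-sided test. The function names are illustrative, the F distribution is integrated numerically rather than taken from a statistics package, and the input variances are placeholders chosen only so that their ratio equals the reported F.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0:
        return 0.0  # correct at x = 0 only when d1 > 2, which is assumed here
    log_pdf = (
        math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
        + (d1 / 2) * math.log(d1 / d2)
        + (d1 / 2 - 1) * math.log(x)
        - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
    )
    return math.exp(log_pdf)

def f_cdf(x, d1, d2, steps=2000):
    """CDF by composite Simpson's rule on [0, x]; steps must be even."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def variance_ratio_test(var_a, var_b, n_a, n_b):
    """Two-sided F-test for equality of two sample variances."""
    f_stat = var_a / var_b
    cdf = f_cdf(f_stat, n_a - 1, n_b - 1)
    return f_stat, 2 * min(cdf, 1 - cdf)

# Placeholder variances whose ratio matches the reported F; n = 23 is inferred.
f_stat, p_value = variance_ratio_test(0.811, 1.0, n_a=23, n_b=23)
print(f"F(22, 22) = {f_stat:.3f}, two-sided p = {p_value:.2f}")
```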

      Finally, we evaluated whether hippocampal-LOC coupling was correlated across different rest scans (see Author response image 2), comparing two scans within the same imaging session (baseline and post-remote) with two scans from separate sessions (baseline and post-recent). We reasoned that if coupling was more strongly correlated across the baseline and post-remote scans than across the baseline and post-recent scans, that would indicate within-session stability of participants’ connectivity profiles. By the same logic, a weaker correlation across the baseline and post-recent scans would indicate a noisier change measure, as the measure would additionally include a change in individuals’ connectivity profiles over time. Interestingly, coupling was not reliably correlated across scans in either case, and the two correlations did not differ from each other (baseline/post-remote: r = 0.03, p = 0.89; baseline/post-recent: r = 0.07, p = .74; difference: Steiger’s t = 0.12, p = 0.9).

      Author response image 2.
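
      A comparison of two dependent correlations that share one variable (here, the baseline scan) can be sketched as follows. This uses Williams's t, the form recommended by Steiger (1980); whether this is the exact variant behind the reported statistic is an assumption. The correlation between the two non-shared scans (`r23`) is not reported above, so the value in the usage line is purely hypothetical, as is the sample size of 23.

```python
import math

def williams_t(r12, r13, r23, n):
    """Williams's t for two dependent correlations sharing variable 1.

    r12, r13: the two correlations being compared
    r23: correlation between the two non-shared variables
    n: number of participants; the statistic has n - 3 degrees of freedom
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt(numerator / denominator)

# r12 and r13 are the correlations reported above; r23 and n are hypothetical.
t_stat = williams_t(r12=0.03, r13=0.07, r23=0.0, n=23)
print(f"t({23 - 3}) = {t_stat:.2f}")
```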

      We have included the raw correlations with priming (page 25, lines 654-661, Supplemental Figure 6) as well as text describing the comparison of variances (page 25, lines 642-653). We did not add the comparison of hippocampal-LOC coupling across scans to the current manuscript, as an evaluation of stability of such coupling in the context of learning and reactivation seems out of scope of the current focus of the experiment, but we find this result to be worthy of follow-up in future work.

      In summary, further analysis of our data did not reveal any indication that comparing rest connectivity across scan sessions introduced noise into the change score between the baseline and post-recent learning scans. However, these analyses cannot fully rule that possibility out, and the current analyses do not provide concrete evidence that the post-recent learning scan comprises signals that are a mixture of processing of recent and remote sequences. We discuss these drawbacks in the Discussion (page 39, lines 1047-1070).

      2) My second major concern is how the authors have operationalized integration and differentiation. The pattern similarity analysis uses an overall correspondence between the neural similarity and a predicted model as the main metric. In the predicted model, C items that are indirectly associated are more similar to one another than are C items that are entirely unrelated. The authors are then looking at a change in correspondence (correlation) between the neural data and that prediction model from pre- to post-learning. However, a change in the degree of correspondence with the predicted matrix could be driven by either the unrelated items becoming less similar or the related ones becoming more similar (or both!). Since the interpretation in the paper focuses on change to indirectly related C items, it would be important to report those values directly. For instance, as evidence of differentiation, it would be important to show that there is a greater decrease in similarity for indirectly associated C items than there is for unrelated C items (or even a smaller increase) from pre to post, or that C items that are indirectly related are less similar than are unrelated C items post but not pre-learning. Performing this analysis would confirm that the pattern of results matches the authors' interpretation. This would also impact the interpretation of the subsequent analyses that involve the neural integration measures (e.g., correlation analyses like those on p. 16, which may or may not be driven by increased similarity among overlapping C pairs). I should add that given the specificity to the remote learning in mPFC versus recent in LOC and anterior hippocampus, it is clearly the case that something interesting is going on. However, I think we need more data to understand fully what that "something" is.

      We recognize the importance of understanding whether model fits (and changes to them) are driven by the similarity of overlapping pairs or of non-overlapping pairs. We have modified all figures that visualize fits to the neural integration model to show fits separately for pre- and post-learning (Figure 3 for mPFC, Supp. Figure 5 for LOC, Supp. Figure 9 for AB similarity in anterior hippocampus & LOC). We have additionally added supplemental figures that show the complete breakdown of similarity in each region in a 2 (pre/post) x 2 (overlapping/non-overlapping sequence) x 2 (recent/remote) chart. We decided against replacing the model fits with these latter charts, since the model fits strike a good balance between information and readability. We have also modified text in various sections to focus on these new results.

      In brief, the decrease in model fit for mPFC for the remote sequences was driven primarily by a decrease in similarity for the overlapping C items and not the non-overlapping ones (Supplementary Figure 3, page 18, lines 468-472).

      Interestingly, in LOC, all C items grew more similar after learning, regardless of their overlap or learning session, but the increase in model fit for C items in the recent condition was driven by a larger increase in similarity for overlapping pairs relative to non-overlapping ones (Supp. Figure 5, page 21, lines 533-536).

      We also visualized AB similarity in the anterior hippocampus and LOC in a similar fashion (Supplementary Figure 9).

      We have also edited the Methods sections with updated details of these analyses (page 52, lines 1392-1397). We think that including these results considerably strengthens our claims and we are pleased to have them included.
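
      The ambiguity raised in this comment, namely that the same change in model fit can arise from different underlying changes in similarity, can be made concrete with a small sketch. It assumes, for illustration only, that model fit is a Pearson correlation between pairwise similarity values and a binary model coding overlapping pairs as 1 and non-overlapping pairs as 0. All similarity values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def model_fit(overlapping, non_overlapping):
    """Correlate pair similarities with a model coding overlap as 1, else 0."""
    similarity = list(overlapping) + list(non_overlapping)
    model = [1.0] * len(overlapping) + [0.0] * len(non_overlapping)
    return pearson_r(similarity, model)

def mean(values):
    return sum(values) / len(values)

# Hypothetical pattern-similarity values (not data from the study).
pre = {"overlap": [0.30, 0.34, 0.32], "non": [0.20, 0.24, 0.22, 0.18]}
# Scenario A: only the overlapping pairs become less similar.
post_a = {"overlap": [0.20, 0.26, 0.22], "non": [0.20, 0.24, 0.22, 0.18]}
# Scenario B: only the non-overlapping pairs become more similar.
post_b = {"overlap": [0.30, 0.34, 0.32], "non": [0.30, 0.34, 0.32, 0.28]}

fit_pre = model_fit(pre["overlap"], pre["non"])
for label, post in (("A", post_a), ("B", post_b)):
    fit_change = model_fit(post["overlap"], post["non"]) - fit_pre
    d_overlap = mean(post["overlap"]) - mean(pre["overlap"])
    d_non = mean(post["non"]) - mean(pre["non"])
    print(f"{label}: fit {fit_change:+.2f}, overlap {d_overlap:+.2f}, non {d_non:+.2f}")
```

      Both scenarios lower the model fit, which is why the separate pre/post breakdown by pair type is needed to tell them apart.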

      3) The priming task occurred before the post-learning exposure phase and could have impacted the representations. More consideration of this in the paper would be useful. Most critically, since the priming task involves seeing the related C items back-to-back, it would be important to consider whether this experience could have conceivably impacted the neural integration indices. I believe it never would have been the case that unrelated C items were presented sequentially during the priming task, i.e., that related C items always appeared together in this task. I think again the specificity of the remote condition is key and perhaps the authors can leverage this to support their interpretation. Can the authors consider this possibility in the Discussion?

      It's true that only C items from the same sequence were presented back-to-back during the priming task, and that this presentation may interfere with observations from the post-learning exposure scan that followed it. We agree that it is worth considering this caveat and have added language in the Discussion (page 40, lines 1071-1086). When designing the study, we reasoned that it was more important for the behavioral priming task to come before the exposure scans, as all items were shown only once in that task, whereas they were shown 4-5 times in a random order in the post-learning exposure phase. Because of this difference in presentation times, and because behavioral priming findings tend to be very sensitive, we concluded that it was more important to protect the priming task from the exposure scan instead of the reverse.

      We reasoned, however, that the additional presentation of the C items in the recognition priming task would not substantially override the sequence learning, as C items were each presented 16 times in their sequence (ABC1 and ABC2 16 times each). Furthermore, as this reviewer suggests, the order of C items during recognition was the same for recent and remote conditions, so the fact that we find a selective change in neural representation for the remote condition and don’t also see that change for the recent condition is additional assurance that the recognition priming order did not substantially impact the representations.

      4) For the priming task, based on the Figure 2A caption it seems as though every sequence contributes to both the control and primed conditions, but (I believe) this means that the control transition always happens first (and they are always back-to-back). Is this a concern? If RTs are changing over time (getting faster), it would be helpful to know whether the priming effects hold after controlling for trial numbers. I do not think this is a big issue because if it were, you would not expect to see the specificity of the remotely learned information. However, it would be helpful to know given the order of these conditions has to be fixed in their design.

      This is a correct understanding of the trial orders in the recognition priming task. We chose to involve the baseline items in the control condition to boost power – this way, priming of each sequence could be tested, while only presenting each item once in this task, as repetition in the recognition phase would have further facilitated response times and potentially masked any priming effects. We agree that accounting for trial order would be useful here, so we ran a mixed-effects linear model to examine response times both as a function of trial number and of priming condition (primed/control). While there is indeed a large effect of trial number such that participants got faster over time, the priming effect originally observed in the remote condition still holds when trial number is accounted for. We now report this analysis in the Results section (page 14, lines 337-349 for Expt 1 and pages 14-15, lines 360-362 for Expt 2).
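
      The idea of separating a priming effect from a general speed-up over trials can be sketched with a simplified stand-in. The snippet below fits ordinary least squares with trial number and condition as predictors. It is not the mixed-effects model described above (it has no participant-level random effects), and the response times are noise-free, hypothetical values built so that each control trial is immediately followed by a primed trial.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def ols(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical design: odd trials are control, even trials are primed.
trials = list(range(1, 21))
primed = [0 if t % 2 else 1 for t in trials]
# Hypothetical response times (ms): 2 ms faster per trial, 30 ms priming benefit.
rt = [800 - 2.0 * t - 30.0 * p for t, p in zip(trials, primed)]

design = [[1.0, float(t), float(p)] for t, p in zip(trials, primed)]
intercept, trial_slope, priming_effect = ols(design, rt)
print(f"trial slope: {trial_slope:.1f} ms, priming effect: {priming_effect:.1f} ms")
```

      Because trial number is in the model, the priming coefficient is estimated over and above the speed-up across trials, even though the primed trial always comes second.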

      5) The authors should be cautious about the general conclusion that memories with overlapping temporal regularities become neurally integrated - given their findings in MPFC are more consistent with overall differentiation (though as noted above, I think we need more data on this to know for sure what is going on).

      We realize this conclusion was overly simplistic and, in several places, have revised the general conclusions to be more specific about the nuanced similarity findings.

      6) It would be worth stating a few more details and perhaps providing additional logic or justification in the main text about how the pre- and post-exposure phases were set up and why. How many times each object was presented pre and post, and how the sequencing was determined (were any constraints put in place e.g., such that C1 and C2 did not appear close in time?). What was the cover task (I think this is important to the interpretation & so belongs in the main paper)? Were there considerations involving the fact that this is a different sequence of the same objects the participants would later be learning - e.g., interference, etc.?

      These details can be found in the Methods section (pages 50-51, lines 1337-1353) and we’ve added a new summary of that section in the Results (page 17, lines 424-425 and 432-435). In brief, a visual hash tag appeared on a small subset of images and participants pressed a button when this occurred, and C1 and C2 objects were presented in separate scans (as were A and B objects) to minimize inflated neural similarity due to temporal proximity.

      Reviewer #2 (Public Review):

      The manuscript by Tompary & Davachi presents results from two experiments, one behavior only and one fMRI plus behavior. They examine the important question of how separate object memories (C1 and C2) that are never experienced together in time become linked by shared predictive cues in a sequence (A followed by B followed by one of the C items). The authors developed an implicit priming task that provides a novel behavioral metric for such integration. They find significant C1-C2 priming for sequences that were learned 24h prior to the test, but not for recently learned sequences, suggesting that associative links between the two originally separate memories emerge over an extended period of consolidation. The fMRI study relates this behavioral integration effect to two neural metrics: pattern similarity changes in the medial prefrontal cortex (mPFC) as a measure of neural integration, and changes in hippocampal-LOC connectivity as a measure of post-learning consolidation. While fMRI patterns in mPFC overall show differentiation rather than integration (i.e., C1-C2 representational distances become larger), the authors find a robust correlation such that increasing pattern similarity in mPFC relates to stronger integration in the priming test, and this relationship is again specific to remote memories. Moreover, connectivity between the posterior hippocampus and LOC during post-learning rest is positively related to the behavioral integration effect as well as the mPFC neural similarity index, again specifically for remote memories. Overall, this is a coherent set of findings with interesting theoretical implications for consolidation theories, which will be of broad interest to the memory, learning, and predictive coding communities.

      Strengths:

      1) The implicit associative priming task designed for this study provides a promising new tool for assessing the formation of mnemonic links that influence behavior without explicit retrieval demands. The authors find an interesting dissociation between this implicit measure of memory integration and more commonly used explicit inference measures: a priming effect on the implicit task only evolved after a 24h consolidation period, while the ability to explicitly link the two critical object memories is present immediately after learning. While speculative at this point, these two measures thus appear to tap into neocortical and hippocampal learning processes, respectively, and this potential dissociation will be of interest to future studies investigating time-dependent integration processes in memory.

      2) The experimental task is well designed for isolating pre- vs post-learning changes in neural similarity and connectivity, including important controls of baseline neural similarity and connectivity.

      3) The main claim of a consolidation-dependent effect is supported by a coherent set of findings that relate behavioral integration to neural changes. The specificity of the effects on remote memories makes the results particularly interesting and compelling.

      4) The authors are transparent about unexpected results, for example, the finding that overall similarity in mPFC is consistent with a differentiation rather than an integration model.

      Thank you for the positive comments!

      Weaknesses:

      1) The sequence learning and recognition priming tasks are cleverly designed to isolate the effects of interest while controlling for potential order effects. However, due to the complex nature of the task, it is difficult for the reader to infer all the transition probabilities between item types and how they may influence the behavioral priming results. For example, baseline items (BL) are interspersed between repeated sequences during learning, and thus presumably can only occur before an A item or after a C item. This seems to create non-random predictive relationships such that C is often followed by BL, and BL by A items. If this relationship is reversed during the recognition priming task, where the sequence is always BL-C1-C2, this violation of expectations might slow down reaction times and deflate the baseline measure. It would be helpful if the manuscript explicitly reported transition probabilities for each relevant item type in the priming task relative to the sequence learning task and discussed how a match vs mismatch may influence the observed priming effects.

      We have added a table of transition probabilities across the learning, recognition priming, and exposure scans (now Table 1, page 48). We have also included some additional description of the change in transition probabilities across different tasks in the Methods section. Specifically, if participants are indeed learning item types and rules about their order, then both the control and the primed conditions would violate that order. Since C1 and C2 items never appeared together, viewing C1 would give rise to an expectation of seeing a BL item, which would also be violated. This suggests that our priming effects are driven by sequence-specific relationships rather than learning of the probabilities of different item types. We’ve added this consideration to the Methods section (page 45, lines 1212-1221).

      Another critical point to consider (and one that the transition probabilities do not reflect) is that during learning, while a C item is followed by either an A or a BL item, the specific A or BL item that follows varies. In contrast, a given A is always followed by the same B object, which is always followed by one of two C objects. While the order of item types is semi-predictable, the order of the objects (specific items) themselves is not. This can be seen in the response times during learning, such that response times for A and BL items are always slower than for B and C items. We have explained this nuance in the figure text for Table 1.
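
      Transition probabilities between item types can be estimated by counting adjacent pairs, as in the sketch below. The two item streams are short hypothetical examples built from the task description above (A-B-C triplets with interspersed baseline items during learning, and BL-C1-C2 ordering during recognition priming, with both C items coded as "C"). They are not the actual trial lists, and the resulting values are not those of Table 1.

```python
from collections import Counter, defaultdict

def transition_probabilities(item_types):
    """Estimate P(next type | current type) from an ordered list of item types."""
    counts = defaultdict(Counter)
    for current, following in zip(item_types, item_types[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

# Hypothetical learning stream: triplets with baseline (BL) items between some.
learning = ["A", "B", "C", "BL", "A", "B", "C", "A", "B", "C", "BL", "A", "B", "C"]
# Hypothetical recognition priming stream: BL, then C1, then C2.
priming = ["BL", "C", "C", "BL", "C", "C"]

probs_learning = transition_probabilities(learning)
probs_priming = transition_probabilities(priming)
print("learning:", probs_learning)
print("priming:", probs_priming)
```

      In the toy learning stream a C item is never followed by another C item, whereas in the toy priming stream it often is, which is the kind of mismatch between tasks that the table is meant to expose.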

      2) The choice of what regions of interest to include in the different sets of analyses could be better motivated. For example, even though briefly discussed in the intro, it remains unclear why the posterior but not the anterior hippocampus is of interest for the connectivity analyses, and why the main target is LOC, not mPFC, given past results including from this group (Tompary & Davachi, 2017). Moreover, for readers not familiar with this literature, it would help if references were provided to suggest that a predictable > unpredictable contrast is well suited for functionally defining mPFC, as done in the present study.

      We have clarified our reasoning for each of these choices throughout the manuscript and believe that our logic is now much more transparent. For an expanded reasoning of why we were motivated to look at posterior and not anterior hippocampus, see pages 6-7, lines 135-159, and our response to R2. In brief, past research focusing on post-encoding connectivity with the hippocampus suggests that the posterior aspect is more likely to couple with category-selective cortex after learning neutral, non-rewarded objects much like the stimuli used in the present study.

      We also clarify our reasoning for LOC over mPFC. While theoretically, mPFC is thought to be a candidate region for coupling with the hippocampus during consolidation, the bulk of empirical work to date has revealed post-encoding connectivity between the hippocampus and category-selective cortex in the ventral and occipital lobes (page 6, lines 123-134).

      As for the use of the predictable > unpredictable contrast for functionally defining cortical regions, we reasoned that cortical regions that were sensitive to the temporal regularities generated by the sequences may be further involved in their offline consolidation and long-term storage (Danker & Anderson, 2010; Davachi & Danker, 2013; McClelland et al., 1995). We have added this justification to the Methods section (page 18, lines 454-460).

      3) Relatedly, multiple comparison corrections should be applied in the fMRI integration and connectivity analyses whenever the same contrast is performed on multiple regions in an exploratory manner.

      We now correct for multiple comparisons using Bonferroni correction, and this correction depends on the number of regions in which each analysis is conducted. Please see page 55, lines 1483-1490, in the Methods section for details of each analysis.
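
      As a brief illustration of the correction described here, Bonferroni correction divides the significance threshold by the number of tests, which in this case is the number of regions in which a given contrast is run. The p-values below are hypothetical and `bonferroni` is an illustrative helper.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected threshold and which p-values fall below it."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Hypothetical p-values from the same contrast run in three regions.
threshold, survives = bonferroni([0.004, 0.03, 0.20])
print(f"corrected threshold: {threshold:.4f}, survives: {survives}")
```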

      Reviewer #3 (Public Review):

      The authors of this manuscript sought to illuminate a link between a behavioral measure of integration and neural markers of cortical integration associated with systems consolidation (post-encoding connectivity, change in representational neural overlap). To that aim, participants incidentally encoded sequences of objects in the fMRI scanner. Unbeknownst to participants, the first two objects of the presented ABC triplet sequences overlapped for a given pair of sequences. This allowed the authors to probe the integration of unique C objects that were never directly presented in the same sequence, but which shared the same preceding A and B objects. They encoded one set of objects on Day 1 (remote condition), another set of objects 24 hours later (recent condition) and tested implicit and explicit memory for the learned sequences on Day 2. They additionally collected baseline and post-encoding resting-state scans. As their measure of behavioral integration, the authors examined reaction time during an Old/New judgement task for C objects depending on if they were preceded by a C object from an overlapping sequence (primed condition) versus a baseline object. They found faster reaction times for the primed objects compared to the control condition for remote but not recently learned objects, suggesting that the C objects from overlapping sequences became integrated over time. They then examined pattern similarity in a priori ROIs as a measure of neural integration and found that participants showing evidence of integration of C objects from overlapping sequences in the medial prefrontal cortex for remotely learned objects also showed a stronger implicit priming effect between those C objects over time. 
      When they examined the change in connectivity between their ROIs after encoding, they also found that connectivity between the posterior hippocampus and lateral occipital cortex correlated with larger priming effects for remotely learned objects, and that lateral occipital connectivity with the medial prefrontal cortex was related to neural integration of remote objects from overlapping sequences.

      The authors' aim of providing evidence for a relationship between behavioral and neural measures of integration with consolidation is interesting, important, and difficult to achieve given the longitudinal nature of studies required to answer this question. Strengths of this study include a creative behavioral task, and solid modelling approaches for fMRI data with careful control for several known confounds such as BOLD activation on pattern analysis results, motion, and physiological noise. The authors replicate their behavioral observations across two separate experiments, one of which included a large sample size, and found similar results that speak to the reliability of the observed behavioral phenomenon. In addition, they document several correlations between neural measures and task performance, lending functional significance to their neural findings.

      Thank you for this positive assessment of our study!

      However, this study is not without notable weaknesses that limit the strength of the manuscript. The authors report a behavioral priming effect suggestive of integration of remote but not recent memories, leading to the interpretation that the priming effect emerges with consolidation. However, they did not observe a reliable interaction between the priming condition and learning session (recent/remote) on reaction times, meaning that the priming effect for remote memories was not reliably greater than that observed for recent. In addition, the emergence of a priming effect for remote memories does not appear to be due to faster reaction times for primed targets over time (the condition of interest), but rather, slower reaction times for control items in the remote condition compared to recent. These issues limit the strength of the claim that the priming effect observed is due to C items of interest being integrated in a consolidation-dependent manner.

      We acknowledge that the lack of a day by condition interaction in the behavioral priming effect should be discussed, and we now discuss these data in a more nuanced manner. While it’s true that the priming effect emerges due to a slowing of the control items over time, this slowing is consistent with classic time-dependent effects demonstrating slower response times for more delayed memories. The fact that response times in the primed condition do not show this slowing can be interpreted as protection against the slowing that would otherwise occur. Please see page 29, lines 758-766, for this added discussion.

      Similarly, the interactions between neural variables of interest and learning session needed to strongly show a significant consolidation-related effect in the brain were sometimes tenuous. There was no reliable difference in neural representational pattern analysis fit to a model of neural integration between the short and long delays in the medial prefrontal cortex or lateral occipital cortex, nor was the posterior hippocampus-lateral occipital cortex post-encoding connectivity correlation with subsequent priming significantly different for recent and remote memories. While the relationship between integration model fit in the medial prefrontal cortex and subsequent priming (which was significantly different from that occurring for recent memories) was one of the stronger findings of the paper in favor of a consolidation-related effect on behavior, is it possible that lack of a behavioral priming effect for recent memories due to possible issues with the control condition could mask a correlation between neural and behavioral integration in the recent memory condition?

      While we acknowledge the lack of a statistically reliable interaction between neural measures and behavioral priming in many cases, we are heartened by the reliable difference in the relationship between mPFC similarity and priming over time, which was our main planned prediction. In addition to adding caveats in the discussion about the neural measures and behavioral findings in the recent condition (see our response to R1.1 and R1.4 for more details), we have added language throughout the manuscript noting the need to interpret these data with caution.

      These limitations are especially notable when one considers that priming does not classically require a period of prolonged consolidation to occur, and prominent models of systems consolidation rather pertain to explicit memory. While the authors have provided evidence that neural integration in the medial prefrontal cortex, as well as post-encoding coupling between the lateral occipital cortex and posterior hippocampus, are related to faster reaction times for primed objects of overlapping sequences compared to their control condition, more work is needed to verify that the observed findings indeed reflect consolidation-dependent integration as proposed.

      We agree that more work is needed to provide converging evidence for these novel findings. However, we wish to counter the notion that systems consolidation models are relevant only for explicit memories. Although models of systems consolidation often mention transformations from episodic to semantic memory, the critical mechanisms that define the models involve changes in the neural ensembles of a memory that is initially laid down in the hippocampus and is taught to cortex over time. This transformation of neural traces is not specific to explicit/declarative forms of memory. For example, implicit statistical learning initially depends on intact hippocampal function (Schapiro et al., 2014) and improves over consolidation (Durrant et al., 2011, 2013; Kóbor et al., 2017).

      Second, while there are many classical findings of priming during or immediately after learning, there are several instances of priming used to measure consolidation-related changes to newly learned information. For instance, priming has been used as a measure of lexical integration, demonstrating that new word learning benefits from a night of sleep (Wang et al., 2017; Gaskell et al., 2019) or a 1-week delay (Tamminen & Gaskell, 2013). The issue is not whether priming can occur immediately; it is whether priming increases with a delay.

      Finally, it is helpful to think about models of memory systems that divide memory representations not by their explicit/implicit nature, but along other important dimensions such as their neural bases, their flexibility vs rigidity, and their capacity for rapid vs slow learning (Henke, 2010). Considering this evidence, we suggest that systems consolidation models are most useful when considering how transformations in the underlying neural memory representation affect its behavioral expression, rather than focusing on the extent to which the memory representation is explicit or implicit.

      With all this said, we have added text to the discussion reminding the reader that there was no statistically significant difference in priming as a function of the delay (page 29, lines 764 - 766). However, we are encouraged by the fact that the relationship between priming and mPFC neural similarity was significantly stronger for remotely learned objects relative to recently learned ones, as this is directly in line with systems consolidation theories.

      References

      Abolghasem, Z., Teng, T. H.-T., Nexha, E., Zhu, C., Jean, C. S., Castrillon, M., Che, E., Di Nallo, E. V., & Schlichting, M. L. (2023). Learning strategy differentially impacts memory connections in children and adults. Developmental Science, 26(4), e13371. https://doi.org/10.1111/desc.13371

      Dobbins, I. G., Schnyer, D. M., Verfaellie, M., & Schacter, D. L. (2004). Cortical activity reductions during repetition priming can result from rapid response learning. Nature, 428(6980), 316–319. https://doi.org/10.1038/nature02400

      Durrant, S. J., Cairney, S. A., & Lewis, P. A. (2013). Overnight consolidation aids the transfer of statistical knowledge from the medial temporal lobe to the striatum. Cerebral Cortex, 23(10), 2467–2478. https://doi.org/10.1093/cercor/bhs244

      Durrant, S. J., Taylor, C., Cairney, S., & Lewis, P. A. (2011). Sleep-dependent consolidation of statistical learning. Neuropsychologia, 49(5), 1322–1331. https://doi.org/10.1016/j.neuropsychologia.2011.02.015

      Gaskell, M. G., Cairney, S. A., & Rodd, J. M. (2019). Contextual priming of word meanings is stabilized over sleep. Cognition, 182, 109–126. https://doi.org/10.1016/j.cognition.2018.09.007

      Henke, K. (2010). A model for memory systems based on processing modes rather than consciousness. Nature Reviews Neuroscience, 11(7), 523–532. https://doi.org/10.1038/nrn2850

      Kóbor, A., Janacsek, K., Takács, Á., & Nemeth, D. (2017). Statistical learning leads to persistent memory: Evidence for one-year consolidation. Scientific Reports, 7(1), 760. https://doi.org/10.1038/s41598-017-00807-3

      Kuhl, B. A., & Chun, M. M. (2014). Successful remembering elicits event-specific activity patterns in lateral parietal cortex. The Journal of Neuroscience, 34(23), 8051–8060. https://doi.org/10.1523/JNEUROSCI.4328-13.2014

      Richter, F. R., Chanales, A. J. H., & Kuhl, B. A. (2016). Predicting the integration of overlapping memories by decoding mnemonic processing states during learning. NeuroImage, 124, Part A, 323–335. https://doi.org/10.1016/j.neuroimage.2015.08.051

      Schapiro, A. C., Gregory, E., Landau, B., McCloskey, M., & Turk-Browne, N. B. (2014). The necessity of the medial-temporal lobe for statistical learning. Journal of Cognitive Neuroscience, 1–12. https://doi.org/10.1162/jocn_a_00578

      Schlichting, M. L., & Preston, A. R. (2014). Memory reactivation during rest supports upcoming learning of related content. Proceedings of the National Academy of Sciences, 111(44), 15845–15850. https://doi.org/10.1073/pnas.1404396111

      Smith, J. F., Alexander, G. E., Chen, K., Husain, F. T., Kim, J., Pajor, N., & Horwitz, B. (2010). Imaging systems level consolidation of novel associate memories: A longitudinal neuroimaging study. NeuroImage, 50(2), 826–836. https://doi.org/10.1016/j.neuroimage.2009.11.053

      Takashima, A., Nieuwenhuis, I. L. C., Jensen, O., Talamini, L. M., Rijpkema, M., & Fernández, G. (2009). Shift from hippocampal to neocortical centered retrieval network with consolidation. The Journal of Neuroscience, 29(32), 10087–10093. https://doi.org/10.1523/JNEUROSCI.0799-09.2009

      Tamminen, J., & Gaskell, M. G. (2013). Novel word integration in the mental lexicon: Evidence from unmasked and masked semantic priming. The Quarterly Journal of Experimental Psychology, 66(5), 1001–1025. https://doi.org/10.1080/17470218.2012.724694

      van Kesteren, M. T. R., Fernández, G., Norris, D. G., & Hermans, E. J. (2010). Persistent schema-dependent hippocampal-neocortical connectivity during memory encoding and postencoding rest in humans. Proceedings of the National Academy of Sciences, 107(16), 7550–7555. https://doi.org/10.1073/pnas.0914892107

      Wang, H.-C., Savage, G., Gaskell, M. G., Paulin, T., Robidoux, S., & Castles, A. (2017). Bedding down new words: Sleep promotes the emergence of lexical competition in visual word recognition. Psychonomic Bulletin & Review, 24(4), 1186–1193. https://doi.org/10.3758/s13423-016-1182-7

    1. Author Response

      Reviewer #1 (Public Review):

      This study used a multi-day learning paradigm combined with fMRI to reveal neural changes reflecting the learning of new (arbitrary) shape-sound associations. In the scanner, the shapes and sounds are presented separately and together, both before and after learning. When they are presented together, they can be either consistent or inconsistent with the learned associations. The analyses focus on auditory and visual cortices, as well as the object-selective cortex (LOC) and anterior temporal lobe regions (temporal pole (TP) and perirhinal cortex (PRC)). Results revealed several learning-induced changes, particularly in the anterior temporal lobe regions. First, the LOC and PRC showed a reduced bias to shapes vs sounds (presented separately) after learning. Second, the TP responded more strongly to incongruent than congruent shape-sound pairs after learning. Third, the similarity of TP activity patterns to sounds and shapes (presented separately) was increased for non-matching shape-sound comparisons after learning. Fourth, when comparing the pattern similarity of individual features to combined shape-sound stimuli, the PRC showed a reduced bias towards visual features after learning. Finally, comparing patterns to combined shape-sound stimuli before and after learning revealed a reduced (and negative) similarity for incongruent combinations in PRC. These results are all interpreted as evidence for an explicit integrative code of newly learned multimodal objects, in which the whole is different from the sum of the parts.

      The study has many strengths. It addresses a fundamental question that is of broad interest, the learning paradigm is well-designed and controlled, and the stimuli are real 3D stimuli that participants interact with. The manuscript is well written and the figures are very informative, clearly illustrating the analyses performed.

      There are also some weaknesses. The sample size (N=17) is small for detecting the subtle effects of learning. Most of the statistical analyses are not corrected for multiple comparisons (ROIs), and the specificity of the key results to specific regions is also not tested. Furthermore, the evidence for an integrative representation is rather indirect, and alternative interpretations for these results are not considered.

      We thank the reviewer for their careful reading and the positive comments on our manuscript. As suggested, we have conducted additional analyses of theoretically-motivated ROIs and have found that temporal pole and perirhinal cortex are the only regions to show the key experience-dependent transformations. We are much more cautious with respect to multiple comparisons, and have removed a series of post hoc across-ROI comparisons that were irrelevant to the key questions of the present manuscript. The revised manuscript now includes much more discussion about alternative interpretations as suggested by the reviewer (and also by the other reviewers).

      Additionally, we looked into scanning more participants, but our scanner has since had a full upgrade and the sequence used in the current study is no longer supported. However, we note that while most analyses contain 17 participants, we employed a within-subject learning design that is not typically used in fMRI experiments and that increases our power to detect an effect. This is supported by the robust effect size of the behavioural data, whereby 17 out of 18 participants showed a learning effect (Cohen’s d = 1.28), a result that was replicated in a follow-up experiment with a larger sample size.
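
      As a rough illustration of how a within-subject effect size of this kind can be computed, the sketch below calculates a paired-samples Cohen's d (the mean change divided by the standard deviation of the change, often written d_z). This is an assumption made for illustration: the response does not state which variant of Cohen's d was used, and the function name `cohens_d_paired` and the ratings are hypothetical, not data from the study.

```python
from statistics import mean, stdev


def cohens_d_paired(before, after):
    """Paired-samples effect size (d_z): mean change / SD of the change.

    Assumption: this is one common variant of Cohen's d for
    within-subject designs; the source does not say which was used.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    if len(before) < 2:
        raise ValueError("need at least two participants")
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)


# Hypothetical pre/post learning ratings for five participants
# (illustrative values only, not the study's data).
pre = [2.0, 3.0, 2.5, 3.5, 3.0]
post = [3.0, 4.5, 3.0, 5.5, 4.0]
effect = cohens_d_paired(pre, post)
print(f"d_z = {effect:.2f}")
```

      Because each participant serves as their own control, between-participant variability drops out of the denominator, which is one reason a within-subject design can detect an effect with a modest sample.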

      We address the other reviewer comments point by point below.

      Reviewer #2 (Public Review):

      Li et al. used a four-day fMRI design to investigate how unimodal feature information is combined, integrated, or abstracted to form a multimodal object representation. The experimental question is of great interest and understanding how the human brain combines featural information to form complex representations is relevant for a wide range of researchers in neuroscience, cognitive science, and AI. While most fMRI research on object representations is limited to visual information, the authors examined how visual and auditory information is integrated to form a multimodal object representation. The experimental design is elegant and clever. Three visual shapes and three auditory sounds were used as the unimodal features; the visual shapes were used to create 3D-printed objects. On Day 1, the participants interacted with the 3D objects to learn the visual features, but the objects were not paired with the auditory features, which were played separately. On Day 2, participants were scanned with fMRI while they were exposed to the unimodal visual and auditory features as well as pairs of visual-auditory cues. On Day 3, participants again interacted with the 3D objects but now each was paired with one of the three sounds that played from an internal speaker. On Day 4, participants completed the same fMRI scanning runs they completed on Day 2, except now some visual-auditory feature pairs corresponded with Congruent (learned) objects, and some with Incongruent (unlearned) objects. Using the same fMRI design on Days 2 and 4 enables a well-controlled comparison between feature- and object-evoked neural representations before and after learning. The notable results corresponded to findings in the perirhinal cortex and temporal pole. The authors report (1) that a visual bias on Day 2 for unimodal features in the perirhinal cortex was attenuated after learning on Day 4, (2) a decreased univariate response to congruent vs. incongruent visual-auditory objects in the temporal pole on Day 4, (3) decreased pattern similarity between congruent vs. incongruent pairs of visual and auditory unimodal features in the temporal pole on Day 4, (4) in the perirhinal cortex, visual unimodal features on Day 2 do not correlate with their respective visual-auditory objects on Day 4, and (5) in the perirhinal cortex, multimodal object representations across Days 2 and 4 are uncorrelated for congruent objects and anticorrelated for incongruent. The authors claim that each of these results supports the theory that multimodal objects are represented in an "explicit integrative" code separate from feature representations. While these data are valuable and the results are interesting, the authors' claims are not well supported by their findings.

      We thank the reviewer for the careful reading of our manuscript and positive comments. Overall, we now stay closer to the data when describing the results and provide our interpretation of these results in the discussion section while remaining open to alternative interpretations (as also suggested by Reviewer 1).

      (1) In the introduction, the authors contrast two theories: (a) multimodal objects are represented in the co-activation of unimodal features, and (b) multimodal objects are represented in an explicit integrative code such that the whole is different than the sum of its parts. However, the distinction between these two theories is not straightforward. An explanation of what is precisely meant by "explicit" and "integrative" would clarify the authors' theoretical stance. Perhaps we can assume that an "explicit" representation is a new representation that is created to represent a multimodal object. What is meant by "integrative" is more ambiguous: unimodal features could be integrated within a representation in a manner that preserves the decodability of the unimodal features, or alternatively the multimodal representation could be completely abstracted away from the constituent features such that the features are no longer decodable. Even if the object representation is "explicit" and distinct from the unimodal feature representations, it can in theory still contain featural information, though perhaps warped or transformed. The authors do not clearly commit to a degree of featural abstraction in their theory of "explicit integrative" multimodal object representations, which makes it difficult to assess the validity of their claims.

      Due to its ambiguity, we removed the term “explicit” and now make it clear that our central question was whether crossmodal object representations require only unimodal feature-level representations (e.g., frogs are created from only the combination of shape and sound) or whether crossmodal object representations also rely on an integrative code distinct from the unimodal features (e.g., there is something more to “frog” than its original shape and sound). We now clarify this in the revised manuscript.

      “One theoretical view from the cognitive sciences suggests that crossmodal objects are built from component unimodal features represented across distributed sensory regions.8 Under this view, when a child thinks about “frog”, the visual cortex represents the appearance of the shape of the frog whereas the auditory cortex represents the croaking sound. Alternatively, other theoretical views predict that multisensory objects are not only built from their component unimodal sensory features, but that there is also a crossmodal integrative code that is different from the sum of these parts.9,10,11,12,13 These latter views propose that anterior temporal lobe structures can act as a polymodal “hub” that combines separate features into integrated wholes.9,11,14,15” – pg. 4

      For this reason, we designed our paradigm to equate the unimodal representations, such that neural differences between the congruent and incongruent conditions provide evidence for a crossmodal integrative code different from the unimodal features (because the unimodal features are equated by default in the design).

      “Critically, our four-day learning task allowed us to isolate any neural activity associated with integrative coding in anterior temporal lobe structures that emerges with experience and differs from the neural patterns recorded at baseline. The learned and non-learned crossmodal objects were constructed from the same set of three validated shape and sound features, ensuring that factors such as familiarity with the unimodal features, subjective similarity, and feature identity were tightly controlled (Figure 2). If the mind represented crossmodal objects entirely as the reactivation of unimodal shapes and sounds (i.e., objects are constructed from their parts), then there should be no difference between the learned and non-learned objects (because they were created from the same three shapes and sounds). By contrast, if the mind represented crossmodal objects as something over and above their component features (i.e., representations for crossmodal objects rely on integrative coding that is different from the sum of their parts), then there should be behavioral and neural differences between learned and non-learned crossmodal objects (because the only difference across the objects is the learned relationship between the parts). Furthermore, this design allowed us to determine the relationship between the object representation acquired after crossmodal learning and the unimodal feature representations acquired before crossmodal learning. That is, we could examine whether learning led to abstraction of the object representations such that they no longer resembled the unimodal feature representations.” – pg. 5

      Furthermore, we agree with the reviewer that our definition and methodological design do not directly capture the structure of the integrative code. With experience, the unimodal feature representations may be completely abstracted away, warped, or changed in a nonlinear transformation. We suggest that crossmodal learning forms an integrative code that is different from the original unimodal representations in the anterior temporal lobes; however, we agree that future work is needed to more directly capture the structure of the integrative code that emerges with experience.

      “In our task, participants had to differentiate congruent and incongruent objects constructed from the same three shape and sound features (Figure 2). An efficient way to solve this task would be to form distinct object-level outputs from the overlapping unimodal feature-level inputs such that congruent objects are made to be orthogonal from the representations before learning (i.e., measured as pattern similarity equal to 0 in the perirhinal cortex; Figure 5b, 6, Supplemental Figure S5), whereas non-learned incongruent objects could be made to be dissimilar from the representations before learning (i.e., anticorrelation, measured as pattern similarity less than 0 in the perirhinal cortex; Figure 6). Because our paradigm could decouple neural responses to the learned object representations (on Day 4) from the original component unimodal features at baseline (on Day 2), these results could be taken as evidence of pattern separation in the human perirhinal cortex.11,12 However, our pattern of results could also be explained by other types of crossmodal integrative coding. For example, incongruent object representations may be less stable than congruent object representations, such that incongruent object representations are warped to a greater extent than congruent object representations (Figure 6).” – pg. 18

      “As one solution to the crossmodal binding problem, we suggest that the temporal pole and perirhinal cortex form unique crossmodal object representations that are different from the distributed features in sensory cortex (Figure 4, 5, 6, Supplemental Figure S5). However, the nature by which the integrative code is structured and formed in the temporal pole and perirhinal cortex following crossmodal experience – such as through transformations, warping, or other factors – is an open question and an important area for future investigation.” – pg. 18

      (2) After participants learned the multimodal objects, the authors report a decreased univariate response to congruent visual-auditory objects relative to incongruent objects in the temporal pole. This is claimed to support the existence of an explicit, integrative code for multimodal objects. Given the number of alternative explanations for this finding, this claim seems unwarranted. A simpler interpretation of these results is that the temporal pole is responding to the novelty of the incongruent visual-auditory objects. If there is in fact an explicit, integrative multimodal object representation in the temporal pole, it is unclear why this would manifest in a decreased univariate response.

      We thank the reviewer for identifying this issue. Our behavioural design controls unimodal feature-level novelty but allows object-level novelty to differ. Thus, neural differences between the congruent and incongruent conditions reflect sensitivity to the object-level differences between the combinations of shape and sound. However, we agree that there are multiple interpretations regarding how the integrative code is structured in the temporal pole and perirhinal cortex. We have removed the interpretation highlighted by the reviewer from the results. Instead, we now provide our preferred interpretation in the discussion, while acknowledging the other possibilities that the reviewer mentions.

      As one possibility, these results in the temporal pole may reflect “conceptual combination”. For example, “hummingbird” (a congruent pairing) may require fewer neural resources than an incongruent pairing such as “bark-frog”.

      “Furthermore, these distinct anterior temporal lobe structures may be involved with integrative coding in different ways. For example, the crossmodal object representations measured after learning were found to be related to the component unimodal feature representations measured before learning in the temporal pole but not the perirhinal cortex (Figure 5, 6, Supplemental Figure S5). Moreover, pattern similarity for congruent shape-sound pairs was lower than the pattern similarity for incongruent shape-sound pairs after crossmodal learning in the temporal pole but not the perirhinal cortex (Figure 4b, Supplemental Figure S3a). As one interpretation of this pattern of results, the temporal pole may represent new crossmodal objects by combining previously learned knowledge.8,9,10,11,13,14,15,33 Specifically, research into conceptual combination has linked the anterior temporal lobes to compound object concepts such as “hummingbird”.34,35,36 For example, participants during our task may have represented the sound-based “humming” concept and visually-based “bird” concept on Day 1, forming the crossmodal “hummingbird” concept on Day 3 (Figure 1, 2), which may recruit less activity in temporal pole than an incongruent pairing such as “barking-frog”. For these reasons, the temporal pole may form a crossmodal object code based on pre-existing knowledge, resulting in reduced neural activity (Figure 3d) and pattern similarity towards features associated with learned objects (Figure 4b).” – pg. 18

      (3) The authors ran a neural pattern similarity analysis on the unimodal features before and after multimodal object learning. They found that the similarity between visual and auditory features that composed congruent objects decreased in the temporal pole after multimodal object learning. This was interpreted to reflect an explicit integrative code for multimodal objects, though it is not clear why. First, behavioral data show that participants reported increased similarity between the visual and auditory unimodal features within congruent objects after learning, the opposite of what was found in the temporal pole. Second, it is unclear why an analysis of the unimodal features would be interpreted to reflect the nature of the multimodal object representations. Since the same features corresponded with both congruent and incongruent objects, the nature of the feature representations cannot be interpreted to reflect the nature of the object representations per se. Third, using unimodal feature representations to make claims about object representations seems to contradict the theoretical claim that explicit, integrative object representations are distinct from unimodal features. If the learned multimodal object representation exists separately from the unimodal feature representations, there is no reason why the unimodal features themselves would be influenced by the formation of the object representation. Instead, these results seem to more strongly support the theory that multimodal object learning results in a transformation or warping of feature space.

      We apologize for the lack of clarity. We have now overhauled this aspect of our manuscript in an attempt to better highlight key aspects of our experimental design. In particular, because the unimodal features composing the congruent and incongruent objects were equated, neural differences between these conditions would provide evidence for an experience-dependent crossmodal integrative code that is different from its component unimodal features.

      Related to the second and third points, we were looking at the extent to which the original unimodal representations change with crossmodal learning. Before crossmodal learning, we found that the perirhinal cortex tracked the similarity between the individual visual shape features and the crossmodal objects that were composed of those visual shapes – however, there was no evidence that perirhinal cortex was tracking the unimodal sound features on those crossmodal objects. After crossmodal learning, we see that this visual shape bias in perirhinal cortex was no longer present – that is, the representation in perirhinal cortex started to look less like the visual features that comprise the objects. Thus, crossmodal learning transformed the perirhinal representations so that they were no longer predominantly grounded in a single visual modality, which may be a mechanism by which object concepts gain their abstraction. We have now tried to be clearer about this interpretation throughout the paper.

      Notably, we suggest that experience may change both the crossmodal object representations, as well as the unimodal feature representations. For example, we have previously shown that unimodal visual features are influenced by experience in parallel with the representation of the conjunction (e.g., Liang et al., 2020; Cerebral Cortex). Nevertheless, we remain open to the myriad possible structures of the integrative code that might emerge with experience.

      We now clarify these points throughout the manuscript. For example:

      “We then examined whether the original representations would change after participants learned how the features were paired together to make specific crossmodal objects, conducting the same analysis described above after crossmodal learning had taken place (Figure 5b). With this analysis, we sought to measure the relationship between the representation for the learned crossmodal object and the original baseline representation for the unimodal features. More specifically, the voxel-wise activity for unimodal feature runs before crossmodal learning was correlated to the voxel-wise activity for crossmodal object runs after crossmodal learning (Figure 5b). Another linear mixed model which included modality as a fixed factor within each ROI revealed that the perirhinal cortex was no longer biased towards visual shape after crossmodal learning (F(1, 32) = 0.12, p = 0.73), whereas the temporal pole, LOC, V1, and A1 remained biased towards either visual shape or sound (F(1, 30-32) between 16.20 and 73.42, all p < 0.001, η² between 0.35 and 0.70).” – pg. 14
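
      The correlation step described in this passage can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the manuscript: the function names, the dictionary layout, and the four-voxel toy patterns are all hypothetical, and a real analysis would operate on estimated response patterns within each ROI.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length voxel patterns."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("patterns must have equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("patterns must not be constant")
    return cov / sqrt(var_x * var_y)


def feature_to_object_similarity(feature_patterns, object_patterns):
    """Mean similarity of each object to its own shape and its own sound.

    feature_patterns: {feature_name: voxel vector} from unimodal runs
    object_patterns:  {(shape_name, sound_name): voxel vector} from
                      crossmodal object runs (hypothetical layout)
    """
    visual, sound = [], []
    for (shape_name, sound_name), obj in object_patterns.items():
        visual.append(pearson(feature_patterns[shape_name], obj))
        sound.append(pearson(feature_patterns[sound_name], obj))
    return mean(visual), mean(sound)


# Toy four-voxel patterns (illustrative values only).
features = {"shape_A": [1.0, 2.0, 3.0, 4.0], "sound_X": [4.0, 1.0, 3.0, 2.0]}
objects = {("shape_A", "sound_X"): [2.0, 4.0, 6.0, 8.0]}
visual_match, sound_match = feature_to_object_similarity(features, objects)
print(visual_match, sound_match)
```

      In this toy case the object pattern tracks the shape pattern far more closely than the sound pattern, which is the kind of modality bias the quoted analysis tests for in each ROI.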

      “To investigate this effect in perirhinal cortex more specifically, we conducted a linear mixed model to directly compare the change in the visual bias of perirhinal representations from before crossmodal learning to after crossmodal learning (green regions in Figure 5a vs. 5b). Specifically, the linear mixed model included learning day (before vs. after crossmodal learning) and modality (visual feature match to crossmodal object vs. sound feature match to crossmodal object). Results revealed a significant interaction between learning day and modality in the perirhinal cortex (F(1, 775) = 5.56, p = 0.019, η² = 0.071), meaning that the baseline visual shape bias observed in perirhinal cortex (green region of Figure 5a) was significantly attenuated with experience (green region of Figure 5b). After crossmodal learning, a given shape no longer invoked significant pattern similarity between objects that had the same shape but differed in terms of what they sounded like. Taken together, these results suggest that prior to learning the crossmodal objects, the perirhinal cortex had a default bias toward representing the visual shape information and was not representing sound information of the crossmodal objects. After crossmodal learning, however, the visual shape bias in perirhinal cortex was no longer present. That is, with crossmodal learning, the representations within perirhinal cortex started to look less like the visual features that comprised the crossmodal objects, providing evidence that the perirhinal representations were no longer predominantly grounded in the visual modality.” – pg. 13
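
      The logic of a learning day by modality interaction can be illustrated with a difference of differences. The sketch below is a deliberately simplified stand-in, assuming one mean similarity value per participant and cell, and it is not the linear mixed model reported in the manuscript; the function names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def interaction_contrast(before_visual, before_sound, after_visual, after_sound):
    """Per-participant change in visual bias: (V - S) before minus (V - S) after.

    Positive values mean the visual bias shrank after learning.
    Simplified stand-in for the day x modality interaction term.
    """
    cells = (before_visual, before_sound, after_visual, after_sound)
    if len({len(c) for c in cells}) != 1:
        raise ValueError("all cells need one value per participant")
    return [
        (bv - bs) - (av - as_)
        for bv, bs, av, as_ in zip(*cells)
    ]


def one_sample_t(values):
    """t statistic testing whether the mean contrast differs from zero."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two participants")
    return mean(values) / (stdev(values) / sqrt(n))


# Hypothetical pattern-similarity values for four participants.
before_visual = [0.30, 0.25, 0.35, 0.20]
before_sound = [0.05, 0.00, 0.10, 0.05]
after_visual = [0.10, 0.05, 0.12, 0.08]
after_sound = [0.08, 0.06, 0.10, 0.04]

contrast = interaction_contrast(before_visual, before_sound, after_visual, after_sound)
t_value = one_sample_t(contrast)
print(contrast, t_value)
```

      A mixed model additionally accounts for trial-level observations and participant-level random effects, so this contrast should be read only as a way to see what the interaction term is testing.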

      “Importantly, the initial visual shape bias observed in the perirhinal cortex was attenuated by experience (Figure 5, Supplemental Figure S5), suggesting that the perirhinal representations had become abstracted and were no longer predominantly grounded in a single modality after crossmodal learning. One possibility may be that the perirhinal cortex is by default visually driven as an extension to the ventral visual stream,10,11,12 but can act as a polymodal “hub” region for additional crossmodal input following learning.” – pg. 19

      (4) The most compelling evidence the authors provide for their theoretical claims is the finding that, in the perirhinal cortex, the unimodal feature representations on Day 2 do not correlate with the multimodal objects they comprise on Day 4. This suggests that the learned multimodal object representations are not combinations of their unimodal features. If unimodal features are not decodable within the congruent object representations, this would support the authors' explicit integrative hypothesis. However, the analyses provided do not go all the way in convincing the reader of this claim. First, the analyses reported do not differentiate between congruent and incongruent objects. If this result in the perirhinal cortex reflects the formation of new multimodal object representations, it should only be true for congruent objects but not incongruent objects. Since the analyses combine congruent and incongruent objects it is not possible to know whether this was the case. Second, just because feature representations on Day 2 do not correlate with multimodal object patterns on Day 4 does not mean that the object representations on Day 4 do not contain featural information. This could be directly tested by correlating feature representations on Day 4 with congruent vs. incongruent object representations on Day 4. It could be that representations in the perirhinal cortex are not stable over time and all representations, including unimodal feature representations, shift between sessions, which could explain these results yet not entail the existence of abstracted object representations.

      We thank the reviewer for this suggestion and have conducted the two additional analyses. Specifically, we split the congruent and incongruent conditions and also investigated correlations between unimodal representations on Day 4 with crossmodal object representations on Day 4. There was no significant interaction between modality and congruency in any ROI across or within learning days. One possible explanation for these findings is that both congruent and incongruent crossmodal objects are represented differently from their underlying unimodal features, and all of these representations can transform with experience.

      However, the new analyses also revealed that perirhinal cortex was the only region without a modality-specific bias after crossmodal learning (e.g., Day 4 Unimodal Feature runs x Day 4 Crossmodal Object runs; now shown in Supplemental Figure S5). Overall, these results are consistent with the notion of a crossmodal integrative code in perirhinal cortex that has changed with experience and is different from the component unimodal features. Nevertheless, we explore alternative interpretations for how the crossmodal code emerges with experience in the discussion.

      “To examine whether these results differed by congruency (i.e., whether any modality-specific biases differed as a function of whether the object was congruent or incongruent), we conducted exploratory linear mixed models for each of the five a priori ROIs across learning days. More specifically, we correlated: 1) the voxel-wise activity for Unimodal Feature Runs before crossmodal learning to the voxel-wise activity for Crossmodal Object Runs before crossmodal learning (Day 2 vs. Day 2), 2) the voxel-wise activity for Unimodal Feature Runs before crossmodal learning to the voxel-wise activity for Crossmodal Object Runs after crossmodal learning (Day 2 vs Day 4), and 3) the voxel-wise activity for Unimodal Feature Runs after crossmodal learning to the voxel-wise activity for Crossmodal Object Runs after crossmodal learning (Day 4 vs Day 4). For each of the three analyses described, we then conducted separate linear mixed models which included modality (visual feature match to crossmodal object vs. sound feature match to crossmodal object) and congruency (congruent vs. incongruent)….There was no significant relationship between modality and congruency in any ROI between Day 2 and Day 2 (F1,346-368 between 0.00 and 1.06, p between 0.30 and 0.99), between Day 2 and Day 4 (F1,346-368 between 0.021 and 0.91, p between 0.34 and 0.89), or between Day 4 and Day 4 (F1,346-368 between 0.01 and 3.05, p between 0.082 and 0.93). However, exploratory analyses revealed that perirhinal cortex was the only region without a modality-specific bias and where the unimodal feature runs were not significantly correlated to the crossmodal object runs after crossmodal learning (Supplemental Figure S5).” – pg. 14
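
      As an illustration of the feature-to-object comparison described in the passage above, the following is a minimal sketch rather than the analysis pipeline used in the study. The voxel values, the feature and object names, and the helper `match_similarity` are hypothetical and chosen only so the arithmetic is easy to follow. The sketch correlates each unimodal pattern with the crossmodal objects that contain that feature, then averages the Fisher z-transformed correlations separately for shape matches and sound matches. The linear mixed models are not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length voxel patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def match_similarity(feature_patterns, object_patterns, modality_index):
    """Mean Fisher-z similarity between each unimodal feature and the
    crossmodal objects that contain it.

    modality_index is the position of the feature in the object key
    (shape, sound): 0 for shapes, 1 for sounds. Hypothetical helper.
    """
    z_values = []
    for feature_name, feature_vox in feature_patterns.items():
        for object_key, object_vox in object_patterns.items():
            if object_key[modality_index] == feature_name:
                z_values.append(math.atanh(pearson_r(feature_vox, object_vox)))
    return sum(z_values) / len(z_values)


# Toy voxel patterns (five voxels each); values are invented for illustration.
shape_patterns = {
    "shape1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "shape2": [5.0, 3.0, 4.0, 1.0, 2.0],
}
sound_patterns = {
    "sound1": [2.0, 5.0, 1.0, 4.0, 3.0],
    "sound2": [3.0, 1.0, 5.0, 2.0, 4.0],
}
# Crossmodal object patterns keyed by (shape, sound).
object_patterns = {
    ("shape1", "sound1"): [1.0, 2.0, 3.0, 5.0, 4.0],
    ("shape2", "sound2"): [5.0, 4.0, 3.0, 1.0, 2.0],
}

visual_z = match_similarity(shape_patterns, object_patterns, modality_index=0)
sound_z = match_similarity(sound_patterns, object_patterns, modality_index=1)
```

      In this toy example the object patterns were built to resemble the shape patterns, so `visual_z` exceeds `sound_z`, which is the kind of modality-specific bias that the mixed models test for in each ROI.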

      “Taken together, the overall pattern of results suggests that representations of the crossmodal objects in perirhinal cortex were heavily influenced by their consistent visual features before crossmodal learning. However, the crossmodal object representations were no longer influenced by the component visual features after crossmodal learning (Figure 5, Supplemental Figure S5). Additional exploratory analyses did not find evidence of experience-dependent changes in the hippocampus or inferior parietal lobes (Supplemental Figure S4c-e).” – pg. 14

      “The voxel-wise matrix for Unimodal Feature runs on Day 4 was correlated to the voxel-wise matrix for Crossmodal Object runs on Day 4 (see Figure 5 in the main text for an example). We compared the average pattern similarity (z-transformed Pearson correlation) between shape (blue) and sound (orange) features specifically after crossmodal learning. Consistent with Figure 5b, perirhinal cortex was the only region without a modality-specific bias. Furthermore, perirhinal cortex was the only region where the representations of both the visual and sound features were not significantly correlated to the crossmodal objects. By contrast, every other region maintained a modality-specific bias for either the visual or sound features. These results suggest that perirhinal cortex representations were transformed with experience, such that the initial visual shape representations (Figure 5a) were no longer grounded in a single modality after crossmodal learning. Furthermore, these results suggest that crossmodal learning formed an integrative code different from the unimodal features in perirhinal cortex, as the visual and sound features were not significantly correlated with the crossmodal objects. * p < 0.05, ** p < 0.01, *** p < 0.001. Horizontal lines within brain regions indicate a significant main effect of modality. Vertical asterisks denote pattern similarity comparisons relative to 0.” – Supplemental Figure S5

      “We found that the temporal pole and perirhinal cortex – two anterior temporal lobe structures – came to represent new crossmodal object concepts with learning, such that the acquired crossmodal object representations were different from the representation of the constituent unimodal features (Figure 5, 6). Intriguingly, the perirhinal cortex was by default biased towards visual shape, but this initial visual bias was attenuated with experience (Figure 3c, 5, Supplemental Figure S5). Within the perirhinal cortex, the acquired crossmodal object concepts (measured after crossmodal learning) became less similar to their original component unimodal features (measured at baseline before crossmodal learning); Figure 5, 6, Supplemental Figure S5. This is consistent with the idea that object representations in perirhinal cortex integrate the component sensory features into a whole that is different from the sum of the component parts, which might be a mechanism by which object concepts obtain their abstraction…. As one solution to the crossmodal binding problem, we suggest that the temporal pole and perirhinal cortex form unique crossmodal object representations that are different from the distributed features in sensory cortex (Figure 4, 5, 6, Supplemental Figure S5). However, the nature by which the integrative code is structured and formed in the temporal pole and perirhinal cortex following crossmodal experience – such as through transformations, warping, or other factors – is an open question and an important area for future investigation.” – pg. 18

      In sum, the authors have collected a fantastic dataset that has the potential to answer questions about the formation of multimodal object representations in the brain. A more precise delineation of different theoretical accounts and additional analyses are needed to provide convincing support for the theory that “explicit integrative” multimodal object representations are formed during learning.

      We thank the reviewer for the positive comments and helpful feedback. We hope that our changes to our wording and clarifications to our methodology now more clearly support the central goal of our study: to find evidence of crossmodal integrative coding different from the original unimodal feature parts in anterior temporal lobe structures. We furthermore agree that future research is needed to delineate the structure of the integrative code that emerges with experience in the anterior temporal lobes.

      Reviewer #3 (Public Review):

      This paper uses behavior and functional brain imaging to understand how neural and cognitive representations of visual and auditory stimuli change as participants learn associations among them. Prior work suggests that areas in the anterior temporal (ATL) and perirhinal cortex play an important role in learning/representing cross-modal associations, but the hypothesis has not been directly tested by evaluating behavior and functional imaging before and after learning cross-modal associations. The results show that such learning changes both the perceived similarities amongst stimuli and the neural responses generated within ATL and perirhinal regions, providing novel support for the view that cross-modal learning leads to a representational change in these regions.

      This work has several strengths. It tackles an important question for current theories of object representation in the mind and brain in a novel and quite direct fashion, by studying how these representations change with cross-modal learning. As the authors note, little work has directly assessed representational change in ATL following such learning, despite the widespread view that ATL is critical for such representation. Indeed, such direct assessment poses several methodological challenges, which the authors have met with an ingenious experimental design. The experiment allows the authors to maintain tight control over both the familiarity and the perceived similarities amongst the shapes and sounds that comprise their stimuli so that the observed changes across sessions must reflect learned cross-modal associations among these. I especially appreciated the creation of physical objects that participants can explore and the approach to learning in which shapes and sounds are initially experienced independently and later in an associated fashion. In using multi-echo MRI to resolve signals in ventral ATL, the authors have minimized a key challenge facing much work in this area (namely the poor SNR yielded by standard acquisition sequences in ventral ATL). The use of both univariate and multivariate techniques was well-motivated and helpful in testing the central questions. The manuscript is, for the most part, clearly written, and nicely connects the current work to important questions in two literatures, specifically (1) the hypothesized role of the perirhinal cortex in representing/learning complex conjunctions of features and (2) the tension between purely embodied approaches to semantic representation vs the view that ATL regions encode important amodal/crossmodal structure.

      There are some places in the manuscript that would benefit from further explanation and methodological detail. I also had some questions about the results themselves and what they signify about the roles of ATL and the perirhinal cortex in object representation.

      We thank the reviewer for their positive feedback and address the comments in the below point-by-point responses.

      (A) I found the terms "features" and "objects" to be confusing as used throughout the manuscript, and sometimes inconsistent. I think by "features" the authors mean the shape and sound stimuli in their experiment. I think by "object" the authors usually mean the conjunction of a shape with a sound---for instance, when a shape and sound are simultaneously experienced in the scanner, or when the participant presses a button on the shape and hears the sound. The confusion comes partly because shapes are often described as being composed of features, not features in and of themselves. (The same is sometimes true of sounds). So when reading "features" I kept thinking the paper referred to the elements that went together to comprise a shape. It also comes from ambiguous use of the word object, which might refer to (a) the 3D-printed item that people play with, which is an object, or (b) a visually-presented shape (for instance, the localizer involved comparing an "object" to a "phase-scrambled" stimulus---here I assume "object" refers to an intact visual stimulus and not the joint presentation of visual and auditory items). I think the design, stimuli, and results would be easier for a naive reader to follow if the authors used the terms "unimodal representation" to refer to cases where only visual or auditory input is presented, and "cross-modal" or "conjoint" representation when both are present.

      We thank the reviewer for this suggestion and agree. We have replaced the terms “features” and “objects” with “unimodal” and “crossmodal” in the title, text, and figures throughout the manuscript for consistency (i.e., “crossmodal binding problem”). To simplify the terminology, we have also removed the localizer results.

      (B) There are a few places where I wasn't sure what exactly was done, and where the methods lacked sufficient detail for another scientist to replicate what was done. Specifically:

      (1) The behavioral study assessing perceptual similarity between visual and auditory stimuli was unclear. The procedure, stimuli, number of trials, etc, should be explained in sufficient detail in methods to allow replication. The results of the study should also minimally be reported in the supplementary information. Without an understanding of how these studies were carried out, it was very difficult to understand the observed pattern of behavioral change. For instance, I initially thought separate behavioral blocks were carried out for visual versus auditory stimuli, each presented in isolation; however, the effects contrast congruent and incongruent stimuli, which suggests these decisions must have been made for the conjoint presentation of both modalities. I'm still not sure how this worked. Additionally, the manuscript makes a brief mention that similarity judgments were made in the context of "all stimuli," but I didn't understand what that meant. Similarity ratings are hugely sensitive to the contrast set with which items appear, so clarity on these points is pretty important. A strength of the design is the contention that shape and sound stimuli were psychophysically matched, so it is important to show the reader how this was done and what the results were.

      We agree and apologize for the lack of sufficient detail in the original manuscript. We now include much more detail about the similarity rating task. The methodology and results of the behavioral rating experiments are now shown in Supplemental Figure S1. In Figure S1a, the similarity ratings are visualized on a multidimensional scaling plot. The triangular geometry for shape (blue) and sound (red) indicates that the subjective similarity was equated within each unimodal feature across individual participants. Quantitatively, there was no difference in similarity between the congruent and incongruent pairings in Figure S1b and Figure S1c prior to crossmodal learning. In addition to providing more information on these methods in the Supplemental Information, we also now provide a more detailed description of the task in the manuscript itself. For convenience, we reproduce these sections below.

      “Pairwise Similarity Task. Using the same task as the stimulus validation procedure (Supplemental Figure S1a), participants provided similarity ratings for all combinations of the 3 validated shapes and 3 validated sounds (each of the six features was rated in the context of every other feature in the set, with 4 repeats of the same feature, for a total of 72 trials). More specifically, three stimuli were displayed on each trial, with one at the top and two at the bottom of the screen in the same procedure as we have used previously27. The 3D shapes were visually displayed as a photo, whereas sounds were displayed on screen in a box that could be played over headphones when clicked with the mouse. The participant made an initial judgment by selecting the more similar stimulus on the bottom relative to the stimulus on the top. Afterwards, the participant made a similarity rating between each bottom stimulus and the top stimulus, from 0 (no similarity) to 5 (identical). This procedure ensured that ratings were made relative to all other stimuli in the set.” – pg. 28

      “Pairwise similarity task and results. In the initial stimulus validation experiment, participants provided pairwise ratings for 5 sounds and 3 shapes. The shapes, which had been selected from a well-characterized perceptually uniform stimulus space,27 were equated in their subjective similarity, and the pairwise ratings followed the same procedure as described in ref 27. Based on this initial experiment, we then selected the 3 sounds from the initial 5 that were most closely equated in their subjective similarity. (a) 3D-printed shapes were displayed as images, whereas sounds were displayed in a box that could be played when clicked by the participant. Ratings were averaged to produce a similarity matrix for each participant, and then averaged to produce a group-level similarity matrix. Shown as triangular representational geometries recovered from multidimensional scaling in the above, shapes (blue) and sounds (orange) were approximately equated in their subjective similarity. These features were then used in the four-day crossmodal learning task. (b) Behavioral results from the four-day crossmodal learning task paired with multi-echo fMRI described in the main text. Before crossmodal learning, there was no difference in similarity between shape and sound features associated with congruent objects compared to incongruent objects – indicating that similarity was controlled at the unimodal feature-level. After crossmodal learning, we observed a robust shift in the magnitude of similarity. The shape and sound features associated with congruent objects were now significantly more similar than the same shape and sound features associated with incongruent objects (p < 0.001), evidence that crossmodal learning changed how participants experienced the unimodal features (observed in 17/18 participants). (c) We replicated this learning-related shift in pattern similarity with a larger sample size (n = 44; observed in 38/44 participants). *** denotes p < 0.001. Horizontal lines denote the comparison of congruent vs. incongruent conditions.” – Supplemental Figure S1
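
      The two averaging steps mentioned in the caption (ratings to a participant-level matrix, then participant matrices to a group-level matrix) could be sketched as follows. This is only an illustrative sketch under stated assumptions: the trial format `(item_a, item_b, rating)`, the function names, and the toy ratings are hypothetical, the diagonal is assumed to take the maximum rating of 5, and every pair is assumed to have been rated at least once. The multidimensional scaling step is not shown.

```python
from collections import defaultdict

MAX_RATING = 5.0  # the rating scale runs from 0 (no similarity) to 5 (identical)


def participant_matrix(trials, items):
    """Average repeated ratings per unordered pair into a symmetric matrix.

    trials: iterable of (item_a, item_b, rating) tuples (hypothetical format).
    Assumes item_a != item_b and that every pair is rated at least once.
    """
    by_pair = defaultdict(list)
    for item_a, item_b, rating in trials:
        by_pair[frozenset((item_a, item_b))].append(rating)

    index = {name: i for i, name in enumerate(items)}
    n = len(items)
    # Assumption: an item is treated as identical to itself on the diagonal.
    matrix = [[MAX_RATING if i == j else 0.0 for j in range(n)] for i in range(n)]
    for pair, values in by_pair.items():
        a, b = tuple(pair)
        mean_rating = sum(values) / len(values)
        matrix[index[a]][index[b]] = mean_rating
        matrix[index[b]][index[a]] = mean_rating
    return matrix


def group_matrix(matrices):
    """Element-wise mean of the participant-level matrices."""
    n = len(matrices[0])
    count = len(matrices)
    return [[sum(m[i][j] for m in matrices) / count for j in range(n)]
            for i in range(n)]


items = ["shape1", "shape2", "shape3"]
# Invented ratings for two participants; the first pair is rated twice.
p1 = participant_matrix(
    [("shape1", "shape2", 2), ("shape2", "shape1", 4),
     ("shape1", "shape3", 3), ("shape2", "shape3", 3)],
    items,
)
p2 = participant_matrix(
    [("shape1", "shape2", 1), ("shape1", "shape3", 1), ("shape2", "shape3", 1)],
    items,
)
group = group_matrix([p1, p2])
```

      In this toy case all three off-diagonal entries of the group matrix are equal, which corresponds to the equilateral triangular geometry that equated stimuli would produce after multidimensional scaling.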

      (2) The experiences through which participants learned/experienced the shapes and sounds were unclear. The methods mention that they had one minute to explore/palpate each shape and that these experiences were interleaved with other tasks, but it is not clear what the other tasks were, how many such exploration experiences occurred, or how long the total learning time was. The manuscript also mentions that participants learn the shape-sound associations with 100% accuracy but it isn't clear how that was assessed. These details are important partly b/c it seems like very minimal experience to change neural representations in the cortex.

      We apologize for the lack of detail and agree with the reviewer’s suggestions – we now include much more information in the methods section. Each behavioral day required about 1 hour of total time to complete, and indeed, participants rapidly learned their associations with minimal experience. For example:

      “Behavioral Tasks. On each behavioral day (Day 1 and Day 3; Figure 2), participants completed the following tasks, in this order: Exploration Phase, one Unimodal Feature 1-back run (26 trials), Exploration Phase, one Crossmodal 1-back run (26 trials), Exploration Phase, Pairwise Similarity Task (24 trials), Exploration Phase, Pairwise Similarity Task (24 trials), Exploration Phase, Pairwise Similarity Task (24 trials), and finally, Exploration Phase. To verify learning on Day 3, participants additionally completed a Learning Verification Task at the end of the session.” – pg. 27

      “The overall procedure ensured that participants extensively explored the unimodal features on Day 1 and the crossmodal objects on Day 3. The Unimodal Feature and the Crossmodal Object 1-back runs administered on Day 1 and Day 3 served as practice for the neuroimaging sessions on Day 2 and Day 4, during which these 1-back tasks were completed. Each behavioral session required less than 1 hour of total time to complete.” – pg. 27

      “Learning Verification Task (Day 3 only). As the final task on Day 3, participants completed a task to ensure that participants successfully formed their crossmodal pairing. All three shapes and sounds were randomly displayed in 6 boxes on a display. Photos of the 3D shapes were shown, and sounds were played by clicking the box with the mouse cursor. The participant was cued with either a shape or sound, and then selected the corresponding paired feature. At the end of Day 3, we found that all participants reached 100% accuracy on this task (10 trials).” – pg. 29

      (3) I didn't understand the similarity metric used in the multivariate imaging analyses. The manuscript mentions Z-scored Pearson's r, but I didn't know if this meant (a) many Pearson coefficients were computed and these were then Z-scored, so that 0 indicates a value equal to the mean Pearson correlation and 1 is equal to the standard deviation of the correlations, or (b) whether a Fisher Z transform was applied to each r (so that 0 means r was also around 0). From the interpretation of some results, I think the latter is the approach taken, but in general, it would be helpful to see, in Methods or Supplementary information, exactly how similarity scores were computed, and why that approach was adopted. This is particularly important since it is hard to understand the direction of some key effects.

      The reviewer is correct that the Fisher Z transform was applied to each individual r before averaging the correlations. This approach is generally recommended when averaging correlations (see Corey, Dunlap, & Burke, 1998). We are now clearer on this point in the manuscript:

      “The z-transformed Pearson’s correlation coefficient was used as the distance metric for all pattern similarity analyses. More specifically, each individual Pearson correlation was Fisher z-transformed and then averaged (see 61).” – pg. 32
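
      For readers who want the computation spelled out, the following is a minimal sketch of the "transform each r, then average" step, written with the Python standard library only. It is not the analysis code used in the study; the function names and the example correlation values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r):
    """Fisher z-transform: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r)).

    Undefined for r = +1 or r = -1 (math.atanh raises ValueError there).
    """
    return math.atanh(r)


def mean_fisher_z(correlations):
    """Transform each correlation individually, then average the z values."""
    z_values = [fisher_z(r) for r in correlations]
    return sum(z_values) / len(z_values)


# Hypothetical correlations from several run pairs (illustrative values only).
example_rs = [0.10, -0.05, 0.30, 0.00]
mean_z = mean_fisher_z(example_rs)
```

      Because `fisher_z(0.0)` is 0, an averaged value near 0 under this convention means the underlying correlations were near 0. This differs from z-scoring a set of coefficients, where 0 would instead mark the mean correlation of that set.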

      (C) From Figure 3D, the temporal pole mask appears to exclude the anterior fusiform cortex (or the ventral surface of the ATL generally). If so, this is a shame, since that appears to be the locus most important to cross-modal integration in the "hub and spokes" model of semantic representation in the brain. The observation in the paper that the perirhinal cortex seems initially biased toward visual structure while more superior ATL is biased toward auditory structure appears generally consistent with the "graded hub" view expressed, for instance, in our group's 2017 review paper (Lambon Ralph et al., Nature Reviews Neuroscience). The balance of visual- versus auditory-sensitivity in that work appears balanced in the anterior fusiform, just a little lateral to the anterior perirhinal cortex. It would be helpful to know if the same pattern is observed for this area specifically in the current dataset.

      We thank the reviewer for this suggestion. After close inspection of Lambon Ralph et al. (2017), we believe that our perirhinal cortex mask appears to be overlapping with the ventral ATL/anterior fusiform region that the reviewer mentions. See Author response image 1 for a visual comparison:

      Author response image 1.

      The top four figures are sampled from Lambon Ralph et al (2017), whereas the bottom two figures visualize our perirhinal cortex mask (white) and temporal pole mask (dark green) relative to the fusiform cortex. The ROIs visualized were defined from the Harvard-Oxford atlas.

      We now mention this area of overlap in our manuscript and link it to the hub and spokes model:

      “Notably, our perirhinal cortex mask overlaps with a key region of the ventral anterior temporal lobe thought to be the central locus of crossmodal integration in the “hub and spokes” model of semantic representations.9,50” – pg. 20

      (D) While most effects seem robust from the information presented, I'm not so sure about the analysis of the perirhinal cortex shown in Figure 5. This compares (I think) the neural similarity evoked by a unimodal stimulus ("feature") to that evoked by the same stimulus when paired with its congruent stimulus in the other modality ("object"). These similarities show an interaction with modality prior to cross-modal association, but no interaction afterward, leading the authors to suggest that the perirhinal cortex has become less biased toward visual structure following learning. But the plots in Figures 4a and b are shown against different scales on the y-axes, obscuring the fact that all of the similarities are smaller in the after-learning comparison. Since the perirhinal interaction was already the smallest effect in the pre-learning analysis, it isn't really surprising that it drops below significance when all the effects diminish in the second comparison. A more rigorous test would assess the reliability of the interaction of comparison (pre- or post-learning) with modality. The possibility that perirhinal representations become less "visual" following cross-modal learning is potentially important so a post hoc contrast of that kind would be helpful.

      We apologize for the lack of clarity. We conducted a linear mixed model to assess the interaction between modality and crossmodal learning day (before and after crossmodal learning) in the perirhinal cortex as described by the reviewer. The critical interaction was significant, which is now clarified in the text as well as in the rescaled figure plots.

      “To investigate this effect in perirhinal cortex more specifically, we conducted a linear mixed model to directly compare the change in the visual bias of perirhinal representations from before crossmodal learning to after crossmodal learning (green regions in Figure 5a vs. 5b). Specifically, the linear mixed model included learning day (before vs. after crossmodal learning) and modality (visual feature match to crossmodal object vs. sound feature match to crossmodal object). Results revealed a significant interaction between learning day and modality in the perirhinal cortex (F1,775 = 5.56, p = 0.019, η2 = 0.071), meaning that the baseline visual shape bias observed in perirhinal cortex (green region of Figure 5a) was significantly attenuated with experience (green region of Figure 5b). After crossmodal learning, a given shape no longer invoked significant pattern similarity between objects that had the same shape but differed in terms of what they sounded like. Taken together, these results suggest that prior to learning the crossmodal objects, the perirhinal cortex had a default bias toward representing the visual shape information and was not representing sound information of the crossmodal objects. After crossmodal learning, however, the visual shape bias in perirhinal cortex was no longer present. That is, with crossmodal learning, the representations within perirhinal cortex started to look less like the visual features that comprised the crossmodal objects, providing evidence that the perirhinal representations were no longer predominantly grounded in the visual modality.” – pg. 13

      We note that not all effects drop in Figure 5b (even in regions with a similar numerical pattern similarity to PRC, like the hippocampus – also see Supplemental Figure S5 for a comparison for patterns only on Day 4), suggesting that the change in visual bias in PRC is not simply due to noise.

      “Importantly, the change in pattern similarity in the perirhinal cortex across learning days (Figure 5) is unlikely to be driven by noise, poor alignment of patterns across sessions, or generally reduced responses. Other regions with numerically similar pattern similarity to perirhinal cortex did not change across learning days (e.g., visual features x crossmodal objects in A1 in Figure 5; the exploratory ROI hippocampus with numerically similar pattern similarity to perirhinal cortex also did not change in Supplemental Figure S4c-d).” – pg. 14

      (E) Is there a reason the authors did not look at representation and change in the hippocampus? As a rapid-learning, widely-connected feature-binding mechanism, and given the fairly minimal amount of learning experience, it seems like the hippocampus would be a key area of potential import for the cross-modal association. It also looks as though the hippocampus is implicated in the localizer scan (Figure 3c).

      We thank the reviewer for this suggestion and now include additional analyses for the hippocampus. We found no evidence of crossmodal integrative coding different from the unimodal features. Rather, the hippocampus seems to represent the convergence of unimodal features, as evidenced by …[can you give some pithy description for what is meant by “convergence” vs “integration”?]. We provide these results in the Supplemental Information and describe them in the main text:

      “Analyses for the hippocampus (HPC) and inferior parietal lobe (IPL). (a) In the visual vs. auditory univariate analysis, there was no visual or sound bias in HPC, but there was a bias towards sounds that increased numerically after crossmodal learning in the IPL. (b) Pattern similarity analyses between unimodal features associated with congruent objects and incongruent objects. Similar to Supplemental Figure S3, there was no main effect of congruency in either region. (c) When we looked at the pattern similarity between Unimodal Feature runs on Day 2 to Crossmodal Object runs on Day 2, we found that there was significant pattern similarity when there was a match between the unimodal feature and the crossmodal object (e.g., pattern similarity > 0). This pattern of results held when (d) correlating the Unimodal Feature runs on Day 2 to Crossmodal Object runs on Day 4, and (e) correlating the Unimodal Feature runs on Day 4 to Crossmodal Object runs on Day 4. Finally, (f) there was no significant pattern similarity between Crossmodal Object runs before learning correlated to Crossmodal Object after learning in HPC, but there was significant pattern similarity in IPL (p < 0.001). Taken together, these results suggest that both HPC and IPL are sensitive to visual and sound content, as the (c, d, e) unimodal feature-level representations were correlated to the crossmodal object representations irrespective of learning day. However, there was no difference between congruent and incongruent pairings in any analysis, suggesting that HPC and IPL did not represent crossmodal objects differently from the component unimodal features. 
For these reasons, HPC and IPL may represent the convergence of unimodal feature representations (i.e., because HPC and IPL were sensitive to both visual and sound features), but our results do not seem to support these regions in forming crossmodal integrative coding distinct from the unimodal features (i.e., because representations in HPC and IPL did not differentiate the congruent and incongruent conditions and did not change with experience). * p < 0.05, ** p < 0.01, *** p < 0.001. Asterisks above or below bars indicate a significant difference from zero. Horizontal lines within brain regions in (a) reflect an interaction between modality and learning day, whereas horizontal lines within brain regions in the other panels reflect main effects of (b) learning day, (c-e) modality, or (f) congruency.” – Supplemental Figure S4.

      “Notably, our perirhinal cortex mask overlaps with a key region of the ventral anterior temporal lobe thought to be the central locus of crossmodal integration in the “hub and spokes” model of semantic representations.9,50 However, additional work has also linked other brain regions to the convergence of unimodal representations, such as the hippocampus51,52,53 and inferior parietal lobes.54,55 This past work on the hippocampus and inferior parietal lobe does not necessarily address the crossmodal binding problem that was the main focus of our present study, as previous findings often do not differentiate between crossmodal integrative coding and the convergence of unimodal feature representations per se. Furthermore, previous studies in the literature typically do not control for stimulus-based factors such as experience with unimodal features, subjective similarity, or feature identity that may complicate the interpretation of results when determining regions important for crossmodal integration. Indeed, we found evidence consistent with the convergence of unimodal feature-based representations in both the hippocampus and inferior parietal lobes (Supplemental Figure S4), but no evidence of crossmodal integrative coding different from the unimodal features. The hippocampus and inferior parietal lobes were both sensitive to visual and sound features before and after crossmodal learning (see Supplemental Figure S4c-e). Yet the hippocampus and inferior parietal lobes did not differentiate between the congruent and incongruent conditions or change with experience (see Supplemental Figure S4).” – pg. 20

      (F) The direction of the neural effects was difficult to track and understand. I think the key observation is that TP and PRh both show changes related to cross-modal congruency - but still it would be helpful if the authors could articulate, perhaps via a schematic illustration, how they think representations in each key area are changing with the cross-modal association. Why does the temporal pole come to activate less for congruent than incongruent stimuli (Figure 3)? And why do TP responses grow less similar to one another for congruent relative to incongruent stimuli after learning (Figure 4)? Why are incongruent stimulus similarities anticorrelated in their perirhinal responses following cross-modal learning (Figure 6)?

      We thank the reviewer for identifying this issue, which was also raised by the other reviewers. The reviewer is correct that the key observation is that the TP and PRC both show changes related to crossmodal congruency (given that the unimodal features were equated in the methodological design). However, the structure of the integrative code is less clear, which we now emphasize in the main text. Our findings provide evidence of a crossmodal integrative code that is different from the unimodal features, and future studies are needed to better understand the structure of how such a code might emerge. We now more clearly highlight this distinction throughout the paper:

      “By contrast, perirhinal cortex may be involved in pattern separation following crossmodal experience. In our task, participants had to differentiate congruent and incongruent objects constructed from the same three shape and sound features (Figure 2). An efficient way to solve this task would be to form distinct object-level outputs from the overlapping unimodal feature-level inputs such that congruent objects are made to be orthogonal from the representations before learning (i.e., measured as pattern similarity equal to 0 in the perirhinal cortex; Figure 5b, 6, Supplemental Figure S5), whereas non-learned incongruent objects could be made to be dissimilar from the representations before learning (i.e., anticorrelation, measured as pattern similarity less than 0 in the perirhinal cortex; Figure 6). Because our paradigm could decouple neural responses to the learned object representations (on Day 4) from the original component unimodal features at baseline (on Day 2), these results could be taken as evidence of pattern separation in the human perirhinal cortex.11,12 However, our pattern of results could also be explained by other types of crossmodal integrative coding. For example, incongruent object representations may be less stable than congruent object representations, such that incongruent object representations are warped to a greater extent than congruent object representations (Figure 6).” – pg. 18
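
The distinction drawn in the passage above between orthogonal patterns (similarity equal to 0) and anticorrelated patterns (similarity less than 0) can be illustrated with a minimal sketch. It assumes pattern similarity is a Pearson correlation between two activity patterns; the function name and the four-voxel patterns are hypothetical and are not study data.

```python
import math

def pattern_similarity(a, b):
    """Pearson correlation between two equal-length activity patterns."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical 4-voxel patterns (illustrative values only, not study data).
before = [1.0, -1.0, 1.0, -1.0]
orthogonal_after = [1.0, 1.0, -1.0, -1.0]      # unrelated to `before`
anticorrelated_after = [-1.0, 1.0, -1.0, 1.0]  # mirror image of `before`

print(pattern_similarity(before, orthogonal_after))      # 0.0
print(pattern_similarity(before, anticorrelated_after))  # -1.0
```

In this toy case a similarity of 0 means the later pattern carries no linear trace of the earlier one, whereas a negative value means the later pattern is systematically opposed to it.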

      “As one solution to the crossmodal binding problem, we suggest that the temporal pole and perirhinal cortex form unique crossmodal object representations that are different from the distributed features in sensory cortex (Figure 4, 5, 6, Supplemental Figure S5). However, the nature by which the integrative code is structured and formed in the temporal pole and perirhinal cortex following crossmodal experience – such as through transformations, warping, or other factors – is an open question and an important area for future investigation. Furthermore, these anterior temporal lobe structures may be involved with integrative coding in different ways. For example, the crossmodal object representations measured after learning were found to be related to the component unimodal feature representations measured before learning in the temporal pole but not the perirhinal cortex (Figure 5, 6, Supplemental Figure S5). Moreover, pattern similarity for congruent shape-sound pairs was lower than the pattern similarity for incongruent shape-sound pairs after crossmodal learning in the temporal pole but not the perirhinal cortex (Figure 4b, Supplemental Figure S3a). As one interpretation of this pattern of results, the temporal pole may represent new crossmodal objects by combining previously learned knowledge.8,9,10,11,13,14,15,33 Specifically, research into conceptual combination has linked the anterior temporal lobes to compound object concepts such as “hummingbird”.34,35,36 For example, participants during our task may have represented the sound-based “humming” concept and visually-based “bird” concept on Day 1, forming the crossmodal “hummingbird” concept on Day 3; Figure 1, 2, which may recruit less activity in temporal pole than an incongruent pairing such as “barking-frog”. 
For these reasons, the temporal pole may form a crossmodal object code based on pre-existing knowledge, resulting in reduced neural activity (Figure 3d) and pattern similarity towards features associated with learned objects (Figure 4b).” – pg. 18

      This work represents a key step in our advancing understanding of object representations in the brain. The experimental design provides a useful template for studying neural change related to the cross-modal association that may prove useful to others in the field. Given the broad variety of open questions and potential alternative analyses, an open dataset from this study would also likely be a considerable contribution to the field.

    1. Author Response:

      Reviewer #1:

      In this manuscript Hill et al, analyze immune responses to vaccination of adults with the seasonal influenza vaccine. They perform a detailed analysis of the hemagglutinin-specific binding antibody responses against several different strains of influenza, and antigen-specific CD4+ T cells/T follicular cells, and cytokines in the plasma. Their analysis reveals that: (i) tetramer positive, HA-specific T follicular cells induced 7 days post vaccination correlate with the binding Ab response measured 42 days later; (ii) the HA-specific T fh have a diverse TCR repertoire; (iii) Impaired differentiation of HA-specific T fh in the elderly; and (iv) identification of an "inflammatory" gene signature within T fh in the elderly, which is associated with the impaired development of HA-specific Tfh.

      The paper addresses a topic of considerable interest in the fields of human immunology and vaccinology. In general the experiments appear well performed, and support the conclusions. However, the following points should be addressed to enhance the clarity of the paper, and add support to the key conclusions drawn.

      We thank the reviewer for their supportive evaluation of the manuscript, and have provided the details of how we have addressed each of the points raised below.

      1) Abstract: "(cTfh) cells are the best predictor of high titre antibody responses.." Since the authors have not done any blind prediction using machine learning tools with an independent cohort, the sentence should be rephrased thus: "(cTfh) cells were associated with high titre antibody responses."

      We agree that this phrasing better reflects the presented data. The sentence in the abstract (page 2) now reads “we show that formation of circulating T follicular helper (cTfh) cells was associated with high titre antibody responses.”

      2) Figure 1A: Please indicate the age range of the subjects.

      Figure 1 has been updated to include the age range of the subjects.

      3) Almost all the data in the paper show binding Ab titers. Yet, typically HAI titers or MN titers are used to assess Ab responses to influenza. Fig 1C shows HAI titers against the H1N1 Cal 09 strain. Can the authors show HAI titers for Cal 09 and the other A and B strains contained in the 2 vaccine cohorts? Do such HAI titers correlate with the tetramer positive cells, similar to the correlations shown in Fig 2e?

      In this manuscript we have deliberately focussed on the immune response to the H1N1 Cal09 strain, as it is the only influenza strain in the vaccine common to both cohorts. The HAI titre for this strain is now shown as supplementary figure 4. In addition, the class II tetramers were specifically selected to recognise unique epitopes in the Cal 09 strain (J. Yang, {..} W. W. Kwok, CD4+ T cells recognize unique and conserved 2009 H1N1 influenza hemagglutinin epitopes after natural infection and vaccination. Int Immunol 25, 447-457, 2013). Because of this, we do not think it is appropriate to correlate HAI titres for the non-Cal 09 strains with tetramer positive cells. We agree that showing the correlation of cTfh and other immune parameters with the HAI titres for Cal 09 is important and have included this as supplementary figure 7. The new data and text are presented below:

      Figure 1-figure supplement 4: HAI responses before and after vaccination A) Log2 HAI titres at baseline (d0), d7 and d42 for cohort 1 (n=16) and B) cohort 2 (n = 21). C) Correlation between HAI and A.Cali09 IgG as measured by Luminex assay for cohort 1 and 2 combined. p-values determined using paired Wilcoxon signed rank-test, and Pearson’s correlation.

      Text changes. Page 4. “The increase in anti-HA antibody titre was coupled with an increase in hemagglutination inhibitory antibodies to A.Cali09, the one influenza A strain contained in the TIVs that was shared across the two cohorts and showed a positive correlation with the A.Cali09 IgG titres measured by Luminex assay (Fig. 1C, Figure 1-figure supplement 4).”

      Figure 2-figure supplement 1: Correlations between HAI assay titres and selected immune parameters. Correlation between vaccine-induced A.Cali09 HAI titres at d42 with selected immune parameters in both Cohort 1 and Cohort 2 (n=37). Dot color corresponds to the cohort (black = Cohort 1, grey = Cohort 2). Coefficient (Rho) and p-value determined using Spearman’s correlation, and line represents linear regression fit.

      Results text changes: Page 5. “Similar trends were seen when these immune parameters were correlated to HAI titres against A/Cali09 (Figure 2-figure supplement 1).”

      4) Fig 2d to i: what % of all bulk activated Tfh at day 7 are tetramer positive? The tetramer positive T cells constitute roughly 0.094% of all CD4 T cells (Fig 2d), of which 1/3rd are CXCR5+, PD1+ (i.e. ~0.03% of CD4 T cells). What fraction of all activated Tfh is this subset of tetramer positive cells? Presumably, there will also be Tfh generated against other viral proteins in the vaccine, and these will constitute a significant fraction of all activated Tfh.

      This is an important point, as the tetramers only recognise one peptide epitope of the Cal.09 HA protein, so there will be many other influenza reactive CD4+ T cells that are responding to other Cal 09 epitopes as well as other proteins in the vaccine. The analysis suggested by the reviewer shows that the frequency of Tet+ cells amongst bulk cTfh cells ranges from 0.14% to 1.52% in cohort 1, and from 0.022% to 2.7% in cohort 2. These data have been included as Figure 1-figure supplement 6C, D in the revised manuscript. In addition, Tet+ cells as a percentage of bulk cTfh cells were reduced in older people compared to younger adults. These data have been included in Figure 5-figure supplement 1C in the revised manuscript.

      Figure 1-figure supplement 6: Percentage of cTfh cells that are Tet+ and CXCR3 and CCR6 expression on HA-specific CD4+ T cells. A) Representative flow cytometry gating strategy for CXCR5+PD-1+ cTfh cells on CD4+CD45RA- T cells, and the proportion of HA-specific Tet+ cells within the CXCR5+PD-1+ cTfh cell gate. B) Percentage Tet+ cells within the CXCR5+PD-1+ cTfh cell population. Within-cohort age group differences were determined using the Mann-Whitney U test.

      Results text, page 4: These antigen-specific T cells had upregulated ICOS after immunisation, indicating that they have been activated by vaccination (Fig. 1F, G). In addition, a median of one third of HA-specific T cells upregulated the Tfh markers CXCR5 and PD1 on d7 after immunisation (Fig. 1H, I). The tetramer binding cells represented between 0.022-2.7% of the total CXCR5+PD-1+ bulk population (Figure 1-figure supplement 6A, B).

      Figure 5-figure supplement 1C: Age-related differences in cytokines and HA-specific CD4+ T cell parameters. C) Percentage Tet+ cells within the CXCR5+PD-1+ cTfh cell population. Within-cohort age group differences were determined using the Mann-Whitney U test.

      Results text, page 8: Across both cohorts, the only CD4+ T cell parameters consistently reduced in older individuals at d7 were the frequency of polyclonal cTfh cells and HA-specific Tet+ cTfh cells, with the strongest effect within the antigen-specific cTfh cell compartment (Fig. 5H-J, Figure 5-figure supplement 1C).

      Reviewer #2:

      Hill and colleagues present a comprehensive dataset describing the recall and expansion of HA-specific cTFH cells following influenza immunisation in two cohorts. Using class II tetramers, IgG titres against a large panel of HA antigens, and quantification of plasma cytokines, they find that activated and HA-specific cTFH cells were a strong predictor of the IgG response against the vaccine after 6 weeks. Using RNAseq and TCR clonotype analysis, they find that, in 10/15 individuals, the HA-specific cTFH response at day 7 post-vaccination is recalled from the available CD4 T cell memory pool present prior to vaccination. Post-vaccination HA-specific cTFH cells exhibited a transcriptional profile consistent with lymph node-derived GC TFH, as well as evidence of downregulation of IL-2 signaling pathways relative to pre-vaccine CD4 memory cells.

      The authors then apply these findings to a comparison of vaccine immunogenicity between younger (18-36) and older (>65) adults. As expected, they found lower levels of vaccine-specific IgG responses among the older cohort. Analysis of HA-specific T cell responses indicated that tet+ cTFH fail to properly develop in the older cohort following vaccination. Further analysis suggests that development of HA-specific cTFH in older individuals is not caused by a lack of TCR diversity, but is associated with higher expression of inflammation-associated transcripts in tet+ cTFH.

      Overall this is an impressive study that provides clarity around the recall of HA-specific CD4 T cell memory, and the burst of HA-specific cTFH cells observed 7 days post-vaccination. The association between defective cTFH recall and lower IgG titres post-vaccination in older individuals provides new targets for improving influenza vaccine efficacy in this age group. However, as currently presented, the model of impaired cTFH differentiation in the older cohort and the link to inflammation is somewhat unclear. There are several issues that could be clarified to improve the manuscript in its current form:

      We thank the reviewer for their supportive and comprehensive summary of our work. We agree that the link between inflammation and impaired cTfh differentiation is correlative. We have added new data to address this, including mechanistic data to support chronic IL-2 signalling as antagonistic to cTfh development, and have provided new analyses to address the other points raised.

      1) It is somewhat unclear the extent to which the reduction in HA-specific cTFH in the older cohort is also related to an overall reduction in T cell expansion - cohort 1 shows a significant reduction in total tet+ CD4 T cells post-vaccination as well as in the cTFH compartment, and while this difference may not reach statistical significance, a similar trend is shown for cohort 2.

      We agree that a possible interpretation is a global failure in T cell expansion in the older individuals. To determine whether there is a relationship between the degree of Tet+ CD4+ T cell expansion and cTfh cell differentiation with age, we performed correlation analyses. There is no correlation between the expansion of Tet+ cells and the frequency of cTfh cells formed seven days after immunisation in either age group. This suggests that the impaired cTfh cell differentiation in older persons is most likely caused by factors other than the capacity of CD4+ T cells to expand after vaccination. These data have been added as Figure 5-figure supplement 1D, and included in the results text on page 8.

      Figure 5-figure supplement 1D: Age-related differences in cytokines and HA-specific CD4+ T cell parameters. D) Correlation between Tet+ cells (d7-d0, % of CD4+) and cTfh (d7-d0, % of TET+) in both cohorts for each age-group (18- 36 y.o n=37, 65+ y.o. n= 39). Dot color corresponds to the cohort (black = Cohort 1, grey = Cohort 2). Coefficient (Rho) and p-value determined using Spearman’s correlation, and line represents linear regression fit.

      Text changes, Page 8: There was no consistent difference in the total d7 Tet+ HA-specific T cell population with age for both cohorts (Fig. 5H) and we observed no age-related correlation between the ability of an individual to differentiate Tet+ cells into a cTfh cell and the overall expansion of Tet+ HA-specific T cell population (Figure 5-figure supplement 1D). Thus, our data suggest that the poor vaccine antibody responses in older individuals are impacted by impaired cTfh cell differentiation (Fig. 5J) rather than the size of the vaccine-specific CD4+ T cell pool.

      2) Transcriptomic analysis indicates that HA-specific cTFH in the older cohort show impaired downregulation of inflammation, TNF and IL-2-related signaling pathways. The authors therefore conclude that excess inflammation can limit the response to vaccination. In its current presentation, the data does not necessarily support this conclusion. While it is clear that downregulation of TNF and IL-2 signalling pathways occur during cTFH/TFH differentiation, there is no evidence presented to support the idea that (a) vaccination results in increased pro-inflammatory cytokine production in lymphoid organs in older individuals or that (b) these pro-inflammatory cytokines actively promote CXCR5-, rather than cTFH, differentiation of existing memory T cells.

      We agree with the reviewer that the data presented in figure 7 are correlative, rather than causative. Unfortunately, we do not have access to secondary lymphoid tissues from younger and older people after vaccination to test point (a) above. In order to test the hypothesis that increased inflammatory cytokine production in lymphoid organs limits Tfh cell differentiation, we have used Il2cre/+; Rosa26stop-flox-Il2/+ transgenic mice. In this mouse model, IL-2-dependent cre-recombinase activity facilitates the expression of low levels of IL-2 in cells that have previously expressed IL-2. This creates a scenario in which cells that physiologically express IL-2 cannot turn its expression off, therefore increasing expression of IL-2 after antigenic stimulation (mice reported in Whyte et al., bioRxiv, 2020, doi: https://doi.org/10.1101/2020.12.18.423431).

      Twelve days after influenza A infection, Il2cre/+; Rosa26stop-flox-Il2/+ transgenic mice have fewer Tfh cells in the draining mediastinal lymph node and in the spleen (Fig. 8A-C); this is accompanied by a reduction in the magnitude of the GC B cell response (Fig. 8D-E). These data provide a proof of concept that sustained IL-2 production limits the formation of Tfh cells, consistent with the negative correlation of an IL-2 signalling gene signature and cTfh cell formation in humans (Figure 7). These new data support the conclusion that excess IL-2 signalling can limit the Tfh cell response. These data are presented in Figure 8, and are discussed on page 12 in the results, and pages 12-13 in the discussion.

      Figure 8: Increased IL-2 production impairs Tfh cell formation and the germinal centre response. Assessment of the Tfh cell and germinal centre response in Il2cre/+; Rosa26stop-flox-Il2/+ transgenic mice that do not switch off IL-2 production, and Il2cre/+; Rosa26+/+ control mice 12 days after influenza A infection. Flow cytometric contour plots (A) and quantification of the percentage of CXCR5highPD-1highFoxp3-CD4+ Tfh cells in the mediastinal lymph node (B) and spleen (C). Flow cytometric contour plots (D) and quantification of the percentage of Bcl6+Ki67+B220+ germinal centre B cells in the mediastinal lymph node (E) and spleen (F). The height of the bars indicates the median, each symbol represents one mouse, data are pooled from two independent experiments. P-values calculated between genotype-groups by Mann Whitney U test.

      Results text, page 12: Sustained IL-2 production inhibits Tfh cell frequency and the germinal centre response. To test the hypothesis that cytokine signalling needs to be curtailed to facilitate Tfh cell differentiation, we turned to a genetically modified mouse model in which cells that have initiated IL-2 production cannot switch it off, Il2cre/+; Rosa26stop-flox-Il2/+ mice (37). Twelve days after influenza infection, Il2cre/+; Rosa26stop-flox-Il2/+ mice have fewer Tfh cells in the draining lymph node and spleen (Fig. 8A-C), which is associated with a reduced frequency of germinal center B cells (Fig. 8D-F). This provides a proof of concept that proinflammatory cytokine production needs to be limited to enable full Tfh cell differentiation in secondary lymphoid organs.

      Discussion text, pages 12, 13: These enhanced inflammatory signatures associated with poor antibody titre in an independent cohort of influenza vaccinees. The dampening of Tfh cell formation by enhanced cytokine production was confirmed by the use of genetically modified mice where IL-2 production is restricted to the appropriate anatomical and cellular compartments, but once initiated cannot be inactivated. Together, this suggests that formation of antigen-specific Tfh cells is essential for high titre antibody responses, and that excessive inflammatory factors can contribute to poor cTfh cell responses.

    1. Author Responses

      Reviewer #1 (Public Review):

      This study uses a nice longitudinal dataset and performs relatively thorough methodological comparisons. I also appreciate the systematic literature review presented in the introduction. The discussion of confound control is interesting and it is great that a leave-one-site-out test was included. However, the prediction accuracy drops in these important leave-one-site-out analyses, which should be assessed and discussed further.

      Furthermore, I think there is a missed opportunity to test longitudinal prediction using only pre-onset individuals to gain clearer causal insights. Please find specific comments below, approximately in order of importance.

      We thank the reviewers for their positive remarks and for providing important suggestions to improve the analysis. Please see our detailed comments below.

      1) The leave-one-site-out results fail to achieve significant prediction accuracy for any of the phenotypes. This reveals a lack of cross-site generalizability of all results in this work. The authors discuss that this variance could be caused by distributed sample sizes across sites resulting in uneven folds or site-specific variance. It should be possible to test these hypotheses by looking at the relative performance across CV folds. The site-specific variance hypothesis may be likely because for the other results confounds are addressed using oversampling (i.e., sampling with replacement) which creates a large sample with lower variance than a random sample of the same size. This is an important null finding that may have important implications, so I do not think that it is cause for rejection. However, it is a key element of this paper and I think it should be assessed further and discussed more widely in the abstract and conclusion.

      We thank the reviewer for raising this point and providing specific suggestions. As mentioned by the reviewer, the leave-one-site-out results showed high variance across sites, that is, across cross validation (CV) folds. Therefore, as suggested by the reviewer, we further investigated the source of this variance by observing how the model accuracies correlate with each site and its sample size, ratio of AAM-to-controls, and sex distribution. We ranked the sites from low to high accuracy and observed different performance metrics such as sensitivity and specificity:

      As shown, the models performed close-to-chance for sites ‘Dublin’, ‘Paris’ and ‘Berlin’ (<60% mean balanced accuracy) in the leave-one-site-out experiment, across all time-points and metrics. Notably, the order of the performance at each site does not correspond to the sample sizes (please refer to the ‘counts’ column in the above figure). It also does not correspond to the ratio of AAM-to-controls, or to the sex distribution.

      To further investigate this, we performed an additional leave-one-site-out experiment with all 8 sites. Here, we repeated the ML (Machine Learning) exploration by using the entire data, including the data from the Nottingham site that was kept aside as the holdout. Since there are 8 sites now, we used an 8-fold cross validation and observed how the model accuracy varied across each site:

      The results were comparable to the original leave-one-site-out experiment. Along with ‘Dublin’ and ‘Berlin’, the models additionally performed poorly on the ‘Nottingham’ site. Results on ‘London’ and ‘Paris’ also fell below 60% mean balanced accuracy.

      Finally, we compared the above two results to the main experiment from the paper where the test samples were randomly sampled across all sites. The performance on test subjects from each site was compared:

      As seen, the models struggled with subjects from ‘Dublin’ followed by ‘Nottingham’, ‘London’ and ‘Berlin’ respectively, and performed well on subjects from ‘Dresden’, ‘Mannheim’, ‘Hamburg’ and ‘Paris’.

      Across all the three results discussed above, the models consistently struggle to generalize to subjects particularly from ‘Dublin’ and ‘Nottingham’. As already pointed out by the reviewer, the variance in the main experiment in the manuscript is lower because of the random sampling of the test set across all sites. Since these results have important implications, we have included them in the manuscript and also provided these figures in the Appendix.
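
The leave-one-site-out procedure discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the one-feature threshold classifier, the function names, the site labels (`site_A` to `site_C`), and the data values are all hypothetical. Each site is held out in turn, the model is fitted on the remaining sites, and balanced accuracy is reported per held-out site so that poorly generalizing sites become visible.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (sensitivity and specificity for two classes)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def fit_threshold(xs, ys):
    """Toy one-feature classifier: threshold midway between the class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2, mean1 >= mean0

def predict(model, xs):
    threshold, one_is_high = model
    return [int((x > threshold) == one_is_high) for x in xs]

def leave_one_site_out(xs, ys, sites):
    """Return {held_out_site: balanced accuracy on that site}."""
    by_site = defaultdict(list)
    for i, site in enumerate(sites):
        by_site[site].append(i)
    scores = {}
    for held_out, test_idx in by_site.items():
        train_idx = [i for i in range(len(xs)) if sites[i] != held_out]
        model = fit_threshold([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
        preds = predict(model, [xs[i] for i in test_idx])
        scores[held_out] = balanced_accuracy([ys[i] for i in test_idx], preds)
    return scores

# Hypothetical data: site_C is constructed so that the feature-label
# relation is reversed there, mimicking a site the model cannot generalize to.
xs = [0.1, 0.2, 0.9, 0.8, 0.0, 0.3, 1.0, 0.7, 0.6, 0.7, 0.4, 0.3]
ys = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
sites = ["site_A"] * 4 + ["site_B"] * 4 + ["site_C"] * 4

scores = leave_one_site_out(xs, ys, sites)
for site, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(site, score)
```

Ranking the per-site scores from low to high, as in the final loop, mirrors the per-site comparison described in the response; in this toy example the deliberately shifted site falls to the bottom regardless of its sample size.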

      2) The authors state that "83.3% of subjects reported having no or just one binge drinking experience until age 14". To gain clearer insights into the causality, I recommend repeating the MRIage14 → AAMage22 prediction using only these 83% of subjects.

      We thank the reviewer for this valuable comment. As suggested by the reviewer, we have now repeated the MRIage14 → AAMage22 analysis by including (a) only the subjects who had no binge drinking experiences (n=477) by age 14 and (b) subjects who had one or fewer binge drinking experiences (n=565). The results are shown below. The balanced accuracies on the holdout set were 72.9 +/- 2% and 71.1 +/- 2.3% respectively, which are comparable to the main result of 73.1 +/- 2%.

      These results provide further evidence that a certain form of cerebral predisposition might precede the observed alcohol misuse behavior in the IMAGEN dataset. We now discuss these results in the Results section and the 2nd paragraph of the Discussion.

      3) The feature importance results for brain regions are quite inconsistent across time points. As such, the study doesn't really address one of the main challenges with previous work discussed in the introduction: "brain regions reported were not consistent between these studies either and do not tell a coherent story". This would be worth looking into further, for example by looking at other indices of feature importance such as permutation-based measures and/or investigating the stability of feature importance across bootstrapped CV folds.

      The feature importance results shown in Figure 9 are intended to be illustrative and to show where the most informative structural features are mainly clustered in the brain, for each time point. We would like to acknowledge that this figure could be a bit confusing. Hence, we have now provided an exhaustive table in the Appendix, consisting of all important features and their respective SHAP scores obtained across the seven repeated runs. In addition, we address the inconsistencies across time points in the 3rd paragraph in the Discussion chapter and contrast our findings with previous studies. These claims can now be verified from the table of features provided in the Appendix.

      Addressing the reviewer's suggestions, we would like to point out that SHAP is itself a type of permutation-based measure of feature importance. Since it derives from the theoretically sound Shapley values, is model agnostic, and has already been applied in biomedical applications, we believe that running another permutation-based analysis would not be beneficial. We have also investigated the stability of our feature importance scores by repeating the SHAP estimation with different random permutations. This process is explained in the Methods section Model Interpretation.

      Additionally, the SHAP scores across the seven repetitions are now also provided in Appendix Table 6 for verification.
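
The idea of checking the stability of an importance score across repeated random permutations can be sketched as below. Note that this sketch uses plain permutation importance (the drop in accuracy after shuffling one feature) rather than SHAP, so it is a simpler stand-in and not the estimator used in the study. The toy model, the data, and all function names are hypothetical; the seven repeats are chosen only to mirror the seven repetitions mentioned above.

```python
import random
import statistics

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, seed):
    """Drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:]
                for r, v in zip(rows, column)]
    return accuracy(predict, rows, labels) - accuracy(predict, shuffled, labels)

def toy_model(row):
    """Hypothetical fitted model that relies only on feature 0."""
    return int(row[0] > 0.5)

# Hypothetical data: the label depends on feature 0; feature 1 is noise.
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

def importance_over_seeds(feature, n_repeats=7):
    """Importance of one feature for each of n_repeats shuffling seeds."""
    return [permutation_importance(toy_model, rows, labels, feature, seed)
            for seed in range(n_repeats)]

for feature in (0, 1):
    scores = importance_over_seeds(feature)
    print(feature, statistics.mean(scores), statistics.stdev(scores))
```

A small spread across seeds relative to the mean suggests the ranking of features is not an artifact of one particular shuffle; reporting the per-repeat scores, as in the Appendix table described above, lets readers judge this directly.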

    1. Author Response

      We thank the reviewers for their positive feedback and thoughtful suggestions that will improve our manuscript. Here we summarise our plan for immediate action. We will resubmit our manuscript once additional experiments have been performed to clarify all the major and minor concerns of the reviewers and the manuscript has been revised. At that point, we will respond to all the reviewers’ points and highlight the changes made in the text.

      Reviewer #1 (Public Review):

      The authors have tried to correlate changes in the cellular environment by means of altering temperature, the expression of key cellular factors involved in the viral replication cycle, and small molecules known to affect key viral protein-protein interactions with some physical properties of the liquid condensates of viral origin. The ideas and experiments are extremely interesting as they provide a framework to study viral replication and assembly from a thermodynamic point of view in live cells.

      The major strengths of this article are the extremely thoughtful and detailed experimental approach; although this data collection and analysis are most likely extremely time-consuming, the techniques used here are so simple that the main goal and idea of the article become elegant. A second major strength is that in order to understand some of the physicochemical properties of the viral liquid inclusion, they used stimuli that have been very well studied, and thus one can really focus on a relatively easy interpretation of most of the data presented here.

      There are three major weaknesses in this article. The way it is written, especially at the beginning, is extremely confusing. First, I would suggest authors should check and review extensively for improvements to the use of English. In particular, the abstract and introduction are extremely hard to understand. Second, in the abstract and introduction, the authors use terms such as "hardening", "perturbing the type/strength of interactions", "stabilization", and "material properties", for just citing some terms. It is clear that the authors do know exactly what they are referring to, but the definitions come so late in the text that it all becomes confusing. The second major weakness is that there is a lack of deep discussion of the physical meaning of some of the measured parameters like "C dense vs inclusion", and "nuclear density and supersaturation". There is a need to explain further the physical consequences of all the graphs. Most of them are discussed in a very superficial manner. The third major weakness is a lack of analysis of phase separations. Some of their data suggest phase transition and/or phase separation, thus, a more in-depth analysis is required. For example, could they calculate the change of entropy and enthalpy of some of these processes? Could they find some boundaries for these transitions between the "hard" (whatever that means) and the liquid?

      The authors have achieved almost all their goals, with the caveat of the third weakness I mentioned before. Their work presented in this article is of significant interest and can become extremely important if a more detailed analysis of the thermodynamics parameters is assessed and a better description of the physical phenomenon is provided.

      We thank reviewer 1 for the comments and, in particular, for being so positive regarding the strengths of our manuscript and for raising concerns that will surely improve the manuscript. At this point, we propose the following actions to address the concerns of Reviewer 1:

      1) We will extensively revise the use of English, particularly in the abstract and introduction, defining key terms as they are introduced in the text to make the argument clearer.

      2) We acknowledge the importance of discussing our data in more detail and we propose the following. We will discuss the graphs and what they mean as exemplified in the paragraph below.

      Regarding Figure 3 - As the concentration of vRNPs increases, we observe an increase in supersaturation until 12 hpi. This means that, contrary to what is observed in a binary mixture, in which Cdilute is constant (Klosin et al., 2020), Cdilute in our system increases with concentration. It has been reported that Cdilute increases with bulk concentration in a multi-component system (Riback et al., 2020). Our findings have important implications for how we think about the condensates formed during influenza infection. As the 8 different genomic vRNPs have a similar overall structure, they could, in theory, behave as a binary system between units of vRNPs and Rab11a. However, a change in Cdilute with concentration shows that our system behaves as a multi-component system. This means that the differences in length, RNA sequence and valency among vRNPs are key for the integrity of condensates.
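
      The distinction between a binary-like and a multi-component-like system can be illustrated with a short sketch that fits the slope of Cdilute against total concentration. This is a sketch under stated assumptions, not the analysis used for Figure 3: the function `cdilute_slope` and all numbers are hypothetical, and a simple linear least-squares fit is assumed to be adequate.

```python
import numpy as np

def cdilute_slope(c_total, c_dilute):
    """Least-squares slope and intercept of C_dilute versus total concentration."""
    slope, intercept = np.polyfit(
        np.asarray(c_total, dtype=float),
        np.asarray(c_dilute, dtype=float),
        1,  # first-degree (linear) fit
    )
    return slope, intercept

# Made-up intensities in arbitrary units, not measured data.
c_total = [1.0, 2.0, 3.0, 4.0]
binary_like = [0.5, 0.5, 0.5, 0.5]          # C_dilute pinned at a fixed value
multicomponent_like = [0.5, 0.7, 0.9, 1.1]  # C_dilute rises with C_total

slope_binary, _ = cdilute_slope(c_total, binary_like)
slope_multi, _ = cdilute_slope(c_total, multicomponent_like)
```

      A slope indistinguishable from zero is what a binary mixture would predict, whereas a clearly positive slope is the signature described above for multi-component systems. With real data the slope would need a confidence interval before drawing either conclusion.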

      3) The reviewer calls our attention to the lack of analysis of phase separations. We think that phase separation (or percolation coupled to phase separation) governs the formation of influenza A virus condensates. However, we think we ought to exercise caution at this point, as the condensates we are working with are very complex and the physics of our system in cells may not be sufficient to claim phase separation without an in vitro reconstitution system. In fact, IAV inclusions contain cellular membranes, different vRNPs and Rab11a. So far, we can only speculate that the liquid character of IAV inclusions may arise from a network of interacting vRNPs that bridge several cognate vRNP-Rab11 units on flexible membranes, similarly to what happens in phase-separated vesicles in neurological synapses. However, this speculative model, although supported by correlative light and electron microscopy, currently lacks formal experimental validation.

      For this reason, we thought of developing the current work as an alternative to explore the importance of the liquid material properties of IAV inclusions. By finding an efficient method to alter the material properties of IAV inclusions, we provide proof of principle that it is possible to impose controlled phase transitions that reduce the dynamics of vRNPs in cells and negatively impact progeny virion production. Despite having discussed these issues in the limitations of the study, we will make our point clearer.

      We are currently establishing an in vitro reconstitution system to formally demonstrate, in an independent publication, that IAV inclusions are formed by phase separation. For this future work, we teamed up with Pablo Sartori, a theoretical physicist, to derive an in-depth analysis of the thermodynamics of the viral liquid condensates. Collectively, we think that cells have too many variables to derive meaningful physics parameters (such as entropy and enthalpy) or models, and that cell-based data need to be complemented by in vitro systems. For example, increasing the concentration inside a cell is not a simple endeavour as it relies on cellular pathways to deliver material to a specific place. At the same time, the 8 vRNPs, as mentioned above, differ in size, valency and RNA sequence and can behave very differently in the formation of condensates and maintenance of their material properties. Ideally, they should be analysed individually or in selected combinations. For the future, we will combine data from in vitro reconstitution systems and cells to address this very important point raised by the reviewer.

      From the paper on the section Limitations of the study: “Understanding condensate biology in living cells is physiologically relevant but complex because the systems are heterotypic and away from equilibria. This is especially challenging for influenza A liquid inclusions that are formed by 8 different vRNP complexes, which although sharing the same structure, vary in length, valency, and RNA sequence. In addition, liquid inclusions result from an incompletely understood interactome where vRNPs engage in multiple and distinct intersegment interactions bridging cognate vRNP-Rab11 units on flexible membranes (Chou et al., 2013; Gavazzi et al., 2013; Haralampiev et al., 2020; Le Sage et al., 2020; Shafiuddin & Boon, 2019; Sugita, Sagara, Noda, & Kawaoka, 2013). At present, we lack an in vitro reconstitution system to understand the underlying mechanism governing demixing of vRNP-Rab11a-host membranes from the cytosol. This in vitro system would be useful to explore how the different segments independently modulate the material properties of inclusions, explore if condensates are sites of IAV genome assembly, determine thermodynamic values, thresholds accurately, perform rheological measurements for viscosity and elasticity and validate our findings”.

      Reviewer #2 (Public Review):

      During Influenza virus infection, newly synthesized viral ribonucleoproteins (vRNPs) form cytosolic condensates, postulated as viral genome assembly sites and having liquid properties. vRNP accumulation in liquid viral inclusions requires its association with the cellular protein Rab11a directly via the viral polymerase subunit PB2. Etibor et al. investigate and compare the contributions of entropy, concentration, and valency/strength/type of interactions, on the properties of the vRNP condensates. For this, they subjected infected cells to the following perturbations: temperature variation (4, 37, and 42 °C), the concentration of viral inclusion drivers (vRNPs and Rab11a), and the number or strength of interactions between vRNPs using nucleozin a well-characterized vRNP sticker. Lowering the temperature (i.e. decreasing the entropic contribution) leads to a mild growth of condensates that does not significantly impact their stability. Altering the concentration of drivers of IAV inclusions impact their size but not their material properties. The most spectacular effect on condensates was observed using nucleozin. The drug dramatically stabilizes vRNP inclusions acting as a condensate hardener. Using a mouse model of influenza infection, the authors provide evidence that the activity of nucleozin is retained in vivo. Finally, using a mass spectrometry approach, they show that the drug affects vRNP solubility in a Rab11a-dependent manner without altering the host proteome profile.

      The data are compelling and support the idea that drugs that affect the material properties of viral condensates could constitute a new family of antiviral molecules as already described for the respiratory syncytial virus (Risso Ballester et al. Nature. 2021).

      Nevertheless, there are some limitations in the study. Several of them are mentioned in a dedicated paragraph at the end of a discussion. This includes the heterogeneity of the system (vRNP of different sizes, interactions between viral and cellular partners far from being understood), which is far from equilibrium, and the absence of minimal in vitro systems that would be useful to further characterize the thermodynamic and the material properties of the condensates.

      We thank reviewer 2 for highlighting specific details that need improving and raising such interesting questions to validate our findings. We will address all the minor comments of Reviewer 2. To address the comments of Reviewer 2, we propose the actions described in blue below each point raised that is written in italics.

      1) The concentrations are mostly evaluated using antibodies. This may be correct for Cdilute. However, measurement of Cdense should be viewed with caution as the antibodies may have some difficulty accessing the inner of the condensates (as already shown in other systems), and this access may depend on some condensate properties (which may evolve along the infection). This might induce artifactual trends in some graphs (as seen in panel 2c), which could, in turn, affect the calculation of some thermodynamic parameters.

      The concern about using antibodies to calculate Cdense is valid. We will address this concern by validating our results using a fluorescently tagged virus that has mNeonGreen fused to the viral polymerase PA (PA-mNeonGreen PR8 virus). Like NP, PA is a component of vRNPs and labels viral inclusions, colocalising with Rab11 when vRNPs are in the cytosol, without the need for antibodies.

      This virus would be the best to evaluate inclusion thermodynamics, were it not an attenuated virus (Figure 1A below) with a delayed infection, as demonstrated by the reduced levels of viral proteins (Figure 1B below). Consistently, it shows differences in the accumulation of vRNPs in the cytosol, and viral inclusions form later in infection. After their emergence, inclusions behave as in the wild-type virus (PR8-WT), fusing and dividing (Figure 1C below) and displaying liquid properties. The differences in concentration may shift or alter thermodynamic parameters such as time of nucleation, nucleation density, inclusion maturation rate, Cdense and Cdilute. This is the reason why we performed the thermodynamics profiling using antibodies upon PR8-WT infection. Taking into account the possibly delayed kinetics and the differences that may occur because of reduced vRNP accumulation in the cytosol, this virus will still be useful for validating our results, and we will therefore repeat the thermodynamic analyses using it.

      As a side note, vRNPs are composed of viral RNA coated with several molecules of NP and each vRNP also contains 1 copy of the trimeric RNA dependent RNA polymerase formed by PA, PB1 and PB2. It is well documented that in the cytosol the vast majority of PA (and other components of the polymerase) is in the form of vRNPs (Avilov, Moisy, Munier, et al., 2012; Avilov, Moisy, Naffakh, & Cusack, 2012; Bhagwat et al., 2020; Lakdawala et al., 2014), and thus we can use this virus to label vRNPs on condensates to corroborate our studies using antibodies.

      Figure 1 – The PA-mNeonGreen virus is attenuated in comparison to the WT virus. A. Cells (A549) were infected or mock-infected with PR8 WT or PA-mNeonGreen (PA-mNG) viruses, at a multiplicity of infection (MOI) of 3, for the indicated times. Viral production was determined by plaque assay and plotted as plaque forming units (PFU) per milliliter (mL) ± standard error of the mean (SEM). Data are a pool from 2 independent experiments. B. The levels of viral PA, NP and M2 proteins and actin in cell lysates at the indicated time points were determined by western blotting. C. Cells (A549) were transfected with a plasmid encoding mCherry-NP and co-infected with PA-mNeonGreen virus for 16 h, at an MOI of 10. Cells were imaged under time-lapse conditions starting at 16 hpi. White boxes highlight vRNPs/viral inclusions in the cytoplasm in the individual frames. The dashed white and yellow lines mark the cell nucleus and the cell periphery, respectively. The yellow arrows indicate the fission/fusion events and movement of vRNPs/viral inclusions. Bar = 10 µm. Bar in insets = 2 µm.

      2) Although the authors have demonstrated that vRNP condensates exhibit several key characteristics of liquid condensates (they fuse and divide, they dissolve upon hypotonic shock or upon incubation with 1,6-hexanediol, FRAP experiments are consistent with a liquid nature), their aspect ratio (with a median above 1.4) is much higher than the aspect ratio observed for other cellular or viral liquid compartments. This is intriguing and might be discussed.

      IAV inclusions have been shown to interact with microtubules and the endoplasmic reticulum, which confer movement, and also to undergo fusion and fission events. We propose that these interactions and movements exert forces that deform inclusions, making them less spherical. To test this assumption, we compared the aspect ratio of viral inclusions in the absence and presence of nocodazole (which abrogates microtubule-based movement). The data in Figure 2 show that in the presence of nocodazole, the aspect ratio decreases from 1.42 ± 0.36 to 1.26 ± 0.17, supporting our assumption.

      Figure 2 – Treatment with nocodazole reduces the aspect ratio of influenza A virus inclusions. Cells (A549) were infected with PR8 WT and treated with nocodazole (10 µg/mL) for 2 h, after which the movement of influenza A virus inclusions was captured by live cell imaging. Viral inclusions were segmented, and the aspect ratio was measured in ImageJ, then analysed and plotted in R.
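
      For readers unfamiliar with the metric, the aspect ratio of a segmented object can be sketched as the ratio of the major to the minor axis of its best-fitting ellipse, where 1 corresponds to a circle. The measurements above were made in ImageJ; the Python sketch below is a swapped-in, moment-based approximation for illustration only, and the function `aspect_ratio` and the synthetic masks are hypothetical.

```python
import numpy as np

def aspect_ratio(mask):
    """Major/minor axis ratio of a binary object, from its pixel covariance.

    For a uniformly filled ellipse the covariance eigenvalues are
    proportional to the squared semi-axes, so the square root of their
    ratio approximates major/minor. Undefined for degenerate
    (one-pixel-wide) objects, where the minor eigenvalue is zero.
    """
    ys, xs = np.nonzero(np.asarray(mask))
    coords = np.stack([xs, ys]).astype(float)   # shape (2, n_pixels)
    cov = np.cov(coords)                        # 2 x 2 covariance matrix
    minor, major = np.linalg.eigvalsh(cov)      # eigenvalues, ascending
    return float(np.sqrt(major / minor))

# Synthetic masks (not real images): a disc and a 3:2 ellipse.
yy, xx = np.mgrid[-40:41, -40:41]
circle = (xx / 20.0) ** 2 + (yy / 20.0) ** 2 <= 1.0
ellipse = (xx / 30.0) ** 2 + (yy / 20.0) ** 2 <= 1.0

ar_circle = aspect_ratio(circle)
ar_ellipse = aspect_ratio(ellipse)
```

      On these synthetic shapes the disc gives a value close to 1 and the ellipse a value close to 1.5, up to pixelation error. Values obtained this way need not match ImageJ's output exactly on irregular objects.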

      3) Similarly, the fusion event presented at the bottom of figure 3I is dubious. It might as well be an aggregation of condensates without fusion.

      We will change this, thank you for the suggestion.

      4) The authors could have more systematically performed FRAP/FLAPh experiments on cells expressing fluorescent versions of both NP and Rab11a to investigate the influence of condensate size, time after infection, or global concentrations of Rab11a in the cell (using the total fluorescence of overexpressed GFP-Rab11a as a proxy) on condensate properties.

      We will do our best to comply with this suggestion, as we think it is important.

      Reviewer #3 (Public Review):

      This study aims to define the factors that regulate the material properties of the viral inclusion bodies of influenza A virus (IAV). In a cellular model, it shows that the material properties were not affected by lowering the temperature nor by altering the concentration of the factors that drive their formation. Impressively, the study shows that IAV inclusions may be hardened by targeting vRNP interactions via the known pharmacological modulator (also an IAV antiviral), nucleozin, both in vitro and in vivo. The study employs current state-of-the-art methodology in both influenza virology and condensate biology, and the conclusions are well-supported by data and proper data analysis. This study is an important starting point for understanding how to pharmacologically modulate the material properties of IAV viral inclusion bodies.

      We thank this reviewer for all the positive comments. We will fully address the minor issues brought to our attention, including changing the title of the manuscript, and we will investigate the formation and material properties of IAV inclusions in the presence and absence of nucleozin for the nucleozin escape mutant NP-Y289H.

      References

      Avilov, S. V., Moisy, D., Munier, S., Schraidt, O., Naffakh, N., & Cusack, S. (2012). Replication-competent influenza A virus that encodes a split-green fluorescent protein-tagged PB2 polymerase subunit allows live-cell imaging of the virus life cycle. J Virol, 86(3), 1433-1448. doi:10.1128/JVI.05820-11

      Avilov, S. V., Moisy, D., Naffakh, N., & Cusack, S. (2012). Influenza A virus progeny vRNP trafficking in live infected cells studied with the virus-encoded fluorescently tagged PB2 protein. Vaccine, 30(51), 7411-7417. doi:10.1016/j.vaccine.2012.09.077

      Bhagwat, A. R., Le Sage, V., Nturibi, E., Kulej, K., Jones, J., Guo, M., . . . Lakdawala, S. S. (2020). Quantitative live cell imaging reveals influenza virus manipulation of Rab11A transport through reduced dynein association. Nat Commun, 11(1), 23. doi:10.1038/s41467-019-13838-3

      Chou, Y. Y., Heaton, N. S., Gao, Q., Palese, P., Singer, R. H., & Lionnet, T. (2013). Colocalization of different influenza viral RNA segments in the cytoplasm before viral budding as shown by single-molecule sensitivity FISH analysis. PLoS Pathog, 9(5), e1003358. doi:10.1371/journal.ppat.1003358

      Gavazzi, C., Yver, M., Isel, C., Smyth, R. P., Rosa-Calatrava, M., Lina, B., . . . Marquet, R. (2013). A functional sequence-specific interaction between influenza A virus genomic RNA segments. Proc Natl Acad Sci U S A, 110(41), 16604-16609. doi:10.1073/pnas.1314419110

      Haralampiev, I., Prisner, S., Nitzan, M., Schade, M., Jolmes, F., Schreiber, M., . . . Herrmann, A. (2020). Selective flexible packaging pathways of the segmented genome of influenza A virus. Nat Commun, 11(1), 4355. doi:10.1038/s41467-020-18108-1

      Klosin, A., Oltsch, F., Harmon, T., Honigmann, A., Julicher, F., Hyman, A. A., & Zechner, C. (2020). Phase separation provides a mechanism to reduce noise in cells. Science, 367(6476), 464-468. doi:10.1126/science.aav6691

      Lakdawala, S. S., Wu, Y., Wawrzusin, P., Kabat, J., Broadbent, A. J., Lamirande, E. W., . . . Subbarao, K. (2014). Influenza A virus assembly intermediates fuse in the cytoplasm. PLoS Pathog, 10(3), e1003971. doi:10.1371/journal.ppat.1003971

      Le Sage, V., Kanarek, J. P., Snyder, D. J., Cooper, V. S., Lakdawala, S. S., & Lee, N. (2020). Mapping of Influenza Virus RNA-RNA Interactions Reveals a Flexible Network. Cell Rep, 31(13), 107823. doi:10.1016/j.celrep.2020.107823

      Riback, J. A., Zhu, L., Ferrolino, M. C., Tolbert, M., Mitrea, D. M., Sanders, D. W., . . . Brangwynne, C. P. (2020). Composition-dependent thermodynamics of intracellular phase separation. Nature, 581(7807), 209-214. doi:10.1038/s41586-020-2256-2

      Shafiuddin, M., & Boon, A. C. M. (2019). RNA Sequence Features Are at the Core of Influenza A Virus Genome Packaging. J Mol Biol. doi:10.1016/j.jmb.2019.03.018

      Sugita, Y., Sagara, H., Noda, T., & Kawaoka, Y. (2013). Configuration of viral ribonucleoprotein complexes within the influenza A virion. J Virol, 87(23), 12879-12884. doi:10.1128/JVI.02096-13

    2. Author Response

      Reviewer #1 (Public Review):

      The authors have tried to correlate changes in the cellular environment by means of altering temperature, the expression of key cellular factors involved in the viral replication cycle, and small molecules known to affect key viral protein-protein interactions with some physical properties of the liquid condensates of viral origin. The ideas and experiments are extremely interesting as they provide a framework to study viral replication and assembly from a thermodynamic point of view in live cells.

      The major strengths of this article are the extremely thoughtful and detailed experimental approach; although this data collection and analysis are most likely extremely time-consuming, the techniques used here are so simple that the main goal and idea of the article become elegant. A second major strength is that in order to understand some of the physicochemical properties of the viral liquid inclusion, they used stimuli that have been very well studied, and thus one can really focus on a relatively easy interpretation of most of the data presented here.

      There are three major weaknesses in this article. The way it is written, especially at the beginning, is extremely confusing. First, I would suggest authors should check and review extensively for improvements to the use of English. In particular, the abstract and introduction are extremely hard to understand. Second, in the abstract and introduction, the authors use terms such as "hardening", "perturbing the type/strength of interactions", "stabilization", and "material properties", for just citing some terms. It is clear that the authors do know exactly what they are referring to, but the definitions come so late in the text that it all becomes confusing. The second major weakness is that there is a lack of deep discussion of the physical meaning of some of the measured parameters like "C dense vs inclusion", and "nuclear density and supersaturation". There is a need to explain further the physical consequences of all the graphs. Most of them are discussed in a very superficial manner. The third major weakness is a lack of analysis of phase separations. Some of their data suggest phase transition and/or phase separation, thus, a more in-depth analysis is required. For example, could they calculate the change of entropy and enthalpy of some of these processes? Could they find some boundaries for these transitions between the "hard" (whatever that means) and the liquid?

      The authors have achieved almost all their goals, with the caveat of the third weakness I mentioned before. Their work presented in this article is of significant interest and can become extremely important if a more detailed analysis of the thermodynamics parameters is assessed and a better description of the physical phenomenon is provided.

      We thank you for the comments and, in particular, for being so positive regarding the strengths of our manuscript and for raising concerns that will surely improve it. We have taken the following actions to address your concerns:

      1) Extensive revisions have been made to the use of English, particularly in the abstract and introduction. Key terms are defined as they are introduced in the text to enhance the clarity of the argument. This is a significant revision that is highlighted within the text, but it is too extensive to detail here.

      2) In the results section, we improved and extended the discussion of our graphs to the extent possible. However, we found that attempting to explain the graphs' meanings more thoroughly would detract from our manuscript's main focus: identifying thermodynamic changes that could potentially lead to alterations in material properties, specifically aspect ratio, size, and Gibbs free energy. As a result, we introduced the type of information we could obtain from our analyses in the introduction (Lines 112-125) and briefly commented on it in the ‘results’ section (Lines 304-306, sentences below).

      From introduction – lines 112-125:

      “In addition, other parameters like nucleation density determine how many viral condensates are formed per area of cytosol. Overall, the data will inform us if changing one parameter, e.g. the concentration, drives the system towards larger condensates with the same or more stable properties, or more abundant condensates that are forced to maintain the initial or a different size on account of available nucleation centres (Riback et al., 2020; Snead, 2022). It will also inform us if liquid viral inclusions behave like a binary or a multi-component system. In a binary mixture, Cdilute is constant (Klosin et al., 2020). However, in multi-component systems, Cdilute increases with bulk concentration (Riback et al., 2020). This type of information could have direct implications about the condensates formed during influenza infection. As the 8 different genomic vRNPs have a similar overall structure, they could, in theory, behave as a binary system between units of vRNPs and Rab11a. However, a change in Cdilute with concentration would mean that the system behaves as a multi-component system. This could raise the hypothesis that the differences in length, RNA sequence and valency that each vRNP has may be relevant for the integrity and behaviour of condensates.”

      From results lines 304-306:

      This indicates that the liquid inclusions behave as a multi-component system and allows us to speculate that the differences in length, RNA sequence and valency that each vRNP has may be key for the integrity and behaviour of condensates.

      3) The reviewer has drawn our attention to the absence of phase separation analysis in our study. We believe that the formation of influenza A virus condensates is governed by phase separation (or percolation coupled to phase separation). However, we must exercise caution at this point because the condensates we are studying are highly complex, and the physics of our cellular system may not be adequate to claim phase separation without being validated by an in vitro reconstitution system. IAV inclusions contain a variety of cellular membranes, different vRNPs, and Rab11a. While we have robust data to propose a model in which the liquid-like properties of IAV inclusions arise from a network of interacting vRNPs that bridge multiple cognate vRNP-Rab11 units on flexible membranes, similar to what occurs in phase-separated vesicles in neurological synapses, our model for this system still lacks formal experimental validation. As a note, the data supporting our model include: the demonstration of the liquid properties of our liquid inclusions (Alenquer et al. 2019, Nature Communications, 10, 1629); and impairment of recycling endocytic activity during IAV infection (Bhagwat et al. 2020, Nat Commun, 11, 23; Kawaguchi et al. 2012, J Virol, 86, 11086-95; Vale-Costa et al. 2016, J Cell Sci, 129, 1697-710). This leads to aggregated vesicles seen by correlative light and electron microscopy (Vale-Costa et al. 2016, J Cell Sci, 129, 1697-710) and by immunofluorescence and FISH (Amorim et al. 2011, J Virol 85, 4143-4156; Avilov et al. 2012, Vaccine 30, 7411-7417; Chou et al. 2013, PLoS Pathog 9, e1003358; Eisfeld et al. 2011, J Virol 85, 6117-6126; Lakdawala et al. 2014, PLoS Pathog 10, e1003971).

      To be able to explore the significance of the liquid material properties of IAV inclusions, we used the strategy described in this current work. By developing an effective method to manipulate the material properties of IAV inclusions, we provide evidence that controlled phase transitions can be induced, resulting in decreased vRNP dynamics in cells and a negative impact on progeny virion production. This suggests that the liquid character of liquid inclusions is important for their function in IAV infection. We have improved our explanation addressing this concern in the limitations of our study (as outlined below in the box and in manuscript in lines 857-872).

      We are currently establishing an in vitro reconstitution system to formally demonstrate, in an independent publication, that IAV inclusions are formed by phase separation (or percolation coupled to phase separation). For this future work, we teamed up with Pablo Sartori, a theoretical physicist, to derive an in-depth analysis of the thermodynamics of the viral liquid condensates in the in vitro reconstituted system and compare it to results obtained in the cell. This will provide a means to establish comparisons. We think that cells have too many variables to derive meaningful physics parameters (such as entropy and enthalpy) or models, and that cell-based data need to be complemented by in vitro systems. For example, increasing the concentration inside a cell is not a simple endeavour as it relies on cellular pathways to deliver material to a specific place. At the same time, the 8 vRNPs, as mentioned above, differ in size, valency and RNA sequence and can behave very differently in the formation of condensates and maintenance of their material properties. Ideally, they should be analysed individually or in selected combinations. For the future, we will combine data from in vitro reconstitution systems and cells to address this very important point raised by the reviewer.

      From the paper on the section ‘Limitations of the study’:

      “Understanding condensate biology in living cells is physiologically relevant but complex because the systems are heterotypic and away from equilibria. This is especially challenging for influenza A liquid inclusions that are formed by 8 different vRNP complexes, which although sharing the same structure, vary in length, valency, and RNA sequence. In addition, liquid inclusions result from an incompletely understood interactome where vRNPs engage in multiple and distinct intersegment interactions bridging cognate vRNP-Rab11 units on flexible membranes (Chou et al., 2013, Gavazzi et al., 2013, Sugita et al., 2013, Shafiuddin and Boon, 2019, Haralampiev et al., 2020, Le Sage et al., 2020). At present, we lack an in vitro reconstitution system to understand the underlying mechanism governing demixing of vRNP-Rab11a-host membranes from the cytosol. This in vitro system would be useful to explore how the different segments independently modulate the material properties of inclusions, explore if condensates are sites of IAV genome assembly, determine thermodynamic values, thresholds accurately, perform rheological measurements for viscosity and elasticity and validate our findings. The results could be compared to those obtained in cell systems to derive thermodynamic principles happening in a complex system away from equilibrium. Using cells to map how liquid inclusions respond to different perturbations provides the answer of how the system adapts in vivo, but has limitations.”

      Reviewer #2 (Public Review):

      During Influenza virus infection, newly synthesized viral ribonucleoproteins (vRNPs) form cytosolic condensates, postulated as viral genome assembly sites and having liquid properties. vRNP accumulation in liquid viral inclusions requires its association with the cellular protein Rab11a directly via the viral polymerase subunit PB2. Etibor et al. investigate and compare the contributions of entropy, concentration, and valency/strength/type of interactions, on the properties of the vRNP condensates. For this, they subjected infected cells to the following perturbations: temperature variation (4, 37, and 42 °C), the concentration of viral inclusion drivers (vRNPs and Rab11a), and the number or strength of interactions between vRNPs using nucleozin a well-characterized vRNP sticker. Lowering the temperature (i.e. decreasing the entropic contribution) leads to a mild growth of condensates that does not significantly impact their stability. Altering the concentration of drivers of IAV inclusions impact their size but not their material properties. The most spectacular effect on condensates was observed using nucleozin. The drug dramatically stabilizes vRNP inclusions acting as a condensate hardener. Using a mouse model of influenza infection, the authors provide evidence that the activity of nucleozin is retained in vivo. Finally, using a mass spectrometry approach, they show that the drug affects vRNP solubility in a Rab11a-dependent manner without altering the host proteome profile.

      The data are compelling and support the idea that drugs that affect the material properties of viral condensates could constitute a new family of antiviral molecules as already described for the respiratory syncytial virus (Risso Ballester et al. Nature. 2021).

      Nevertheless, there are some limitations in the study. Several of them are mentioned in a dedicated paragraph at the end of a discussion. This includes the heterogeneity of the system (vRNP of different sizes, interactions between viral and cellular partners far from being understood), which is far from equilibrium, and the absence of minimal in vitro systems that would be useful to further characterize the thermodynamic and the material properties of the condensates.

      There are other ones.

      We thank Reviewer 2 for highlighting specific details that need improving and for raising such interesting questions to validate our findings. We have addressed the comments of Reviewer 2 and performed the experiments described (in blue) below each point raised.

      1) The concentrations are mostly evaluated using antibodies. This may be correct for Cdilute. However, measurement of Cdense should be viewed with caution as the antibodies may have some difficulty accessing the inner of the condensates (as already shown in other systems), and this access may depend on some condensate properties (which may evolve along the infection). This might induce artifactual trends in some graphs (as seen in panel 2c), which could, in turn, affect the calculation of some thermodynamic parameters.

      The concern about using antibodies to calculate Cdense is valid, and we considered it very important. We addressed this concern by performing the same analyses using a fluorescently tagged virus that has mNeonGreen fused to the viral polymerase PA (PA-mNeonGreen PR8 virus). Like NP, PA is a component of vRNPs and labels viral inclusions, colocalising with Rab11 when vRNPs are in the cytosol. However, each vRNP carries only one molecule of PA, whereas it carries 37-96 molecules of NP depending on the size of the vRNP. As predicted, we did observe changes in Cdilute, Cdense and nucleation density. However, the values obtained for Gibbs free energy, size and aspect ratio followed the same trend whether viral inclusions were detected with fluorescently tagged vRNPs or with antibody staining. This allows us to validate our conclusion that major changes in Gibbs free energy occur solely when there is a change in the valency/strength of interactions but not in temperature or concentration (Figure 1 below). Given the extent of these data, we show the results here but, in the manuscript, we will describe the limitations of using antibodies in our study within the section ‘Limitations of the study’ (lines 881-894). Given the importance of the question regarding the pros and cons of the different systems for analysing thermodynamic parameters, we have decided to systematically assess and explore these differences in detail in a future manuscript.

      For further context, the reviewer may be asking why we did not use the PA-fluorescent virus in the first place to evaluate inclusion thermodynamics and avoid the difficulty antibodies may have in penetrating large inclusions. Our answer is that no system is perfect. In the case of the PA-fluorescent virus, the caveats revolve around the fact that the virus is attenuated (Figure 1a below), exhibiting a delayed infection as demonstrated by reduced levels of viral proteins (Figure 1b below). Consistent with this, vRNP accumulation in the cytosol differs: viral inclusions form later in infection, and the amount of vRNPs in the cytosol does not reach the levels observed with PR8-WT virus. After their emergence, inclusions behave as in the wild-type virus (PR8-WT), fusing and dividing (Figure 1c below) and displaying liquid properties.

      As the overarching goal of this manuscript is to evaluate the best strategies to harden liquid IAV inclusions and given that one of the parameters we were testing is concentration, we reasoned that using PR8-WT virus for our analyses would be reasonable.

      In conclusion, both systems have caveats that are important to assess systematically, and these differences may shift or alter thermodynamic parameters such as nucleation density, inclusion maturation rate, Cdense and Cdilute, in particular by varying the total concentration. As a note, to validate all our results using the PA-mNeonGreen PR8 virus, we considered the delayed kinetics and applied our thermodynamic analyses up to 20 hpi rather than 16 hpi.

      However, because this reviewer asked which solution best mitigates errors introduced by antibodies, we re-checked all our data. We not only compared the data from the attenuated, fluorescently tagged virus with our original data, but also compared images acquired as Z stacks (as used for concentration and for type/strength of interactions) with those acquired as 2D images. Our analysis revealed a very good match between antibody staining and the fluorescent vRNP virus when images were acquired as Z stacks and analysed as Z projections. Therefore, we re-analysed all our temperature-related thermodynamic data using images acquired from Z stacks and entirely revised Figure 2. We believe that all these comparisons and analyses have greatly improved the manuscript and hence we thank all reviewers for their input.

      Figure 1 – The PA-mNeonGreen virus is attenuated in comparison to the WT virus, and the Gibbs free energy data obtained with it are consistent with analyses of images of antibody-stained vRNPs. A. Representation of the PA-mNeonGreen virus (PA-mNG; Abbreviations: NCR: non coding region). B. Cells (A549) were transfected with a plasmid encoding mCherry-NP and co-infected with PA-mNeonGreen virus for 16h, at an MOI of 10. Cells were imaged under time-lapse conditions starting at 16 hpi. White boxes highlight vRNPs/viral inclusions in the cytoplasm in the individual frames. The dashed white and yellow lines mark the cell nucleus and the cell periphery, respectively. The yellow arrows indicate the fission/fusion events and movement of vRNPs/ viral inclusions. Bar = 10 µm. Bar in insets = 2 µm. C-D. Cells (A549) were infected or mock-infected with PR8 WT or PA-mNG viruses, at a multiplicity of infection (MOI) of 3, for the indicated times. C. Viral production was determined by plaque assay and plotted as plaque forming units (PFU) per milliliter (mL) ± standard error of the mean (SEM). Data are a pool from 2 independent experiments. D. The levels of viral PA, NP and M2 proteins and actin in cell lysates at the indicated time points were determined by western blotting. (E-G) Biophysical calculations in cells infected with the PA-mNeonGreen virus upon altering temperature (at 10 hpi), evaluating the concentration of vRNPs (over a time course) in conditions expressing native amounts of Rab11a or overexpressing low levels of Rab11a and upon altering the type/strength of vRNP interactions by adding nucleozin at 10 hpi during the indicated time periods. All data: Ccytoplasm/Cnucleus; Cdense, Cdilute, area aspect ratio and Gibbs free energy are represented as boxplots. Above each boxplot, same letters indicate no significant difference between them, while different letters indicate a statistical significance at α = 0.05 using one-way ANOVA, followed by Tukey multiple comparisons of means for parametric analysis, or Kruskal-Wallis Bonferroni treatment for non-parametric analysis.

      2) Although the authors have demonstrated that vRNP condensates exhibit several key characteristics of liquid condensates (they fuse and divide, they dissolve upon hypotonic shock or upon incubation with 1,6-hexanediol, FRAP experiments are consistent with a liquid nature), their aspect ratio (with a median above 1.4) is much higher than the aspect ratio observed for other cellular or viral liquid compartments. This is intriguing and might be discussed.

      IAV inclusions have been shown to interact with microtubules and the endoplasmic reticulum, which confer movement, and to undergo fusion and fission events. We propose that these interactions and movements exert forces that deform inclusions, making them less spherical. To test this assumption, we compared the aspect ratio of viral inclusions in the absence and presence of nocodazole (which abrogates microtubule-based movement). The data in Figure 2 show that in the presence of nocodazole, the aspect ratio decreases from 1.42 ± 0.36 to 1.26 ± 0.17, supporting our assumption.

      Figure 2 – Treatment with nocodazole reduces the aspect ratio of influenza A virus inclusions. Cells (A549) were infected with PR8 WT for 8 h and treated with nocodazole (10 µg/mL) for 2 h, after which the movement of influenza A virus inclusions was captured by live cell imaging. Viral inclusions were segmented, and the aspect ratio was measured with ImageJ, then analysed and plotted in R.
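
      For readers unfamiliar with the measurement, the following is a minimal Python sketch of how an aspect ratio can be derived from a segmented object. It is not the ImageJ routine used for the figure above: it assumes a moment-based ellipse fit, in which the ratio of major to minor axis equals the square root of the ratio of the two eigenvalues of the pixel covariance matrix. The function name `aspect_ratio` and the pixel-list input format are hypothetical, chosen only for illustration.

```python
import math


def aspect_ratio(pixels):
    """Return major/minor axis ratio of one segmented object.

    `pixels` is a list of (x, y) coordinates belonging to the object.
    Assumption: the object is approximated by the ellipse that shares
    its second central moments (not necessarily ImageJ's exact fit).
    """
    n = len(pixels)
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n

    # Second central moments (entries of the 2x2 covariance matrix).
    sxx = sum((x - mean_x) ** 2 for x, _ in pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / n

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = half_trace - root
    if lam_minor <= 0:
        raise ValueError("degenerate object (single pixel or straight line)")

    return math.sqrt(lam_major / lam_minor)


# Hypothetical test shapes: a 10x10 square and a 20x10 rectangle.
square = [(x, y) for x in range(10) for y in range(10)]
rectangle = [(x, y) for x in range(20) for y in range(10)]
print(aspect_ratio(square), aspect_ratio(rectangle))
```

      A perfectly round or square object returns a value of 1, and elongated objects return larger values, which is the sense in which a drop from 1.42 to 1.26 indicates more spherical inclusions.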

      3) Similarly, the fusion event presented at the bottom of figure 3I is dubious. It might as well be an aggregation of condensates without fusion.

      We have changed this (check Fig 5A and B in the manuscript), thank you for the suggestion.

      4) The authors could have more systematically performed FRAP/FLAPh experiments on cells expressing fluorescent versions of both NP and Rab11a to investigate the influence of condensate size, time after infection, or global concentrations of Rab11a in the cell (using the total fluorescence of overexpressed GFP-Rab11a as a proxy) on condensate properties.

      We have included a new figure, figure 5 with the suggested data.

    1. Author Response

      Reviewer #2 (Public Review):

      1) The main limitation of this study is that the results are primarily descriptive in nature, and thus, do not provide mechanistic insight into how Ryr1 disease mutations lead to the muscle-specific changes observed in the EDL, soleus and EOM proteomes.

      An intrinsic feature of high-throughput proteomic analysis is the generation of lists of differentially expressed proteins (DEPs) in different muscles from WT and mutated mice. Although defining mechanistic insights related to changes in dozens of proteins is very interesting, it is a difficult task to accomplish and goes beyond the goal of the high-throughput proteomic analysis presented here. Nevertheless, the analysis of DEPs may indeed provide arguments to speculate on the pathogenesis of the phenotype linked to recessive RyR1 mutations. In the unrevised manuscript, we pointed out that the fiber type I predominance observed in congenital myopathies linked to recessive Ryr1 mutations is consistent with the high expression level of heat shock proteins in slow twitch muscles. However, as suggested by Reviewer 3, we have removed "vague statements" concerning major insights into pathophysiological mechanisms from the text of the revised manuscript, since we are aware that the mechanistic information, if any, that we can extract from the data set cannot exceed the intrinsic limitations of high-throughput proteomic technology.

      b) Results comparing fast twitch (EDL) and slow twitch (soleus) muscles from WT mice confirmed several known differences between the two muscle types. Similar analyses between EOM/EDL and EOM/soleus muscles from WT mice were not conducted.

      We agree with the point raised by the Reviewer. In the revised manuscript we have changed Figure 2. The new Figure 2 shows the analysis of differentially expressed proteins in EDL, soleus and EOMs from WT mice. We have also added 2 new Tables (new Supplementary Tables 2 and 3) and have inserted our findings in the revised Results section (page 7, lines 157-176, pages 8 and 9).

      c) While a reactome pathway analysis for proteins changes observed in EDL is shown in Supplemental Figure 1, the authors do not fully discuss the nature of the proteins and corresponding pathways impacted in the other two muscle groups analyzed.

      We have now included in the revised manuscript a new Figure 2 which includes the Reactome pathway analysis comparing EDL with soleus, EDL with EOM and soleus with EOM (panels C, F and I, respectively). We have also inserted into the revised manuscript a brief description of the pathways showing the greatest changes in protein content (page 7 line 156-175, pages 8 and 9). We agree that the data showing changes in protein content between the 3 muscle groups of the WT mice are important also because they validate the results of the proteomic approach. Indeed, the present results confirm that many proteins including MyHCIIb, calsequestrin 1, SERCA1, parvalbumin etc are more abundantly expressed in fast twitch EDL muscles compared to soleus. Similarly, our results confirm that EOMs are enriched in MyHC-EO as well as cardiac isoforms of ECC proteins. This point has been clarified in the revised version of the manuscript (page 8, lines 198-213; page 9 lines 214-228). Nevertheless, we would like to point out that the main focus of our study is to compare the changes of protein content induced by the presence of recessive RyR1 mutations.

      Reviewer #3 (Public Review):

      a) it would be useful to determine whether changes in protein levels correlated with changes in mRNA levels …….

      We performed qPCR analysis of Stac3 and Cacna1s in EDL, Soleus and EOM from WT mice (see Figure 1 below). The expression of transcripts encoding Cacna1s and Stac3 is approximately 9-fold higher in EDL compared to Soleus. The fold change of Stac3 and Cacna1s transcripts in EDL muscles is higher than the differences we observed by mass spectrometry at the protein level between EDL and Soleus. Indeed, we found that the content of the Stac3 protein in EDL is 3-fold higher than that in soleus. Although there is no apparent linear correlation between mRNA and protein levels, we believe that a few plausible conclusions can be drawn, namely: (i) the expression level of both transcripts and proteins is higher in EDL compared to EOM and soleus muscles, respectively, (ii) the expression level of transcripts encoding Stac3 correlates with that of transcripts encoding Cacna1s and confirms the proteomic data. In addition, the level of Stac3 transcript does not change between WT and dHT, confirming our proteomic data, which show that Stac3 protein content in muscles from dHT is similar to that found in WT littermates. Altogether these results support the concept that the differences in Stac3 content between EDL and soleus occur at both the protein and transcript levels, namely that a high Stac3 mRNA level correlates with higher protein content (EDL) and a low mRNA level correlates with low Stac3 protein content in Soleus muscles (see Figure 1 below).

      Figure 2: qPCR of Cacna1s and Stac3 in muscles from WT mice. The expression levels of the transcripts encoding Cacna1s and Stac3 are the highest in EDL muscles and the lowest in soleus muscles (top panels). There are no significant changes in their relative expression levels in dHT vs WT. Each symbol represents the value from a single mouse. * p=0.028, Mann-Whitney test. qPCR was performed as described in Elbaz et al., 2019 (Hum Mol Genet 28, 2987-2999).

      ….and whether or not the protein present was functional, and whether Stac3 was in fact stoichiometrically depleted in relation to Cacna1s.

      We thought about this point but think that there are no plausible arguments to believe that Stac3 is not functional, one simple reason being that our WT mice do not have a phenotype which would be associated with the absence of Stac3 (Reinholt et al., PLoS One 8, e62760 2013, Nelson et al. Proc. Natl. Acad. Sci. USA 110:11881 2013).

      b) In the abstract, the authors stated that skeletal muscle is responsible for voluntary movement. It is also responsible for non-voluntary. The abstract needs to be refocused on the mutation and on what we learn from this study. Please avoid vague statements like "we provide important insights to the pathophysiological mechanisms..." mainly when the study is descriptive and not mechanistic.

      The abstract of the revised manuscript has been rewritten. In particular, we removed statements referring to important “pathophysiological mechanistic insight”.

      c) The author should bring up the mutation name, location and phenotype early in the introduction.

      In the revised manuscript we provide the information requested by the Reviewer (page 2 lines 36-38 and page 4, lines 98-102).

      d) This reviewer also suggests that the authors refocus the introduction on the mutation location in the 3D RyR1 structure (available cryo-EM structure), if there is any nearby ligand binding site, protomers junction or any other known interacting protein partners. This will help the reader to understand how this mutation could be important for the channel's function

      The residue Ala4329 is present inside the TMx (Auxiliary transmembrane helices) domain which spans from residue 4322 to 4370 and interposes structurally (des Georges A et al. 2016 Cell 167,145-57; Chen W, et al. 2020 EMBO Rep. 21, e49891). Although the structural resolution of the region has been improved (des Georges et al, 2016), parts of the domain still remain with no defined atomic coordinates, especially the region encompassing a.a. E4253 – F4540. Because of such undefined atomic coordinates of the region E4253-F4540, we are not able to determine the real orientation and the disposition of the amino acids in this region, including the A4329 residue. As reference, structure PDB: 5TAL of des Georges et al, 2016 was analyzed with UCSF Chimera (production version 1.16) (Pettersen et al. J. Comput. Chem. 25: 1605-1612. doi: 10.1002/jcc.20084).

    1. Author Response:

      Reviewer #1 (Public Review):

      In this study, Kuppan, Mitrovich, and Vahey investigated the impact of antibody specificity and virus morphology on complement activation by human respiratory syncytial virus (RSV). By quantifying the deposition of components of the complement system on RSV particles using high-resolution fluorescence microscopy, they found that antibodies that bind towards the apex of the RSV F protein in either the pre- or post-fusion conformation activated complement most efficiently. Additionally, complement deposition was biased towards globular RSV particles, which were frequently enriched in F in the post-fusion conformation compared to filamentous particles on which F exists predominantly in the pre-fusion conformation.

      Strengths:

      1) While many previous studies have examined the properties of antibodies that impact Fc-mediated effector functions, this study offers a conceptual advance in its demonstration that heterogeneity in virus particle morphology impacts complement activation. This novel finding will motivate further research on this topic both in the context of RSV and other viral infections.

      2) The use of site-specific labeling of viral proteins and high-resolution fluorescence microscopy represents a technical advance in monitoring interactions among different components of antiviral immune responses at the level of single virus particles.

      3) The paper is well written, data are clearly presented and support key claims of the paper with caveats appropriately acknowledged.

      We appreciate the reviewer’s supportive comments. In our revised manuscript, we have focused on improving clarity regarding the minor weaknesses noted below.

      Minor weaknesses:

      Working models and their implications could be clarified and extended. Specifically:

      1) The finding that globular particles enriched in F proteins in the post-fusion conformation (Fig 3F) are dominant targets of complement activation as measured by C3 deposition by not only post-F- but also pre-F-specific antibodies (Fig 4B, left) is interesting. This is despite the fact that, as expected, pre-F antibodies bind less efficiently to globular particles (Fig 4B, right). How do the authors reconcile these observations, given that C3 deposition seems to be IgG-concentration-dependent (Fig 2E)?

      The reviewer raises an excellent point: globular particles, which accumulate as the virus ages, contain more post-F and less pre-F than particles that have recently been shed from infected cells. These ‘aged’ particles nonetheless accumulate more C3 when incubated with pre-F mAbs than ‘younger’ particles, where the proportion of pre-F is higher. We attribute this to the lower surface curvature of globular particles: they accumulate more C3 in the presence of pre-F mAbs in spite of the reduced availability of pre-F epitopes. Figure 1C and 1F help to support this point. This data shows C3 deposition driven by different antibodies bound to particles enriched in either pre-F (Figure 1C) or post-F (Figure 1F). Importantly, for this experiment the conversion to post-F was driven in such a way that virion morphology is preserved (Figure 1E). In this case, we see a clear reduction in C3 deposition by pre-F mAbs on post-F particles (e.g. for CR9501, the percentage of C3-positive particles drops from 24% on pre-F virus to 6% on post-F-enriched virus). This demonstrates that, in the absence of other changes, conversion of pre-F to post-F reduces complement deposition by pre-F specific mAbs.

      Similarly, the reviewer correctly points out that reduced levels of antibody binding lead to lower levels of C3 deposition (Figure 2E); however, as in Figure 1, this data is collected from particles with the same morphologies. Thus, in the absence of additional factors, reduction in mAbs bound to pre-F leads to a reduction in C3 deposition driven by these mAbs. The fact that we observe the opposite trend when changes in particle morphology accompany changes in post-F abundance points to an important role for particle shape in activation of the classical pathway.

      2) Based on data in Figure 5-figure supplement 2, the authors argue that "large viruses are poised to evade complement activation when they emerge from cells as highly-curved filaments, but become substantially more susceptible as they age or their morphology is physically disrupted." Could the increase in C3 deposition be alternatively explained by a higher density of F proteins on larger particles instead of / in addition to a larger potential decrease in membrane curvature?

      We agree that the density of F on a virus – the number of F trimers per unit surface area - likely contributes to the efficiency of C3 deposition. In Figure 6 – figure supplement 2 (Figure 5 – figure supplement 2 in the original submission), we control for this potential effect by comparing viruses that have the same amount of F (as measured by fluorescence intensities of SrtA-labeled F) that are either in filamentous form or globular form (induced through osmotic swelling). The total amount of F per virus is preserved during swelling, and the membrane surface area will remain constant due to the limited ability of lipid bilayers to stretch [7]. As a result, the input material for these comparisons is the same in terms of F trimers per unit area, yet the C3:F ratio differs substantially. This leads us to conclude that the differences must be attributable to factors other than the density of F. Importantly, this does not mean that the amount of F per unit surface area does not matter for C3 deposition – only that this is not the effect we are observing here. We have added text (Line 299) to help clarify this point: “This effect is unlikely to arise due to changes in the abundance or density of F in the viral membrane, both of which will remain constant following swelling. Similarly, it does not appear to be purely related to size, as larger viral filaments show similar C3:F ratios as smaller viral filaments.”

      3) In the discussion, the authors acknowledge that the implications based on the findings are speculative. However, more clarity on the basis of these speculative models would be useful. For example, it is not clear how the findings directly inform the presented model of immunodominance hierarchies in infants.

      We agree that this was unclear in the original manuscript. We have rewritten paragraph 4 of the Discussion to clarify how our results may contribute to the changes in immunodominance that have been observed in RSV between infants and adults.

      Reviewer #2 (Public Review):

      This is an intriguing study that investigates the role of virus particle morphology on the ability of the first few components in the complement pathway to bind and opsonize RSV virions. The authors use primarily fluorescence microscopy with fluorescently tagged F proteins and fluorescently labeled antibodies and complement proteins (C3 and C4). They observed that antibodies against different epitopes exhibited different abilities to induce C3 binding, with a trend reflecting positioning of IgG Fc more distal to the viral membrane resulting in better complement "activation". They also compared the ability of C3 to deposit on virus produced from cells +/- CD55, which inhibits opsonization, and showed knockout led to greater C3 binding, indicating a role for this complement "defense protein" in RSV opsonization. They also examined kinetics of complement protein deposition (probed by C4 binding) to globular vs filamentous particles, observing that deposition occurred more rapidly to non-filaments.

      A better understanding of complement activation in response to viruses can lead to a more comprehensive understanding of the immune response to antigen both beneficial and detrimental, when dysfunctional, during infection as well as mechanisms of combating the viral infection. The study provides new mechanistic information for understanding the properties of an enveloped virus that can influence complement activation, at least in an in vitro setting. It remains to be determined whether these effects manifest in the considerably more complex setting of natural infection or even in the presence of a polyclonal antibody mixture.

      The studies are elegantly designed and carefully executed with reasonable checks for reproducibility and controls, which is important especially in a relatively complex and heterogeneous experimental system.

      We thank the reviewer for the insightful comments. We have revised the manuscript to help to clarify points of confusion and to address some of the technical points raised here.

      Specific points:

      1) "Complement activation" involves much more than C3 or C4 binding. Better to use more specific terminology relating to the observable (i.e. fluorescently labeled complement component binding)

      We agree with the reviewer. We have revised the manuscript throughout to make our language more accurate and precise.

      2) What is the rationalization for concentrations of antibodies used? What range was tested, and how dependent on antibody concentration were the observed complement deposition trends? How do they relate to physiological concentrations, and how would the presence of a more complex polyclonal response that is typically present (e.g. as the authors noted, the serum prior to antibody depletion already mediates complement activation) affect the complement activation trends? The neat, uniform display of Fc for monoclonals that were tested is likely to be quite garbled in more natural antibody response situations. This should be discussed.

      We have added discussion of antibody concentrations and possible differences between monoclonal and polyclonal responses to the revised manuscript. Below, we address the specific questions raised here by the reviewer.

      We chose to use antibody concentrations that are comparable to the concentrations of dominant clonotypes in post-vaccination serum [1]. Our goal in selecting relatively high antibody concentrations for our experiments was to focus on understanding the capacity of an antibody to drive complement deposition when it has reached maximum densities on RSV particles. This is discussed starting on Line 125 of Results, and in paragraph 2 of Discussion. Experiments testing a range of antibody concentrations would be valuable, but are likely to strongly reflect differences in the binding affinities of these antibodies, which have been characterized previously.

      Although we have not performed titrations for each of the antibodies tested due to the large number of conditions needed and the limited throughput of our experimental approach, the manuscript does present a dilution series for CR9501, the IgG1 mAb with the greatest potency in driving C3 deposition among those tested here. This data (shown in Figure 3E & F in the revised manuscript) shows that as the amount of antibody added in solution decreases over a 16-fold range, C3 deposition decreases as well. The decrease in C3 deposition is roughly commensurate with the reduction in antibody binding, reaching levels that are just above background at an antibody concentration of ~0.6μg/ml (1:800 dilution). We think it is likely that other activating antibodies would show similar trends, while antibodies that do not activate the classical pathway at saturating concentrations would be unlikely to do so across a range of lower concentrations.
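
      To make the arithmetic of the dilution series concrete, here is a small Python sketch. It is only an illustration: the response states a 16-fold range ending at ~0.6 μg/ml at a 1:800 dilution, while the two-fold step size, the resulting five dilution points and the helper name `dilution_series` are assumptions made for this example. The stock concentration is back-calculated from the stated endpoint and is therefore approximate.

```python
def dilution_series(stock_ug_per_ml, first_dilution, fold_step, n_points):
    """Return (dilution factor, concentration in ug/ml) pairs.

    Assumption: each step dilutes by a constant factor `fold_step`.
    """
    series = []
    dilution = first_dilution
    for _ in range(n_points):
        series.append((dilution, stock_ug_per_ml / dilution))
        dilution *= fold_step
    return series


# Back-calculated from the stated endpoint: ~0.6 ug/ml at 1:800.
STOCK_UG_PER_ML = 0.6 * 800

# Hypothetical two-fold series spanning 1:50 to 1:800 (a 16-fold range).
cr9501_series = dilution_series(STOCK_UG_PER_ML, 50, 2, 5)
for dilution, concentration in cr9501_series:
    print(f"1:{dilution}  ~{concentration:.1f} ug/ml")
```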

      We agree with the reviewer that complement deposition driven by polyclonal antibodies is more complex than the monoclonal responses studied here. As discussed in paragraph 2 of our revised Discussion, one effect that polyclonal serum might have is to increase the density of Fcs on the virus by providing antibody mixtures that bind to multiple non-overlapping antigenic sites. We speculate that this would generally increase complement deposition, provided that sufficient antibodies are present that bind to productive antigenic sites (e.g. sites Ø, II, and V).

      Finally, we note that we observe a similar phenomenon where globular particles are preferentially opsonized with C3 in our experiments with polyclonal serum where IgG and IgM have not been depleted (Figure R1). The major limitation of this data – which is resolved by using monoclonal antibodies – is the difficulty of determining to what extent this bias arises due to the epitopes targeted by the polyclonal serum versus the intrinsic sensitivity of the virus particles.

      Figure R1: RSV opsonized with polyclonal human serum. A similar bias towards globular particles (white dashed circles) is observed as in experiments with monoclonal antibodies.

      3) Are there artifacts or caveats resulting from immobilization of virus particles on the coverslips?

      As pointed out by the reviewer, a few possible artifacts or caveats could arise due to the immobilization of viruses on coverslips. These include (1) spurious binding of C1 or other complement components to the immobilizing antibody (3D3); (2) reduced access to viral antigens as a result of immobilization; and (3) inhibition of antibody-induced viral aggregation. We are able to rule out issues associated with (1), because we do not see attachment of C1 or C3 to the coverslip (i.e. outside regions occupied by virus particles). This is consistent with the fact that the antibodies are immobilized on the surface via a C-terminal biotin attached to the heavy chain, which would limit access for C1 binding and prevent the formation of Fc hexamers.

      Immobilization on coverslips could reduce the accessibility of a portion of the virus for binding by antibodies and complement proteins. This could effectively shield a portion of the viral surface from assembly of an activating complex, which we estimate requires ~35nm of clearance above the targeted epitope on F [8]. Importantly, the fraction of the viral surface area that would be shielded would vary for filaments and spheres; to determine if this could influence our results, we calculated the expected magnitude of this effect (Figure R2). To do this, we modeled the virus as being tethered to the surface via a 25nm linkage. This accounts for the length of the biotinylated PEG (~5-15nm for PEG2K, depending on the degree of extension), streptavidin (~5nm), and the anti-G antibody (~10-15nm including the biotinylated C-terminal linker). Although limited structural information is available for RSV G, the ~100 residue, heavily glycosylated region between the viral membrane and the 3D3 epitope likely extends above the height of F (~12nm). Our model assumes that a shell of thickness d surrounding the virus is necessary for antibody-C1 complexes to fit without clashing with the surface (this shell is shaded in gray in the schematic from Figure R2). Tracing the angles at which this shell clashes with the coverslip allows us to calculate the fraction of total surface area that is inaccessible for activation of the classical pathway. The results are plotted on the right side of Figure R2. The relative surface area accessible to a 35nm activating antibody-C1 complex differs between a filament and a sphere of equivalent surface area by about 15%. We conclude that this difference is modest compared to the ~5-fold difference in deposition kinetics we observe between viral filaments and spheres (Figure 4), or the 3- to 10-fold difference in relative C3 deposition we observe on larger filamentous particles after conversion to spheres (Figure 6 – figure supplement 2C).
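
      The geometric argument in the preceding paragraph can be sketched numerically. The Python below is an illustrative reconstruction under stated assumptions, not the authors' calculation: it assumes the filament lies flat on the coverslip, ignores the filament end caps, treats the clash condition as the point where the shell of thickness d (measured along the surface normal) dips below the coverslip plane, and uses hypothetical particle dimensions (a 50 nm filament radius and 1000 nm length). The function names are likewise invented for this example. Only d = 35 nm and the 25 nm tether height come from the text.

```python
import math


def shielded_fraction_sphere(radius_nm, shell_nm=35.0, tether_nm=25.0):
    """Fraction of a sphere's surface whose shell clashes with the coverslip.

    The sphere's centre sits at height radius + tether. A surface point at
    polar angle theta from straight down clashes when
    (radius + shell) * cos(theta) > radius + tether, i.e. inside a cap.
    """
    if shell_nm <= tether_nm:
        return 0.0  # the shell never reaches the coverslip
    cos_c = (radius_nm + tether_nm) / (radius_nm + shell_nm)
    return (1.0 - cos_c) / 2.0  # area fraction of a spherical cap


def shielded_fraction_filament(radius_nm, shell_nm=35.0, tether_nm=25.0):
    """Same clash condition for the side wall of a horizontal cylinder."""
    if shell_nm <= tether_nm:
        return 0.0
    theta_c = math.acos((radius_nm + tether_nm) / (radius_nm + shell_nm))
    return theta_c / math.pi  # arc of 2*theta_c out of 2*pi


def equivalent_sphere_radius(filament_radius_nm, length_nm):
    """Sphere radius with the same area as the cylinder side wall.

    Solves 4*pi*a_s**2 = 2*pi*a_f*L (end caps ignored, an assumption).
    """
    return math.sqrt(filament_radius_nm * length_nm / 2.0)


# Hypothetical particle dimensions, chosen only for illustration.
a_f, length = 50.0, 1000.0
a_s = equivalent_sphere_radius(a_f, length)
accessible_filament = 1.0 - shielded_fraction_filament(a_f)
accessible_sphere = 1.0 - shielded_fraction_sphere(a_s)
print(f"filament accessible fraction: {accessible_filament:.3f}")
print(f"sphere accessible fraction:   {accessible_sphere:.3f}")
```

      Under these assumptions the filament loses a larger share of its surface than the sphere, because the shielded strip on a cylinder scales with an angle while the shielded cap on a sphere scales with the much smaller cap area. The exact numbers depend on the chosen dimensions and should not be read as a reproduction of Figure R2.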

      Finally, by performing experiments on immobilized viruses, we eliminate the possibility of antibody-dependent particle aggregation. While this was necessary for us to get interpretable results, the formation of viral aggregates could affect the dynamics and extent of complement deposition. For example, activation of the classical pathway on one particle in an aggregate could spread to non-activating particles through a “bystander effect”, as has been reported in other contexts (ref. 9). We are interested in this question and have begun preliminary experiments in this direction; however, we believe that a definitive answer is outside the scope of this current work. To alert readers to this consideration, we have added it to paragraph 2 of the revised Discussion (Line 359).

      Figure R2: Estimating the surface accessibility of RSV particles bound to coverslips. Definition of variables: af: radius of cylindrical RSV filament; as: radius of spherical RSV particle of equivalent surface area (see Figure 6 – figure supplement 2A); d: distance needed above the viral surface to accommodate IgG-C1 activating complexes; h: height of viral surface above the coverslip; L: length of the viral filament.

      4) How is the "density of antigen" quantitated? What fraction of F or G is labeled? For fluorescence intensity measurements in general, how did the authors ensure their detection was in a linear sensitivity range for the detectors for the various fluorescent channels? Since quantitation of fluorescence intensities is important in this study, some discussion in methods would be valuable.

      We have performed this important additional characterization of our fluorescence system and our overall labeling and quantification strategy to address these concerns. The results of this characterization are now included in two new figure supplements in the revised manuscript (Figure 1 – figure supplements 2 & 3).

      5) The authors also show that the particle morphology, whether globular or filamentous, as well as relative size and resulting apparent curvature, correlate with ability of C3 to bind. Some link to the abundance of post-fusion F (post-F) is examined and discussed, but I found the back and forth discussion between morphology, C3 binding, and post-F abundance to be confusing and in need of clarification and streamlining. Is there a mechanistic link between morphology changes and post-F level increases? Are the two linked or coincidental (for example does pre-F interaction with matrix help stabilize that conformation, and if lost lead to spontaneous conversion to post-F?). Please clarify.

      Specifically, we have separated the discussion of pre-F versus post-F abundance and particle morphology into two different sections in Results, and we have rearranged Figures 4 and 5 (Figures 3 and 4 in the original submission) to improve clarity.

      Regarding the question of whether changes in morphology and the pre-F to post-F conversion are coincidental or mechanistically linked: the answer is not entirely clear, although we have collected new data that suggest a connection. We first want to note that the two effects are at least partly separable: brief treatment with a low osmolarity solution causes particle shape to change while preserving pre-F (Figure 6A & B), whereas treating with an osmotically balanced solution with low ionic strength converts pre-F to post-F without affecting virus shape (Figure 1E). However, we were motivated by the reviewer’s questions to look into this further. To determine if the change in viral shape may serve to destabilize the pre-F conformation over time, we compared the relative amounts of pre-F and post-F present in particles that were osmotically swollen to those that were not at 0h and at 24h. In these experiments, particles were swollen using a brief (~1 minute) exposure to low osmolarity conditions before returning them to PBS (Figure R3, left). As expected, we observe no immediate change in pre-F abundance following the brief osmotic shock (Figure R3, right: 0h time point), consistent with Figure 6B. After incubating the particles for an additional 24h at 37 °C, the post-F-to-pre-F ratio is ~3.5-fold higher in osmotically-swollen particles than in those where filamentous morphology was initially preserved (Figure R3, right: 24h time point). This supports the reviewer’s suggestion that interactions with the matrix may help to stabilize F in the prefusion conformation, since the conversion to post-F is faster when this interaction is disrupted. Whether or not this has any relevance for RSV entry into cells remains to be determined; however, it is worth noting that we observed no clear loss or gain of infectivity in RSV particles following osmotic swelling (Figure 6 – figure supplement 1A). Since this result may be of interest to readers, we have included these new data in Figure 6 – figure supplement 1B, and they are discussed briefly in Results (Line 250).

      Figure R3: Determining stability of pre-F following matrix detachment. Left: Experimental design. Right: Comparison of pre-F stability on untreated particles (gray) and particles subjected to brief osmotic swelling (magenta). Distributions show the ratio of post-F (ADI-14353) to pre-F (5C4) intensities per particle, combined for four biological replicates, sampled at 0h (immediately after swelling) and after an additional incubation at 37 °C for 24h. Black points show median values for each individual replicate. P-values are determined from a two-sample t-test.

      6) Since their conclusion is that curvature of the virus surface is a major influence on the ability of complement proteins to bind, I feel that some effort at modeling this effect based upon known structures is warranted. One might also anticipate then that there would be some epitope-dependent effect as a result of changes in curvature that may lead to an exaggeration of the epitope-specific effects for more highly curved particles perhaps than those with lower curvature? Is this true?

      The reviewer raises two excellent points: that it may be possible to gain insight into the mechanisms through which curvature dictates C1 binding and other aspects of complement activation through structural modeling, and that such a model may help to identify specific epitope effects that could contribute to curvature dependence.

      We developed simulations based on the geometry of RSV, F, and hexameric IgG to try to better understand how curvature may influence initiation of the classical pathway. This model is described in the Methods section (Modeling IgG hexamers on curved surfaces), and the results are discussed in the final two paragraphs of the Results section. In addition, we have included a new figure (Figure 7) to summarize the model’s predictions. This model corroborates the curvature sensitivity of IgG hexamer formation and suggests a possible intuitive explanation for our findings: high curvature effectively increases the distance between epitopes that sit high above the viral membrane, decreasing the likelihood of hexamer formation (Figure 7D). Regarding epitope specific effects, this model suggests that the further the epitope is above the viral membrane, the greater the effect that decreasing curvature will have. However, we find that epitopes closer to the membrane (e.g. those bound by 101F or ADI-19425) are overall very inefficient at activating the classical pathway, potentially due to steric obstruction of the formation of IgG hexamers. Thus, there may be an inherent tradeoff between overcoming steric obstruction (by binding to epitopes distal to the membrane) and sensitivity to surface curvature.

      It is important to note that this model is reductionist and does not include detailed structural information. Additional factors may be important for considering epitope-specific effects. For example, antibodies that bind equatorially on F (e.g. ADI-19425, which binds to antigenic site III), show minimal complement deposition in our experiments. However, particles whose curvature approaches the diameter of hexameric IgG or IgM (~20nm) may display these epitopes in a manner that is more accessible. If the curvature necessary to observe such an effect falls outside of the biologically accessible range, it would not be observable in our experiments. Nonetheless, it is possible that a different set of antibodies may drive complement deposition on highly-curved nanoparticle vaccines that are in development (ref. 10). We have added this important point to the second paragraph of the Discussion.

      7) Line 265: it would be useful to confirm the increase C1 binding as a function of morphology as was done for antibody-angle of binding experiments.

      We believe that this data is shown in Figure 6B (Figure 5B in the original manuscript).

      Reviewer #3 (Public Review):

      Overall the manuscript is clearly written and the data are displayed well, with helpful diagrams in the figures to illustrate assays and RSV F epitopes. The engineering of the RSV strain to include a fluorescent reporter and tags on F and G that serve as substrates for fluorophore attachment is impressive and is a strength. The RSV literature is well cited and the interpretation of the results is consistent with structure/function data on RSV F and its interaction with antibodies. This reviewer is not an expert on the experiments performed in this manuscript, but they appear to be rigorously performed with appropriate controls. As such, the conclusions are justified by the data. One weakness is the extent to which the results regarding virion morphology are biologically relevant. Non-filamentous forms of the virion are generally obtained only in vitro as a result of virion purification or biochemical treatment. However, these results may be relevant for certain vaccine candidates, including the failed formalin-inactivated RSV vaccine that was evaluated in the late 1960s and caused vaccine-enhanced disease upon natural RSV infection.

      Thank you for these suggestions, which have helped us to better place our results regarding RSV morphology in the context of prior work. We agree with the reviewer that non-filamentous RSV particles are commonly obtained in vitro, and that this morphology does not reflect the structure of the virus as it is budding from infected cells. Our work has characterized the transition from filament to globular / amorphous form, with the finding that it can occur rapidly upon physical or chemical perturbations, as well as more gradually during natural aging: i.e. in the absence of handling or purification. We are also able to detect globular particles accumulating in cultured A549 cells, where no handling has occurred prior to observation (Figure 5 – figure supplement 1). While we do not currently know how well this reflects the tendency of RSV to undergo conversion from filament to sphere in vivo, we propose that it is plausible that such a transformation could occur. To distinguish between what we demonstrate and what we speculate, we write (Line 401): “Although more work is needed to understand the prevalence of globular particles during in vivo infection, our observations that these particles accumulate over time through the conversion of viral filaments – even under normal cell culture conditions - suggest that their presence in vivo is feasible, where the physical and chemical environment would be considerably harsher and more complex.”

      We agree with the reviewer that our results may have relevance towards understanding the failed formalin-inactivated vaccine trial. We have added this to paragraph 5 of the Discussion section.

    1. Author Response

      Public Evaluation Summary:

      The authors re-analyzed a previously published dataset and identified patterns suggesting that increased bacterial biodiversity in the gut may create new niches that lead to gene loss in a focal species and promote the generation of more diversity. Two limitations are (i) that sequencing depth may not be sufficient to analyze strain-level diversity and (ii) that the evidence is exclusively based on correlations, and the observed patterns could also be explained by other eco-evolutionary processes. The claims should be supported by a more detailed analysis, and alternative hypotheses that the results do not fully exclude should be discussed. Understanding drivers of diversity in natural microbial communities is an important question that is of central interest to biomedically oriented microbiome scientists, microbial ecologists and evolutionary biologists.

      We agree that understanding the drivers of diversity in natural communities is an important and challenging question to address. We believe that our analysis of metagenomes from gut microbiomes is complementary to controlled laboratory experiments and modeling studies. While these other studies are better able to establish causal relationships, we rely on correlations, a caveat that we make clear, and we offer different mechanistic explanations for the patterns we observe.

      We also mention the caveat that we are only able to measure sub-species genetic diversity in relatively abundant species with high sequencing depth in metagenomes. These relatively abundant species include dozens of species in two metagenomic datasets, and we see no reason why they would not generalize to other members of the microbiome. Nonetheless, further work will be required to extend our results to rarer species.

      Our revised manuscript includes two major new analyses. First, we extend the analysis of within-species nucleotide diversity to non-synonymous sites, with generally similar results. This suggests that evolutionarily older, less selectively constrained synonymous mutations and more recent non-synonymous mutations that affect protein structure both track similarly with measures of community diversity – with some subtle differences described in the manuscript.

      Second, we extend our analysis of dense time series data from one individual stool donor and one deeply covered species (B. vulgatus) to four donors and 15 species. This allowed us to reinforce the pattern of gene loss in more diverse communities with greater statistical support. Our correlational results are broadly consistent with the predictions of DBD from modeling and experimental studies, and they open up new lines of inquiry for microbiome scientists, ecologists, and evolutionary biologists.

      Reviewer #1 (Public Review):

      This paper makes an important contribution to the current debate on whether the diversity of a microbial community has a positive or negative effect on its own diversity at a later time point. In my view, the main contribution is linking the diversity-begets-diversity patterns, already observed by the same authors and others, to genomic signatures of gene loss that would be expected from the Black Queen Hypothesis, establishing an eco-evolutionary link. In addition, they test this hypothesis at a more fine-grained scale (strain-level variation and SNP) and do so in human microbiome data, which adds relevance from the biomedical standpoint. The paper is a well-written and rigorous analysis using state-of-the-art methods, and the results suggest multiple new experiments and testable hypotheses (see below), which is a very valuable contribution.

      We thank the reviewer for their generous comments.

      That being said, I do have some concerns that I believe should be addressed. First of all, I am wondering whether gene loss could also occur because of environmental selection that is independent of other organisms or the diversity of the community. An alternative hypothesis to the Black Queen is that there might have been a migration of new species from outside and then loss of genes could have occurred because of the nature of the abiotic environment in the new host, without relationship to the community diversity. Telling the difference between these two hypotheses is hard and would require extensive additional experiments, which I don't think is necessary. But I do think the authors should acknowledge and discuss this alternative possibility and adjust the wording of their claims accordingly.

      We concur with the reviewer that the drivers of the correlation between community diversity and gene loss are unclear. Therefore, we have now added the following text to the Discussion:

      “Here we report that genome reduction in the gut is higher in more diverse gut communities. This could be due to de novo gene loss, preferential establishment of migrant strains encoding fewer genes, or a combination of the two. The mechanisms underlying this correlation remain unclear and could be due to biotic interactions – including metabolic cross-feeding as posited by some models (Estrela et al., 2022; San Roman and Wagner, 2021, 2018) but not others (Good and Rosenfeld, 2022) – or due to unknown abiotic drivers of both community diversity and gene loss.”

      Additionally, we have revised Figure 1 to show that strain invasions/replacements, in addition to evolutionary change, could be an important driver of changes in intra-species diversity in the microbiome.

      Another issue is that gene loss is happening in some of the most abundant species in the gut. Under Black Queen though, we would expect these species to be most likely "donors" in cross-feeding interactions. Authors should also discuss the implications, limitations, and possible alternative hypotheses of this result, which I think also stimulates future work and experiments.

      We thank the reviewer for raising this point. It is unclear to us whether the more abundant species would be donors in cross-feeding interactions. If we understand correctly, the reviewer is suggesting that more abundant donors will contribute more total biomass of shared metabolites to the community. This idea makes sense under the assumption that the abundant species are involved in cross-feeding interactions in the first place, which may or may not be the case. As our work heavily relies on a dataset that we previously analyzed (HMP), we wish to cite Figure S20 in Garud, Good et al. 2019 PLoS Biology in which we found there are comparable rates of gene changes across the ~30 most abundant species analyzed in the HMP. This suggests that among the most abundant species analyzed, there is no relationship between their abundance and gene change rate.

      That being said, we acknowledge that our study is limited to the relatively abundant focal species and state now in the Discussion: “Deeper or more targeted sequencing may permit us to determine whether the same patterns hold for rarer members of the microbiome.”

      Regarding Figure 5B, there are a couple of questions I believe the authors should clarify. First, how is it possible that many species have close to 0 pathways? Second, besides the overall negative correlation, the data shows some very conspicuous regularities, e.g. many different "lines" of points with identical linear negative slope but different intercept. My guess is that this is due to some constraints in the pathway detection methods, but I struggle to understand it. I think the authors should discuss these patterns more in detail.

      We sincerely thank the reviewer for raising this issue, as it prompted us to investigate more deeply the patterns observed at the pathway level. In short, we decided to remove this analysis from the paper because of a number of bioinformatics issues that we realized were contributing to the signal. However, in support of BQH-like mechanisms at play, we do find evidence for gene loss in more diverse communities across multiple species in both the HMP and Poyet datasets. Below we detail our investigation into Figure 5B and how we arrived at the conclusion that it should be removed:

      (1) Regarding data points in Figure 5B where many focal species have “zero pathways”, we first clarify how we compute pathway presence and richness. Pathway abundance data per species were downloaded from the HMP1-2 database, and these pathway abundances were computed using HUMAnN (HMP Unified Metabolic Analysis Network). According to HUMAnN documentation, pathway abundance is proportional to the number of complete copies of the pathway in the community; this means that if at least one component reaction in a certain pathway is missing coverage (for a sample-species pair), the pathway abundance may be zero (note that HUMAnN also employs “gap filling” to allow no more than one required reaction to have zero abundance). As such, it is likely that insufficient coverage, especially for low-abundance species, causes many pathways to report zero abundance in many species in many samples. Indeed, 556 of the 649 species considered had zero “present” pathways (i.e. having nonzero abundance) in at least 400 of the 469 samples (see figure below).

      (2) We thank the reviewer for pointing out the “conspicuous regularities” in Figure 5B, particularly “parallel lines” of data points that we discovered are an artifact of the flawed way in which we computed “community pathway richness [excluding the focal species].” Each diagonal line of points corresponds to different species in the same sample, and because community pathway richness is computed as the total number of pathways [across all species in the sample] minus the number of pathways in the focal species, the current Figure 5B is really plotting y against X-y for each sample (where X is a sample’s total community pathway richness, and y is the pathway richness of an individual species in that sample). This computation fails to account for the possibility that a pathway in an excluded focal species will still be present in the community due to redundancy, and indeed BQH tests for whether this redundancy is kept low in diverse communities due to mechanisms such as gene loss.
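
      The origin of the parallel lines can be illustrated with a toy calculation. This is a sketch with made-up numbers, not the HMP1-2 data, and the function name is hypothetical: whenever the horizontal coordinate is defined as X - y, every species from the same sample satisfies x + y = X and so falls on a line of slope -1 whose intercept is that sample's X.

```python
def naive_points(total_richness, focal_richnesses):
    """Return (x, y) pairs as in the flawed computation:
    x = X - y (community richness 'excluding' the focal species),
    y = pathway richness of the focal species."""
    return [(total_richness - y, y) for y in focal_richnesses]

# Hypothetical samples: total pathway count X and per-species richness values y.
samples = {
    "sample_A": (40, [5, 12, 23]),
    "sample_B": (55, [5, 20, 30]),
}

for name, (X, ys) in samples.items():
    pts = naive_points(X, ys)
    # Every point from one sample obeys x + y == X, i.e. y = -x + X.
    assert all(x + y == X for x, y in pts)
    print(name, pts)
```

      Samples with different X therefore produce lines with identical slope but different intercepts, independent of any biological relationship between the two quantities.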

      We attempted to instead plot community pathway richness defined as the number of unique pathways covered by all species other than the focal species. This is equivalent to [number of unique pathways across all species in a sample] minus the [number of pathways that are ONLY present in the focal species and not any other species in the sample]. However, when we recomputed community pathway richness this way, it is rare that a pathway is present in only one species in a sample. Moreover, we find that with the exception of E. coli, focal species pathway richness tended to be very similar across the 469 samples, often reaching an upper limit of focal species pathway richness observed. (It is unclear to what extent lower pathway richnesses are due to low species abundance/low sample coverage versus gene loss). This new plot reveals even more regularities and is difficult to interpret with respect to BQH. (Note that points are colored by species; the cluster of black dots with outlying high focal pathway richness corresponds to the “unclassified” stratum which can be considered a group of many different species.)
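
      The corrected definition and its stated equivalence can be written out as a short sketch. The species and pathway labels below are hypothetical placeholders, and the functions are illustrative rather than the code used in our analysis.

```python
def richness_excluding_focal(species_pathways, focal):
    """Number of unique pathways covered by all species other than `focal`."""
    others = set().union(
        *(paths for sp, paths in species_pathways.items() if sp != focal)
    )
    return len(others)

def richness_via_focal_only(species_pathways, focal):
    """Equivalent form: unique pathways across all species, minus pathways
    present ONLY in the focal species."""
    all_paths = set().union(*species_pathways.values())
    others = set().union(
        *(paths for sp, paths in species_pathways.items() if sp != focal)
    )
    focal_only = species_pathways[focal] - others
    return len(all_paths) - len(focal_only)

# Hypothetical sample: pathway sets detected per species.
sample = {
    "sp1": {"p1", "p2", "p3"},
    "sp2": {"p2", "p3", "p4"},
    "sp3": {"p1", "p4", "p5"},
}

for focal in sample:
    a = richness_excluding_focal(sample, focal)
    b = richness_via_focal_only(sample, focal)
    assert a == b
    print(focal, a)
```

      In this toy sample only sp3 carries a pathway found nowhere else, so excluding sp1 or sp2 leaves the community value unchanged. The same redundancy, when nearly every pathway is shared, is what makes the recomputed values vary by sample rather than by focal species.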

      Overall, because community pathway richness (excluding a focal species) seems to primarily vary with sample rather than focal species in this dataset when using the most simple/strict definition of community pathway richness as described above, it is difficult to probe the Black Queen Hypothesis using a plot like Figure 5B. As pointed out by reviewers, lack of sequencing depth to analyze strain-level diversity and accurately quantify pathway abundance, irrespective of species abundance, seems to be a major barrier to this analysis. As such, we have decided to remove Figure 5B from the paper and rewrite some of our conclusions accordingly.

      Finally, I also have some conceptual concerns regarding the genomic analysis. Namely, genes can be used for biosynthesis of e.g. building blocks, but also for consumption of nutrients. Under the Black Queen Hypothesis, we would expect the adaptive loss of biosynthetic genes, as those nutrients become provided by the community. However, for catabolic genes or pathways, I would expect the opposite pattern, i.e. the gain of catabolic genes that would allow taking advantage of a more rich environment resulting from a more diverse community (or at least, the absence of pathway loss). These two opposing forces for catabolic and biosynthetic genes/pathways might obscure the trends if all genes are pooled together for the analysis. I believe this can be easily checked with the data the authors already have, and could allow the authors to discuss more in detail the functional implications of the trends they see and possibly even make a stronger case for their claims.

      We thank the reviewer for their suggestion. As explained above, we have removed the pathway analysis from the paper due to technical reasons. However, we did investigate catabolic and biosynthetic pathways separately as suggested by the reviewer as we describe below:

      We obtained subsets of biosynthetic pathways and catabolic pathways by searching for keywords (such as “degradation” for catabolic) in the MetaCyc pathway database. After excluding the “unclassified” species stratum, we observe a total of 279 biosynthetic and 167 catabolic pathways present in the HMP1-2 pathway abundance dataset. Using the corrected definition of community pathway richness excluding a focal species, for each pathway type—either biosynthetic or catabolic—we plotted focal species pathway richness against community pathway richness including all pathways regardless of type:

      We observe the same problem where, within a sample, community pathway richness excluding the focal species hardly varies no matter which focal species it is, due to nearly all of its detected pathways being present in at least one other species; this makes the plots difficult to interpret.

      Reviewer #2 (Public Review):

      The authors re-analysed two previously published metagenomic datasets to test how diversity at the community level is associated with diversity at the strain level in the human gut microbiota. The overall idea was to test if the observed patterns would be in agreement with the "diversity begets diversity" (DBD) model, which states that more diversity creates more niches and thereby promotes further increase of diversity (here measured at the strain-level). The authors have previously shown evidence for DBD in microbiomes using a similar approach but focusing on 16S rRNA level diversity (which does not provide strain-level insights) and on microbiomes from diverse environments.

      One of the datasets analysed here is a subset of a cross-sectional cohort from the Human Microbiome Project. The other dataset comes from a single individual sampled longitudinally over 18 months. This second dataset allowed the authors to not only assess the links between different levels of diversity at single timepoints, but test if high diversity at a given timepoint is associated with increased strain-level diversity at future timepoints.

      Understanding eco-evolutionary dynamics of diversity in natural microbial communities is an important question that remains challenging to address. The paper is well-written and the detailed description of the methodological approaches and statistical analyses is exemplary. Most of the analyses carried out in this study seem to be technically sound.

      We thank the reviewer for their kind words, comments, and suggestions.

      The major limitation of this study comes with the fact that only correlations are presented, some of which are rather weak, contrast each other, or are based on a small number of data points. In addition, finding that diversity at a given taxonomic rank is associated with diversity within a given taxon is a pattern that can be explained by many different underlying processes, e.g. species-area relationships, nutrient (diet) diversity, stressor diversity, immigration rate, and niche creation by other microbes (i.e. DBD). Without experiments, it remains vague if DBD is the underlying process that acts in these communities based on the observed patterns.

      We thank the reviewer for their comments. First, regarding the issue of this being a correlative study, we now more clearly acknowledge that mechanistic studies (perhaps in experimental settings) are required to fully elucidate DBD and BQH dynamics. However, we note that our correlational study from natural communities is complementary to experimental and modeling studies, to test the extent to which their predictions hold in more complex, realistic settings. This is now mentioned throughout the manuscript, most explicitly at the end of the Introduction:

      “Although such analyses of natural diversity cannot fully control for unmeasured confounding environmental factors, they are an important complement to controlled experimental and theoretical studies which lack real-world complexity.”

      Second, to increase the number of data points analyzed in the Poyet study, we now include 15 species and four different hosts (new Figure 5). The association between community diversity and gene loss is now much more statistically robust, and consistent across the Poyet and HMP time series.

      Third, we acknowledge more clearly in the Discussion that other processes, including diet and other environmental factors can generate the DBD pattern. We also now stress more prominently the possibility that strain migration across hosts may be responsible for the patterns observed. For example, in Figure 1, we illustrate the possibility of strain migration generating the patterns we observe.

      Below we quote a paragraph that we have now added in the Discussion:

      “Second, we cannot establish causal relationships without controlled experiments. We are therefore careful to conclude that positive diversity slopes are consistent with the predictions of DBD, and negative slopes with EC, but unmeasured environmental drivers could be at play. For example, increased dietary diversity could simultaneously select for higher community diversity and also higher intra-species diversity. In our previous study, we found that positive diversity slopes persisted even after controlling for potential abiotic drivers such as pH and temperature (Madi et al., 2020), but a similar analysis was not possible here due to a lack of metadata. Neutral processes can account for several ecological patterns such as species-area relationships (Hubbell, 2001), and must be rejected in favor of niche-centric models like DBD or EC. Using neutral models without DBD or EC, we found generally flat or negative diversity slopes due to sampling processes alone and that positive slopes were hard to explain with a neutral model (Madi et al., 2020). These models were intended mainly for 16S rRNA gene sequence data, but we expect the general conclusions to extend to metagenomic data. Nevertheless, further modeling and experimental work will be required to fully exclude a neutral explanation for the diversity slopes we report in the human gut microbiome.”

      Finally, we now put more emphasis on the importance of migration (strain invasion) as a non-exclusive alternative to de novo mutation and gene gain/loss. This is mentioned in the Abstract and is also illustrated in the revised Figure 1.

      Another limitation is that the total number of reads (5 mio for the longitudinal dataset and 20 mio for the cross-sectional dataset) is low for assessing strain-level diversity in complex communities such as the human gut microbiota. This is probably the reason why the authors only looked at one species with sufficient coverage in the longitudinal dataset.

      Indeed, this is a caveat which means we can only consider sub-species diversity in relatively abundant species. Nevertheless, this allows us to study dozens of species in the HMP and 15 in the more frequent Poyet time series. As more deeply sequenced metagenomes become available, future studies will be able to access the rarer species to test whether the same patterns hold or not. This is now mentioned prominently as a caveat of our study in the second Discussion paragraph:

      “First, using metagenomic data from human microbiomes allowed us to study genetic diversity, but limited us to considering only relatively abundant species with genomes that were well-covered by short sequence reads. Deeper or more targeted sequencing may permit us to determine whether the same patterns hold for rarer members of the microbiome. However, it is notable that the majority of the dozens of species across the two datasets analyzed support DBD, suggesting that the phenomenon may generalize.”

      We also note that rarefaction was only applied to calculate community richness, not to estimate sub-species diversity. We apologize for this confusion, which is now clarified in the Methods as follows:

      “SNV and gene content variation within a focal species were ascertained only from the full dataset and not the rarefied dataset.”
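
      To make the distinction concrete, here is a minimal sketch (an illustration only, not the pipeline used in the study). The species names, read counts and rarefaction depth are hypothetical. It rarefies a per-sample species count table to a fixed depth before counting richness, whereas within-species variation would be computed from the full, unrarefied data.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a species-count table."""
    pool = [species for species, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("rarefaction depth exceeds the number of reads")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


def richness(counts):
    """Number of species observed with at least one read."""
    return sum(1 for n in counts.values() if n > 0)


# Hypothetical read counts per species for a single sample.
sample = {"species_a": 500, "species_b": 300, "species_c": 15, "species_d": 2}

# Community richness is computed on the rarefied table only.
rarefied = rarefy(sample, depth=100)
print("richness after rarefaction:", richness(rarefied))

# SNV and gene-content calls for a focal species would use `sample` as is,
# i.e. the full dataset, never `rarefied`.
```

      Because rare species can drop out when reads are subsampled, richness after rarefaction can only be equal to or lower than richness in the full table.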

      Analyzing the effect of diversity at a given timepoint on strain-level diversity at a later timepoint adds an important new dimension to this study which was not assessed in the previous study about the DBD in microbiomes by some of the authors. However, only a single species was analysed in the longitudinal dataset and comparisons of diversity were only done between two consecutive timepoints. This dataset could be further exploited to provide more insights into the prevailing patterns of diversity.

      We thank the reviewer for raising this point. We now have considered all 15 species for which there was sufficient coverage from the Poyet dataset, which included four different stool donors. Additionally, in the HMP dataset, we analyze 54 species across 154 hosts, with both datasets showing the same correlation between community diversity and gene loss.

      Additionally, we followed the suggestion of the reviewer of examining additional time lags, and in Figure 5 we do observe a dependency on time. This is now described in the Results as follows:

      “Using the Poyet dataset, we asked whether community diversity in the gut microbiome at one time point could predict polymorphism change at a future time point by fitting GAMs with the change in polymorphism rate as a function of the interaction between community diversity at the first time point and the number of days between the two time points. Shannon diversity at the earlier time point was correlated with increases in polymorphism (consistent with DBD) up to ~150 days (~4.5 months) into the future (Figure S4), but this relationship became weaker and then inverted (consistent with EC) at longer time lags (Fig 5A, Table S8, GAM, P=0.023, Chi-square test). The diversity slope is approximately flat for time lags between four and six months, which could explain why no significant relationship was found in HMP, where samples were collected every ~6 months. No relationship was observed between community richness and changes in polymorphism (Table S8, GAM, P>0.05).”

      Finally, the evidence that gene loss follows an increase in diversity is weak, as very few genes were found to be lost between two consecutive timepoints, and the analysis is based on only a single species. Moreover, while positive correlations were found between overall community diversity and gene family diversity in single species, the opposite trend was observed when focusing on pathway diversity. A more detailed analysis (of e.g. the functions of the genes and pathways lost/gained) to explain these seemingly contrasting results and a more critical discussion of the limitations of this study would be desirable.

      We agree that our previous analysis of one species in one host provided weak support for gene loss following increases in diversity. As described in the response above, we have now expanded this analysis to 15 focal species and 4 independent hosts with extensive time series. We now analyze this larger dataset and report the more statistically robust results as follows:

      “We found that community Shannon diversity predicted future gene loss in a focal species, and this effect became stronger with longer time lags (Fig 5B, Table S9, GLMM, P=0.006, LRT for the effect of the interaction between the initial Shannon diversity and time lag on the number of genes lost). The model predicts that increasing Shannon diversity from its minimum to its maximum would result in the loss of 0.075 genes from a focal species after 250 days. In other words, about one of the 15 focal species considered would be expected to lose a gene in this time frame.

      Higher Shannon diversity was also associated with fewer gene gains, and this relationship also became stronger over time (Fig 5C, Table S9, GLMM, P=1.11e-09, LRT). We found a similar relationship between community species richness and gene gains, although the relationship was slightly positive at shorter time lags (Fig 5D, Table S9, GLMM, P=3.41e-04, LRT). No significant relationship was observed between richness and gene loss (Table S9, GLMM, P>0.05). Taken together with the HMP results (Fig 4), these longer time series reveal how the sign of the diversity slope can vary over time and how community diversity is generally predictive of reduced focal species gene content.”
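
      The "about one of the 15 focal species" statement in the quoted passage follows from simple arithmetic, sketched below. The two input values are taken from the text. Treating the model prediction as an expected count per species that can be summed across species is an assumption made for illustration.

```python
def expected_total_losses(loss_per_species, n_species):
    """Expected number of gene losses summed over all focal species."""
    return loss_per_species * n_species


predicted_loss_per_species = 0.075  # genes lost per focal species after 250 days (from the text)
n_focal_species = 15                # focal species in the Poyet dataset (from the text)

total = expected_total_losses(predicted_loss_per_species, n_focal_species)
print(f"expected gene losses across all focal species: {total:.3f}")
```

      The product, 15 × 0.075 = 1.125, is roughly one gene loss across the 15 focal species.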

      As described in detail in the response to Reviewer 1 above, we found that the HUMAnN2 pathway analyses previously described suffered from technical challenges and we deemed them inconclusive. We have therefore removed the pathway results from the manuscript.

      Reviewer #3 (Public Review):

      This work provides a series of tests of hypotheses, which are not mutually exclusive, on how genomic diversity is structured within human microbiomes and how community diversity may influence the evolution of a focal species.

      Strengths:

      The paper leverages existing metagenomic data to look at many focal species at the same time to test for the importance of broad eco-evolutionary hypotheses, which is a novelty in the field.

      Thank you for the succinct summary and recognition of the strengths of our work.

      Weaknesses:

      It is not very clear if the existing metagenomic data has sufficient power to test these models.

      It is not clear, neither in the introduction nor in the analysis what precise mechanisms are expected to lead to DBD.

      The conclusion that the data support DBD appears to depend on which statistics are used to measure community diversity. Also, performing a test to reject a null neutral model would have been welcome either in the results or in the discussion.

      In our revised manuscript, we emphasize several caveats – including that we only have power to test these hypotheses in focal species with sufficient metagenomic coverage to measure sub-species diversity. We also describe more in the Introduction how the processes of competition and niche construction can lead to DBD. We also acknowledge that unmeasured abiotic drivers of both community diversity and sub-species diversity could also lead to the observed patterns. Throughout the manuscript, we attempt to describe the results and acknowledge multiple possible interpretations, including DBD and EC acting with different strengths on different species and time scales. Our previous manuscript assessing the evidence for DBD using 16S rRNA gene amplicon data from the Earth Microbiome Project (Madi et al., eLife 2020) assessed null models based on neutral ecological theory, and found it difficult to explain the observation of generally positive diversity slopes without invoking a non-neutral mechanism like DBD. While a new null model tailored to metagenomic data might provide additional nuance, we think developing one is beyond the scope of the manuscript – which is in the format of a short ‘Research Advance’ to expand on our previous eLife paper, and we expect that the general results of our previously reported null model provide a reasonable intuition for our new metagenomic analysis. This is now mentioned in the Discussion as follows:

      “In our previous study, we found that positive diversity slopes persisted even after controlling for potential abiotic drivers such as pH and temperature (Madi et al., 2020), but a similar analysis was not possible here due to a lack of metadata. Neutral processes can account for several ecological patterns such as species-area relationships (Hubbell, 2001), and must be rejected in favor of niche-centric models like DBD or EC. Using neutral models without DBD or EC, we found generally flat or negative diversity slopes due to sampling processes alone and that positive slopes were hard to explain with a neutral model (Madi et al., 2020). These models were intended mainly for 16S rRNA gene sequence data, but we expect the general conclusions to extend to metagenomic data. Nevertheless, further modeling and experimental work will be required to fully exclude a neutral explanation for the diversity slopes we report in the human gut microbiome.”

    1. Author Response:

      Reviewer #1 (Public Review):

      5. The reported data point to an important role of the premotor and parietal regions of the left as compared to the right hemisphere in the control of ipsilateral and contralateral limb movements. These are also the regions where the electrodes were primarily located in both subgroups of patients. I have 2 concerns in this respect. The first concern refers to the specific locus of these electrodes. For premotor cortex, the authors suggest PMd as well as PMv as potential sites for these bilateral representations. The other principal site refers to parietal cortex but this covers a large territory. It would help if more specific subregions for the parietal cortex could be indicated, if possible. Do the focal regions where electrodes were positioned refer to the superior vs inferior parietal cortex (anterior or posterior), or intra-parietal sulcus? Second, the manuscript's focus on the premotor-parietal complex emerges from the constraints imposed by accessible anatomical locations in the participants but does not preclude the existence of other cortical sites as well as subcortical regions and cerebellum for such bilateral representations. It is meaningful to clarify this and/or list this as a limitation of the current approach.

      On the first issue, we have updated the manuscript to specify the subregion within the parietal cortex in which we see stronger across-arm generalization - namely, the superior parietal cortex. On the second issue, we have added text in the Discussion that references subcortical areas shown to exhibit laterality differences in bimanual coordination, providing a more holistic picture of bimanual representations across the brain. In addition, we acknowledge that with our current patient population we are limited to regions with substantial electrode coverage, which does not include all areas of the brain.

      6. The evidence for bilateral encoding during unilateral movement opens perspectives for a better understanding of the control of bimanual movements which are abundant during everyday life. In the discussion, the authors refer to some imaging studies on bimanual control in order to infer whether the obtained findings may be a consequence of left hemisphere specialization for bimanual movement control, leading to speculations about the information that is being processed for each of both limb movements. Another perspective to consider is the possibility that making a movement with one limb may require postural stabilization in the trunk and contralateral body side, including a contribution from the opposite limb that is supposedly resting on the start button. Have the authors considered whether this postural mechanism could (partly) account for this bilateral encoding mechanism, in particular, because it appears more prominent during movement execution as compared to preparation? Furthermore, could the prominence of bilateral encoding during movement execution be triggered by inflow of sensory information about both limbs from the visual as well as the somatosensory systems?

      Thank you for these comments. We have added a paragraph to the Discussion to address the hypothesis that some component of ipsilateral encoding may be related to postural stabilization.

      In response to the final point in this comment, we agree that bilateral information during execution could be reflective of afferent inputs (somatosensory and/or visual). However, the encoding model shows that activity in premotor and parietal regions is well predicted based on kinematics during the task. While visual and somatosensory system information are likely integrated in these areas, the kinematic encoding would point to a more movement-based representation.

      Reviewer #2 (Public Review):

      Weaknesses: 1. Although the current human ECoG data set is valuable, there is still large variability in electrode coverage across the patients (I fully acknowledge the difficulty). This makes statistical assessment a bit tricky. The potential factors of interest in the current study would be Electrode (=Region), Subject, Hemisphere, and their interactions. The tricky part is that Electrode is nested within Subject, and Subject is nested within Hemisphere. Permutation-based ANOVA used for the current paper requires proper treatment of these nested factors when making permutations (Anderson and Braak, 2003). With this regard, sufficient details about how the authors treated each factor, for instance, in each pbANOVA, are not provided in the current version of the manuscript. Similarly, the scope of statistical generalizability, whether the inference is within-sample or population-level, for the claims (e.g., statement about the hemispheric or regional difference) needs to be clarified.

      We discuss at length the issue of electrode variability and have addressed this in the revised manuscript. Graphically, we have added a Supplemental Figure (S2). Statistically, we appreciate the point about the need for the analysis to address the nested structure of the data. We have redone all of the statistics, now using a permutation-based linear mixed effects model with a random effect of patient. This approach did not change any of the findings.
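
      To illustrate why the nested structure matters when building a permutation null, here is a deliberately simplified sketch. It is not the permutation-based linear mixed effects model described above. It assumes that each patient contributes electrodes from a single hemisphere, summarizes each patient by the mean across their electrodes, and shuffles hemisphere labels across patients rather than across electrodes. All patient identifiers and values are hypothetical.

```python
import random
from statistics import mean


def hemisphere_difference(patient_values, hemisphere_of):
    """Left-minus-right difference of patient-level means."""
    left = [mean(v) for p, v in patient_values.items() if hemisphere_of[p] == "L"]
    right = [mean(v) for p, v in patient_values.items() if hemisphere_of[p] == "R"]
    return mean(left) - mean(right)


def patient_level_permutation_test(patient_values, hemisphere_of, n_perm=2000, seed=0):
    """Two-sided permutation p-value, shuffling hemisphere labels across patients."""
    rng = random.Random(seed)
    patients = sorted(patient_values)
    labels = [hemisphere_of[p] for p in patients]
    observed = hemisphere_difference(patient_values, hemisphere_of)
    extreme = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # electrodes stay attached to their patient
        permuted = dict(zip(patients, shuffled))
        if abs(hemisphere_difference(patient_values, permuted)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical per-electrode scores (e.g. ipsilateral encoding strength) per patient.
values = {
    "p1": [0.30, 0.35, 0.28], "p2": [0.32, 0.31], "p3": [0.29, 0.33, 0.36],
    "p4": [0.10, 0.12], "p5": [0.08, 0.15, 0.11], "p6": [0.09, 0.13],
}
hemi = {"p1": "L", "p2": "L", "p3": "L", "p4": "R", "p5": "R", "p6": "R"}

observed, p_value = patient_level_permutation_test(values, hemi)
print("observed difference:", round(observed, 3), "p-value:", round(p_value, 3))
```

      Shuffling at the patient level keeps all electrodes from one patient together, so the null distribution reflects between-patient variability instead of treating electrodes as independent observations.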

      As to the comment about hemispheric or regional differences, the data show that both are important factors. Our hemispheric effect is characterized by stronger ipsilateral encoding in the left hemisphere and subsequently better across-arm generalization (Figures 2-4). We then examine the spatial distribution of electrodes that generalized well or poorly and found clusters in both hemispheres of electrodes that generalize poorly. In contrast, only in the left hemisphere did we find clusters of electrodes that generalize well. These electrodes were localized to PMd, PMv and superior parietal cortex (Fig 5D). In summary, we argue that activity patterns in M1 are similar in the left and right hemispheres, but there is a marked asymmetry for activity patterns over premotor and parietal cortices.

      Additional contexts that would help readers interpret or understand the significance of the work: The greater amount of shared movement representation in the left hemisphere may imply the greater reliance of the left arm on the left hemisphere. This may, in turn, lead to the greater influence of the ongoing right arm motion on the left arm movement control during the bimanual coordination. Indeed, this point is addressed by the authors in the Discussion (page 15, lines 26-41). One critical piece of literature missing in this context is the work done by Yokoi, Hirashima, and Nozaki (2014). In the experiments using the bimanual reaching task, they in fact found that the learning by the left arm is to the greater degree influenced by the concurrent motion of the right arm than vice versa (Yokoi et al., J Neurosci, 2014). Together with Diedrichsen et al. (2013), this study will strengthen the authors' discussion and help readers interpret the present result of left hemisphere dominance in the context of more skillful bimanual action.

      The Yokoi paper is a very important paper in revealing hemispheric asymmetries during skilled bimanual movements. However, we think it is problematic to link the hemispheric asymmetries we observe to the behavioral effects reported in the Yokoi paper (namely, that the nondominant, left arm was more strongly influenced by the kinematics of the right arm). One could hypothesize that the left hemisphere, given its representation of both arms, could be controlling both arms in some sort of direct way (and thus the action of the right arm will have an influence on left arm movement given the engagement of the same neural regions for both movements). It is also possible that the left hemisphere is receiving information about the state of both the right and left arms, and this underlies the behavioral asymmetry reported in Yokoi.

      Reviewer #3 (Public Review):

      In the present work, Merrick et al. analyzed ECoG recordings from patients performing out-and-back reaching movements. The authors trained a linear model to map kinematic features (e.g., hand speed, target position) to high frequency ECoG activity (HFA) of each electrode. The two primary findings were: 1) encoding strength (as assessed by held-out R2 values) of ipsilateral and contralateral movements was more bilateral in the left hemisphere than in the right and 2) across-arm generalization was stronger in the left hemisphere than in the right. As the authors point out in the Introduction, there are known 'asymmetries between the two hemispheres in terms of praxis', so it may not be surprising to find asymmetries in the kinematic encoding of the two hemispheres (i.e., the left hemisphere contributes 'more equally' to movements on either side of the body than the right hemisphere).

      There is one point that I feel must be addressed before the present conclusions can be reached and a second clarification that I feel will greatly improve the interpretability of the results.

      First, as is often the case when working with patients, the authors have no control over the recording sites. This led to some asymmetries in both the number of electrodes in each hemisphere (as the authors note in the Discussion) and (more importantly) in the location of the recording electrodes. Recording site within a hemisphere must be controlled for before any comparisons between the hemispheres can be made. For example, the authors note that 'the contralateral bias becomes weaker the further the electrodes are from putative motor cortex'. If there happen to be more electrodes placed further from M1 in the left hemisphere (as Supplementary Figure 1 seems to suggest), then we cannot know whether the results of Figures 2 and 3 are due to the left hemisphere having stronger bilateral encoding or simply more electrodes placed further from M1.

      The reviewer makes a very valid point and this comment has led to our inclusion of a new Supplementary Figure, S2, in which we quantify the percentage of electrodes in each subregion.

      Second, it would be useful if the authors provided a bit of clarification about what type of kinematic information the linear model is using to predict HFA. I believe the paragraph titled 'Target modulation and tuning similarity across arms' suggests that there is very little across-target variance in the HFA signal. Does this imply that the model is primarily ignoring the Phi and Theta (as well as their lagged counterparts) and is instead relying on the position and speed terms? How likely is it that the majority of the HFA activity around movement onset reflects a condition-invariant 'trigger signal' (Kaufman et al., 2016)? This trigger signal accounts for the largest portion of neural variance around movement onset (by far), and the weights of individual neurons in trigger signal dimensions tend to be positive, which means that this signal will be strongly reflected in population activity (as measured by ECoG). This interpretation does not detract from the present results in any way, but it may serve to clarify them.

      To address this comment, we have added a new figure (Fig 6) which shows the relative contribution of each kinematic feature as well as their average weights across time for both contralateral and ipsilateral movements. This figure also addresses the reviewer’s question about the contribution of the target position to the model. As can be seen, features that reflect timing/movement initiation (position, speed) make a larger contribution compared to the two features which capture directional tuning (theta, phi). As the reviewer suggested, this result is in line with Kaufman et al. (2016), which reported that a condition-invariant ‘trigger signal’ comprises the largest component of neural activity. We note that the target-dependent features theta and phi still make a substantial contribution to the model (relative contribution: contra = 32%, ipsi = 37%). Previously, we have tested the contribution of the theta and phi features by comparing two models, one that only used position and speed (Movement Model) and one that also included the two angular components phi and theta (Target Model). For a subset of electrodes, the held-out predictions were significantly better using the Target Model, a result we take as further evidence of electrode tuning within our dataset.
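
      The comparison of held-out predictions between the two models can be sketched as follows. This is only an illustration of the metric: the held-out HFA trace and both sets of predictions are made-up numbers, and the model fitting itself is not shown.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out HFA samples for one electrode.
held_out_hfa = [0.1, 0.4, 0.9, 1.3, 0.8, 0.3, 0.0]

# Hypothetical predictions from a model using position and speed only ...
movement_pred = [0.2, 0.5, 0.7, 1.0, 0.9, 0.5, 0.2]
# ... and from a model that also includes the angular features theta and phi.
target_pred = [0.1, 0.45, 0.85, 1.2, 0.85, 0.35, 0.05]

r2_movement = r_squared(held_out_hfa, movement_pred)
r2_target = r_squared(held_out_hfa, target_pred)
print("Movement Model held-out R^2:", round(r2_movement, 3))
print("Target Model held-out R^2:  ", round(r2_target, 3))
```

      An electrode for which the second value is reliably higher on held-out data would be one where the angular features carry predictive information beyond position and speed.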

      The figure below shows an electrode located in M1 that is tuned to targets when the patient reached with their contralateral arm as an example. We believe that having an explicit depiction of how the four features contribute to the HFA predictions will help the reader evaluate the model. These points are now addressed in the text in the results section discussing Figure 6.

    1. Author Response

      Reviewer #1 (Public Review):

      [...] Recently, pupil dilation was linked to cholinergic and noradrenergic neuromodulation as well as cortical state dynamics in animal research. This work adds substantially to this growing research field by revealing the temporal and spatial dynamics of pupil-linked changes in cortical state in a large sample of human participants.

      The analyses are thorough and well conducted, but some questions remain, especially concerning unbiased ways to account for the temporal lag between neural and pupil changes. Moreover, it should be stressed that the provided evidence is of indirect nature (i.e., resting state pupil dilation as proxy of neuromodulation, with multiple neuromodulatory systems influencing the measure), and the behavioral relevance of the findings cannot be shown in the current study.

      Thank you for your positive feedback and constructive suggestions. We are especially grateful for the numerous pointers to other work relevant to our study.

      1. Concerning the temporal lag: The authors uniformly shift pupil data (but not pupil derivative) in time for their source-space analyses (see above). However, the evidence for the chosen temporal lags (930 ms and 0 ms) is not that firm. For instance, in the cited study by Reimer and colleagues [1], cholinergic activation shows a temporal lag of ~ 0.5 s with regard to pupil dilation - and the authors would like to relate pupil time series primarily to acetylcholine. Moreover, Joshi and colleagues [2] demonstrated that locus coeruleus spikes precede changes in the first derivative of pupil dilation by about 300 ms (and not 0 ms). Finally, in a recent study recording intracranial EEG activity in humans [3], pupil dilation lagged behind neural events with a delay between ~0.5-1.7s. Together, this questions the chosen temporal lags.

      More importantly, Figures 3 and S3 demonstrate variable lags for different frequency bands (also evident for the pupil derivative), which are disregarded in the current source-space analyses. This biases the subsequent analyses. For instance, Figure S3 B shows the strongest correlation effect (Z~5), a negative association between pupil and the alpha-beta band. However, this effect is not evident in the corresponding source analyses (Figure S5), presumably due to the chosen zero-time-lag (the negative association peaked at ~900 ms).

      As the conducted cross-correlations provided direct evidence for the lags for each frequency band, using these for subsequent analyses seems less biased.

      This is an important point and we gladly take the opportunity to clarify this in detail. In essence, choosing one particular lag over others was a decision we took to address the multi-dimensional issue of presenting our results (spectral, spatial and time dimensions) and fix one parameter for the spatial description (see e.g. Figure 4). It is worth pointing out first that our analyses were all based on spectral decompositions that necessarily have limited temporal resolutions. Therefore, any given lag represents the center of a band that we can reasonably attribute to a time range. In fact, Figure 3C shows how spread out the effects are. It also shows that the peaks (troughs) of low and high frequency ranges align with our chosen lag quite well, while effects in the mid-frequency range are not “optimally” captured.

      As picking lags based on maximum effects may be seen as double dipping, we note that we chose 0.93 sec a priori based on the existing literature, and most prominently based on the canonical impulse response of the pupil to arousing stimuli that is known to peak at that latency on average (Hoeks & Levelt, 1993; Wierda et al., 2012; also see Burlingham et al., 2021). This lag further agrees with the results of reference [3] cited by the reviewer as it falls within that time range, and with Reimer et al.’s finding (cited as [1] above), as well as Breton-Provencher et al. (2019) who report a lag of ~900 ms (see their Supplementary Figure S8) between noradrenergic LC activation and pupil dilation. Finally, note that it was not our aim to relate pupil dilations to either ACh or NE in particular as we cannot make this distinction based on our data alone. Instead, we point out and discuss the similarities of our findings with time lags that have been reported for either neurotransmitter before.

      With respect to using different lags, changing the lag to 0 or 500 msec is unlikely to alter the reported effects qualitatively for low- and high frequency ranges (see Figure 3C), as both the pupil time series as well as fluctuations in power are dominated by very slow fluctuations (<< 1 Hz). As a consequence, shifting the signal by 500 msec has very little impact. For comparison, below we provide the reviewer with the results presented in Figure 4 but computed based on zero (Figure R1) and 500-msec (Figure R2) lags. While there are small quantitative differences, qualitatively the results remain mostly identical irrespective of the chosen lag.

      Figure R1. Figure equivalent to main Figure 4, but without shifting the pupil.

      In sum, choosing one common lag a priori (as we did here) does not necessarily impose more of a bias on the presentation of the results than choosing them post-hoc based on the peaks in the cross-correlograms. However, we have taken this point as a motivation to revise the Results and Methods sections where applicable to strengthen the rationale behind our choice. Most importantly, we changed the first paragraph that mentions and justifies the shift as follows, because the original wording may have given the false impression that the cross-correlation results influenced lag choice:

      “Based on previous reports (Hoeks & Levelt, 1993; Joshi et al., 2016; Reimer et al., 2016), we shifted the pupil signal 930 ms forward (with respect to the MEG signal). We introduced this shift to compensate for the lag that had previously been observed between external manipulations of arousal (Hoeks & Levelt, 1993) as well as spontaneous noradrenergic activity (Reimer et al., 2016) and changes in pupil diameter. In our data, this shift also aligned with the lags for low- and high-frequency extrema in the cross-correlation analysis (Figure 3B).”
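
      The effect of such a shift can be sketched with a lagged correlation on synthetic data. This is an illustration only: the 100 Hz sampling rate, the signal shapes and the function names are assumptions, and the "pupil" trace is simply a copy of the "neural" trace delayed by 930 ms. Positive lags pair each neural sample with a later pupil sample, which is one way to read shifting the pupil signal with respect to the MEG signal.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_correlation(pupil, neural, lag_samples):
    """Correlate neural[t] with pupil[t + lag_samples]."""
    if lag_samples > 0:
        return pearson(neural[:-lag_samples], pupil[lag_samples:])
    if lag_samples < 0:
        return pearson(neural[-lag_samples:], pupil[:lag_samples])
    return pearson(neural, pupil)


fs = 100                      # assumed sampling rate in Hz
lag = round(0.93 * fs)        # 930 ms expressed in samples
n = 2000

# Slow synthetic "neural" fluctuation (well below 1 Hz).
neural = [math.sin(2 * math.pi * 0.1 * t / fs) + 0.5 * math.sin(2 * math.pi * 0.23 * t / fs)
          for t in range(n)]
# Synthetic "pupil": the same fluctuation delayed by 930 ms (zeros before onset).
pupil = [0.0] * lag + neural[: n - lag]

for lag_ms in (0, 500, 930):
    k = round(lag_ms / 1000 * fs)
    print(lag_ms, "ms:", round(lagged_correlation(pupil, neural, k), 3))
```

      In this toy case the correlation is highest at the true delay, but because the fluctuations are slow, it declines only gradually at other lags, which mirrors the point made above about shifts of a few hundred milliseconds.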

      Figure R2. Figure equivalent to main Figure 4, but with shifting the pupil with respect to the MEG by 500 ms.

      Related to this aspect: For some parts of the analyses, the pupil time series was shifted with regard to the MEG data (e.g., Figure 4). However, for subsequent analyses pupil and MEG data were analyzed in concurrent 2 s time windows (e.g., Figure 5 and 6), without a preceding shift in time. This complicates comparisons of the results across analyses and the reasoning behind this should be discussed.

      The signal has been shifted for all analyses that relate to pupil diameter (but not pupil derivative). We have added versions of the following statement in the respective Results and Methods sections to clarify (example from Results section ‘Nonlinear relations between pupil-linked arousal and band-limited cortical activity’):

      “In keeping with previous analyses, we shifted the pupil time series forward by 930 msec, while applying no shift to the pupil derivative.”

      1. The authors refer to simultaneous fMRI-pupil studies in their background section. However, throughout the manuscript, they do not mention recent work linking (task-related) changes in pupil dilation and neural oscillations (e.g., [4-6]) which does seem relevant here, too. This seems especially warranted, as these findings in part appear to disagree with the here-reported observations. For instance, these studies consistently show negative pupil-alpha associations (while the authors mostly show positive associations). Moreover, one of these studies tested for links between pupil dilation and aperiodic EEG activity but did not find a reliable association (again conflicting with the here-reported data). Discussing potential differences between studies could strengthen the manuscript.

      We have added a discussion of the suggested works to our Discussion section. We point out however that a recent study (Podvalny et al., https://doi.org/10.7554/eLife.68265) corroborates our finding while measuring resting-state pupil and MEG simultaneously in a situation very similar to ours. Also, we note that Whitmarsh et al. (2021) (reference [6]) is actually in line with our findings as we find a similar negative relationship between alpha-range activity in somatomotor cortices and pupil size.

      Please also take into account that results from studies of task- or event-related changes in pupil diameter (phasic responses) cannot be straightforwardly compared with the findings reported here (focusing on fluctuations in tonic pupil size), due to the inverse relationship between tonic (or baseline) and phasic pupil response (e.g. Knapen et al., 2016). This means that on trials with larger baseline pupil diameter, phasic pupil dilation will be smaller and vice versa. Hence, a negative relation between the evoked change in pupil diameter and alpha-band power can very well be consistent with the positive correlation between tonic pupil diameter and alpha-band activity that we report here for visual cortex.

      In section ‘Arousal modulates cortical activity across space, time and frequencies’ we have added:

      “Seemingly contradicting the present findings, previous work on task-related EEG and MEG dynamics reported a negative relationship between pupil-linked arousal and alpha-range activity in occipito-parietal sensors during visual processing (Meindertsma et al., 2017) and fear conditioning (Dahl et al., 2020). Note however that results from task-related experiments, that focus on evoked changes in pupil diameter rather than fluctuations in tonic pupil size, cannot be directly compared with our findings. Similar to noradrenergic neurons in locus coeruleus (Aston-Jones & Cohen, 2005), phasic pupil responses exhibit an inverse relationship with tonic pupil size (Knapen et al., 2016). This means that on trials with larger baseline pupil diameter (e.g. during a pre-stimulus period), the evoked (phasic) pupil response will be smaller and vice versa. As a consequence, a negative correlation between alpha-band activity in the visual cortex and task-related phasic pupil responses does not preclude a positive correlation with tonic pupil size during baseline or rest as reported here. In line with this, Whitmarsh et al., 2021 found a negative relationship between alpha-activity and pupil size in the somatosensory cortex that agrees with our finding. Although using an event-related design to study attention to tactile stimuli, this relationship occurred in the baseline, i.e. before observing any task-related phasic effects on pupil-linked arousal or cortical activity.”

      In section ‘Arousal modulation of cortical excitation-inhibition ratio’ we have added: “The absence of this effect in visual cortices may explain why Kosciessa et al. (2021) found no relationship between pupil-linked arousal and spectral slope when investigating phasic pupil dilation in response to a stimulus during visual task performance. However, this behavioral context, associated with different arousal levels, likely also changes E/I in the visual cortex when compared with the resting state (Pfeffer et al., 2018).”

      Finally, in the Conclusion we added (note: ‘they’ = the present results): “Further, they largely agree with similar findings of a recent independent report (Podvalny et al., 2021).”

      Related to this aspect: The authors frequently relate their findings to recent work in rodents. For this it would be good to consider species differences when comparing frequency bands across rodents and primates (cf. [7,8]).

      Throughout our Results section we have mainly remained agnostic with respect to labeling frequency ranges when drawing between-species comparisons, and have only reverted to it as a justification for a dimension reduction for some of the presented analysis. Following your comment however, we have phrased the following section in the Discussion, section ‘Arousal modulates cortical activity across space, time and frequencies’, more carefully:

      “The low-frequency regime referred to in rodent work (2–10 Hz; e.g., McGinley et al., 2015) includes activity that shares characteristics with human alpha rhythms (3–6 Hz; Nestogel and McCormick, 2021; Senzai et al. 2019). The human equivalent however clearly separates from activity in lower frequency bands and, here, showed idiosyncratic relationships with pupil-linked arousal.”

      1. Figure 1 highlights direct neuromodulatory effects in the cortex. However, seminal [9-11] and more recent work [12,13] demonstrates that noradrenaline and acetylcholine also act in the thalamus which seems relevant concerning the interpretation of low frequency effects observed here. Moreover, neural oscillations also influence neuromodulatory activity, thus the one-headed arrows do not seem warranted (panel C) [3,14].

      This is a very good point. First, we would like to note that we have extended on acknowledging thalamic contributions to low-frequency (specifically alpha) effects in response to the Reviewer’s point 11 (‘Recommendations for authors’ section below). Also, we have added a reference to the role of potential top-down (reverse) influences to our Discussion, section ‘Arousal modulates cortical activity across space, time and frequencies’, as follows:

      “Further, we note that our analyses and interpretations focus on arousal-related neuromodulatory influences on cortical activity, whereas recent work also supports a reverse “top-down” route, at least for frontal cortex high-frequency activity on LC spiking activity (Totah et al., 2021).”

      Ultimately, however, we decided to leave the arrows in Figure 1C uni-directional to keep in line with the rationale of our research that stems mostly from rodent work, which also emphasises the indicated directionality. Also, reference [3] is highly interesting for us because it actually aligns with our data: The authors show that a spontaneous peak of high-frequency band activity (>70 Hz) in insular cortex precedes a pupil dilation peak (or plateau) in two of three participants by ~500msec (which mimics a pattern found for task-evoked activity; see their Figure 5b/c). We find a maximum in our cross-correlation between pupil size and high frequency band activity (>64 Hz) that indicates a similar lag (see our Figure 3B). Importantly, both results do not rule out a common source of neuromodulation for the effects. We have added the following to the end of the section ‘An arousal-triggered cascade of activity in the resting human brain’:

      “In fact, Kucyi & Parvizi (2020) found spontaneous peaks of high-frequency band activity (>70 Hz) in the insular cortex of three resting surgically implanted patients that preceded pupil dilation by ~500msec - a time range that is consistent with the lag of our cross-correlation between pupil size and high frequency (>64Hz) activity (see Figure 3B). Importantly, they showed that this sequence mimicked a similar but more pronounced pattern during task performance. Given the purported role of the insula (Menon & Uddin, 2015), this finding lends support to the idea that spontaneous covariations of pupil size and cortical activity signal arousal events related to intermittent 'monitoring sweeps' for behaviourally relevant information.”
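
      For readers unfamiliar with how such a lag is read off a cross-correlation, the following is a minimal sketch on synthetic data. It is not the pipeline used in the manuscript: the signals are white noise, the 10 Hz sampling rate is a hypothetical value chosen so that 5 samples correspond to 500 ms, and all function names are illustrative.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]; lag > 0 means x leads y."""
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    return pearson(a, b)


def best_lag(x, y, max_lag):
    """Lag (in samples) at which the cross-correlation is maximal."""
    return max(range(-max_lag, max_lag + 1), key=lambda k: lagged_corr(x, y, k))


# Synthetic example: "pupil" is a noisy copy of "cortical" delayed by 5 samples.
rng = random.Random(1)
n, true_lag = 500, 5
cortical = [rng.gauss(0, 1) for _ in range(n)]
pupil = [
    (cortical[t - true_lag] if t >= true_lag else 0.0) + rng.gauss(0, 0.3)
    for t in range(n)
]

fs_hz = 10.0  # hypothetical sampling rate
lag = best_lag(cortical, pupil, max_lag=20)
lag_ms = 1000.0 * lag / fs_hz
print(f"cortical activity leads pupil by {lag} samples ({lag_ms:.0f} ms)")
```

      A positive lag here means the first signal precedes the second. As noted above, such a lead does not by itself rule out a common source driving both signals.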

      1. In their discussion, the authors propose a pupil-linked temporal cascade of cognitive processes and accompanying power changes. This argument could be strengthened by showing that earlier events in the cascade can predict subsequent ones (e.g., are the earlier low and high frequency effects predictive of the subsequent alpha-beta synchronization?).

      We added this cascade angle as one possible interpretation of the observed effects. We fully agree that this is an interesting question but would argue that it would ideally be tested in follow-up research specifically designed for that purpose. The suggested analysis would add a post-hoc aspect to our exploratory investigation in the absence of a suitable contrast, while also potentially side-tracking the main aim of the study. We have revised the language in this section and added the following changes (bold) to the last paragraph to emphasise the speculative aspect and to clarify what we think needs to be done to look into this further and with more explanatory power.

      “The three scenarios described here are not mutually exclusive and may explain one and the same phenomenon from different perspectives. Further, it remains possible that the sequence we observe comprises independent effects with specific timings. A pivotal manipulation to test these assumptions will be to contrast the observed sequence with other potential coupling patterns between pupil-linked arousal and cortical activity during different behavioural states.”

    1. Author Response

      Reviewer #1 (Public Review):

      This thorough study expands our understanding of BMP signaling, a conserved developmental pathway involved in processes as diverse as body patterning and neurogenesis. The authors applied multiple state-of-the-art strategies to the anthozoan Nematostella vectensis in order to first identify the direct BMP signaling targets - bound by the activated pSMAD1/5 protein - and then dissect the role of a novel pSMAD1/5 gradient modulator, zswim4-6. The list of target genes features multiple developmental regulators, many of which are bilaterally expressed, and which are notably shared between Drosophila and Xenopus. The analysis identified in particular zswim4-6, a novel nuclear modulator of the BMP pathway that is also conserved in vertebrates. A combination of loss-of-function (injection of antisense morpholino oligonucleotide, CRISPR/Cas9 knockout, expression of dominant negative) and gain-of-function assays, together with transcriptome sequencing, showed that zswim4-6 acts as a transcriptional repressor of BMP signaling. Functional manipulation of zswim5 in zebrafish shows a conserved role in modulating BMP signaling in a vertebrate.

      The particular strength of the study lies in the careful and thorough analysis performed. This is solid developmental work, where one clear biological question is progressively dissected, with the most appropriate tools. The functional results are further validated by alternative approaches. Data is clearly presented and methods are detailed. I have a couple of comments.

      1) I was intrigued - as the authors - by the fact that the ChIP-Seq did not identify any known BMP ligand bound by pSMAD1/5. Are these genes found in the published ChIP-Seq data of the other species used for the comparative analysis? One hypothesis could be that there is a change in the regulatory interactions and that the initial set-up of the gradient requires indeed a feedback loop, which is then turned off at later gastrula. In this case, immunoprecipitation at early gastrula, prior to the set-up of the pSMAD1/5 gradient, could reveal a different scenario. Alternately, the regulation could be indirect, for example, through RGM, an additional regulator of BMP signaling expressed on the side of lower BMP activity, which is among the targets of the ChIP-Seq. This aspect could be discussed. Additionally, even if this is perhaps outside the scope of this study, I think it would be informative to further assess the effect of ZSWIM manipulation on RGM (and vice versa).

      Indeed, BMP genes are direct BMP signaling targets in Drosophila (dpp) (Deignan et al., 2016, https://doi.org/10.1371/journal.pgen.1006164) and frog (bmp2, bmp4, bmp5, bmp7) (Stevens et al., 2021, https://doi.org/10.1242/dev.145789). Of all these ligands, only the dorsally expressed Xenopus bmp2 is repressed by BMP signaling, while another dorsally expressed Xenopus BMP gene admp is not among the direct targets. All other BMP genes listed here are expressed in the pMad/pSMAD1/5/8-positive domain and are activated by BMP signaling.

      In Nematostella, we do not find BMP genes among the ChIP-Seq targets, but this is not that surprising considering the dynamics of the bmp2/4, bmp5-8 and chordin expression, as well as the location of the pSMAD1/5-positive cells. In late gastrulae/early planulae, Chordin appears to be shuttling BMP2/4 and BMP5-8 away from their production source and over to the gdf5-like side of the directive axis (Genikhovich et al., 2015; Leclere and Rentsch, 2014). By 4 dpf, chordin expression stops, and BMP2/4 and BMP5-8 start to be both expressed AND signal in the mesenteries. If bmp2/4 and bmp5-8 expression were directly suppressed by pSMAD1/5 (as is the case for chordin or rgm expression), this mesenterial expression would not be possible. Therefore, in our opinion, it is most likely that at late gastrula and early planula the regulation of bmp2/4 and bmp5-8 expression by BMP signaling is indirect. We do not have an explanation for why gdf5-like (another BMP gene expressed on the “high pSMAD1/5” side) is not retrieved as a direct BMP target in our ChIP data. Since we do not understand well enough how BMP gene expression is regulated, we do not discuss this at length in the manuscript.

      As the Reviewer suggested, we analyzed the effect of ZSWIM4-6 KD on the expression of rgm. As expected for a gene expressed on the “low BMP side”, its expression was strongly expanded (Figure 6 - Figure Supplement 4).

      2) I do not fully understand the rationale behind the choice of performing the comparative assays in zebrafish: as the conservation was initially identified in Xenopus, I would have expected the experiment to be performed in frog. Furthermore, reading the phylogeny (Figure 4A), it is not obvious to me why ZSWIM5 was chosen for the assay (over the other paralog ZSWIM6). Could the Authors comment on this experiment further?

      The comparison was done in zebrafish because we were planning to generate zswim5 mutants, whose analysis is currently in progress. ZSWIM6 is not expressed at the developmental stages we were interested in, while ZSWIM5 was, based on available zebrafish expression data (White et al., 2017).

      Reviewer #2 (Public Review):

      The authors provide a nice resource of putative direct BMP target genes in Nematostella vectensis by performing ChIP-seq with an anti-pSmad1/5 antibody, while also performing bulk RNA-seq with BMP2/4 or GDF5 knockdown embryos. Genes that exhibit pSmad1/5 binding and have changes in transcription levels after BMP signaling loss were further annotated to identify those with conserved BMP response elements (BREs). Further characterization of one of the direct BMP target genes (zswim4-6) was performed by examining how expression changed following BMP receptor or ligand loss of function, as well as how loss or gain of function of zswim4-6 affected development and BMP signaling. The authors concluded that zswim4-6 modulates BMP signaling activity and likely acts as a pSMAD1/5 dependent co-repressor. However, the mechanism by which zswim4-6 affects the BMP gradient or interacts with pSMAD1/5 to repress target genes is not clear. The authors test the activity of a zswim4-6 homologue in zebrafish (zswim5) by over-expressing mRNA and find that pSMAD1/5/9 labeling is reduced and that embryos have a phenotype suggesting loss of BMP signaling, and conclude that zswim4-6 is a conserved regulator of BMP signaling. This conclusion needs further support to confirm BMP loss of function phenotypes in zswim5 over-expression embryos.

      Major comments

      1) The BMP direct target comparison was performed between Nematostella, Drosophila, and Xenopus, but not with existing data from zebrafish (Greenfeld 2021, Plos Biol). Given the functional analysis with zebrafish later in the paper, it would be nice to see if there are conserved direct target genes in zebrafish and, in particular, whether zswim5 (or other zswim genes) is a direct target. Since conservation of zswim4-6 as a direct BMP target between Nematostella and Xenopus seemed to be part of the rationale for further functional analysis, it would also be nice to know if this is a conserved target in zebrafish.

      Thank you for the suggestion. In the paper by Greenfeld et al., 2021, zebrafish zswim5 was downregulated approximately 2.4x in the bmp7 mutant at 6 hpf, while zswim6 was barely expressed and not affected at this stage. We added this information to the text of the manuscript. Expression of several other zebrafish zswim genes was also affected in the bmp7 mutant, but these genes do not appear relevant for our study since their corresponding orthologs are not identified as pSMAD1/5 ChIP-Seq targets in Nematostella. Notably, zebrafish zswim5 is not clearly differentially expressed in BMP or Chd overexpression conditions (see Supplementary file 1 in Rogers et al. 2020). Importantly, in the paper we wanted to compare ChIP-Seq data with ChIP-Seq data; unfortunately, no ChIP-Seq data for pSMAD1/5/8 is currently available for zebrafish, thus precluding comparisons.

      Related to this, in the discussion it is mentioned that zswim4/6 is also a direct BMP target in mouse hair follicle cells, but it wasn't obvious from looking at the supplemental data in that paper where this was drawn from.

      Please see Supplementary Table 1, second Excel sheet labeled “Mx ChIP_Seq” in Genander et al., 2014, https://doi.org/10.1016/j.stem.2014.09.009. Zswim4 has a single pSMAD1 peak associated with it, Zswim6 has two.

      2) The loss of zswim4-6 function via MO injection results in changes to pSmad1/5 staining, including a reduction in intensity in the endoderm and gain of intensity in the ectoderm, while over-expression results in a loss of intensity in the ectoderm and no apparent change in the endoderm. While this is interesting, it is not clear how zswim4-6 is functioning to modify BMP signaling, and how this might explain differential effects in ectoderm vs. endoderm. Is the assumption that the mechanism involves repression of chordin? And if so one could test the double knockdown of zswim4-6 and chordin and look for the rescue of pSmad1/5 levels or morphological phenotype.

      We do not think that the mechanism of the ZSWIM4-6 action is via repression of Chordin. As loss of chordin leads to the loss of pSMAD1/5 in Nematostella (Genikhovich et al., 2015), the proposed experiment is, unfortunately, not feasible to test this hypothesis. Currently, we see two distinct effects of the modulation of zswim4-6 expression. First, it affects the pSMAD1/5 gradient, possibly by destabilizing nuclear SMAD1/5, as has been proposed by Wang et al., 2022 for the vertebrate Zswim4. This is in line with our results shown in Fig. 6C-F’ and Fig. 6-Figure supplement 3. In our opinion, the reaction of the genes expressed on the “high BMP” side of the directive axis to the overexpression or KD of ZSWIM4-6 (Fig. 6I-K’, 6N-P’) can be explained by these changes in the pSMAD1/5 signaling intensity. Secondly, zswim4-6 appears to promote pSMAD1/5-mediated gene repression. This is in line with the reaction of the genes expressed on the “low BMP” side of the directive axis (Fig. 6G-H’, 6L-M’, Fig. 6-Figure Supplement 4). These genes are repressed by BMP signaling, but they expand their expression upon zswim4-6 KD in spite of the increased pSMAD1/5. Our ChIP experiment (Fig. 6Q) supports this view.

      3) Several experiments are done to determine how zswim4-6 expression responds to the loss of function of different BMP ligands and receptors, with the conclusion being that zswim4-6 is a BMP2/4 target but not a GDF5 target, with a lot of the discussion dedicated to this as well. However, the authors show a binary response to the loss of BMP2/4 function, where zswim4-6 is expressed normally until pSmad1/5 levels drop low enough, at which point expression is lost. Since the authors also show that GDF5 morphants do not have as strong a reduction in pSmad1/5 levels compared to BMP2/4 morphants, perhaps GDF5 plays a positive but redundant role in zswim4-6 expression. To test this possibility the authors could inject suboptimal doses of BMP2/4 MO with GDF5 MO and look for synergy in the loss of zswim4-6 expression.

      Thanks for this great suggestion! We performed this experiment (Fig. 5H’’-L) and indeed, a suboptimal dose of BMP2/4MO + GDF5lMO results in a complete radialization of the embryo and abolishes zswim4-6 expression, similar to the effect of a high dose of BMP2/4MO. This result suggests that, rather than reflecting a ligand-specific signaling function, GDF5-like signaling alone still provides sufficiently high pSmad1/5 levels to activate zswim4-6 expression to apparent wildtype levels, demonstrating the sensitivity of this gene to even very low amounts of BMP signaling.

      4) The zswim4-6 morphant embryos show increased expression of zswim4-6 mRNA, which is said to indicate that zswim4-6 negatively regulates its own expression. However in zebrafish translation blocking MOs can sometimes stabilize target transcripts, causing an artifact that can be mistakenly assumed to be increased transcription (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7162184/). Some additional controls here would be warranted for making this conclusion.

      Thanks for raising this important experimental consideration. To date, we do not have any evidence for MO-mediated transcript stabilization in Nematostella, and we have not found such data in the literature on models other than zebrafish. mRNA stabilization by the MO also seemed unlikely because we were unable to KD zswim4-6 using several independent shRNAs - an effect we frequently observe with genes whose activity negatively regulates their own expression. However, to test the possibility that zswim4-6MO binding stabilizes zswim4-6 mRNA, we injected mRNA containing the zswim4-6MO recognition sequence followed by the mCherry coding sequence (zswim4-6MO-mCherry) with either zswim4-6MO or control MO. We could clearly detect mCherry fluorescence at 1 dpf if control MO was co-injected with the mRNA, but not if zswim4-6MO was co-injected with the mRNA. At 2 dpf (the stage at which we showed upregulation of zswim4-6 upon zswim4-6MO injection in Fig. 6I-I’), zswim4-6MO-mCherry mRNA was undetectable by in situ hybridization with our standard FITC-labeled mCherry probe independent of whether zswim4-6MO-mCherry mRNA was co-injected with the control MO or zswim4-6MO, while hybridization with the FITC-labeled FoxA probe worked perfectly.

      Author response image 1.

      We are currently offering two alternative hypotheses for the observed increase in zswim4-6 levels in the paper rather than stating explicitly that ZSWIM4-6 negatively regulates its own expression: “The KD of zswim4-6 translation resulted in a strong upregulation of zswim4-6 transcription, especially in the ectoderm, suggesting that ZSWIM4-6 might either act as its own transcriptional repressor or that zswim4-6 transcription reacts to the increased ectodermal pSMAD1/5 (Fig. 6I-I’).” Given the sensitivity of zswim4-6 to even the weakest pSMAD1/5 signal (zswim4-6 is expressed upon GDF5-like KD, which drastically reduces pSMAD1/5 signaling intensity; see Fig. 1 and 2 in Genikhovich et al., 2015, http://doi.org/10.1016/j.celrep.2015.02.035 and Fig. 6-Figure supplement 3 of this paper), the latter option (that it reacts to the increased ectodermal pSMAD1/5) is, in our opinion, clearly the more probable one.

      5) Zswim4-6 is proposed to be a co-repressor of pSmad1/5 targets based on the occupancy of zswim4-6 at the chordin BRE (which is normally repressed by BMP signaling) and lack of occupancy at the gremlin BRE (normally activated by BMP signaling). This is a promising preliminary result but is based only on the analysis of two genes. Since the authors identified BREs in other direct target genes, examining more genes would better support the model.

      We suggest that ZSWIM4-6 may be a co-repressor of pSMAD1/5 targets because it is a nuclear protein (Fig. 4G), whose knockdown results in the expansion of the ectodermal expression of several genes repressed by pSMAD1/5 in spite of the expansion of pSMAD1/5 itself (Fig. 6G-H’, 6L-M’, Fig. 6-Figure Supplement 4). Our limited ChIP analysis supports this idea by showing that ZSWIM4-6 is bound to the pSMAD1/5 site of chordin (repressed by pSMAD1/5) but not on gremlin (activated by pSMAD1/5). We agree that adding the analysis of more targets in order to challenge our hypothesis would be good. However, given technical limitations (having to inject many thousands of eggs with the EF1a::ZSWIM4-6-GFP plasmid in order to get enough nuclei to extract sufficient immunoprecipitated chromatin for qPCR on 3 genes (chordin, gremlin, GAPDH) for each biological replicate), it is currently unfortunately not feasible to test more genes. It will be of great interest for follow-up studies to generate a knock-in line with tagged zswim4-6 to analyze target binding on a genome-wide scale. We stress in the discussion that currently the power of our conclusion is low.

      6) The rationale for further examination of zswim4-6 function in Nematostella was based in part on it being a conserved direct BMP target in Nematostella and Xenopus. The analysis of zebrafish zswim5 function however does not examine whether zswim5 is a BMP target gene (direct or indirect). BMP inhibition followed by an in situ hybridization for zswim5 would establish whether its expression is activated downstream of BMP.

      In the paper by Greenfeld et al., 2021, zebrafish zswim5 was downregulated approximately 2.4x in the bmp7 mutant at 6 hpf. However, this gene was not among the 57 genes, which were considered to be direct BMP targets because their expression was affected by bmp7 mRNA injection into cycloheximide-treated bmp7 mutants (Greenfeld et al., 2021). We added this information to the text of the manuscript.

      7) Although there is a reduction in pSmad1/5/9 staining in zebrafish injected with zswim5 mRNA, it is difficult to tell whether the resulting morphological phenotypes closely resemble zebrafish with BMP pathway mutations (such as bmp2b). More analysis is warranted here to determine whether stereotypical BMP loss of function phenotypes are observed, such as dorsalization of the mesoderm and loss of ventral tail fin.

      We agree, and we have toned down all zebrafish arguments. Analyses of zswim5 mutants are currently ongoing.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript, the authors investigate the genes involved in the retention of eggs in Aedes aegypti females. They do so by identifying two candidate genes that are differentially expressed across the different reproductive phases and also show that the transcripts of those two genes are present in ovaries and in the proteome. Overall, I think this is interesting and impressive work that characterizes the function of those two specific protein-coding genes thoroughly. I also really enjoyed the figures. Although they were a bit packed, the visuals made it easy to follow the authors' arguments. I have a few concerns and suggested changes, listed below.

      1) These two genes/loci are definitely rapidly evolving. However, that does not automatically imply that positive selection has occurred in these genes. Clearly, you have demonstrated that these gene sequences might be important for fitness in Aedes aegypti. However, if these happen to be disordered proteins, then they would evolve rapidly, i.e., under fewer sequence constraints. In such a scenario, dN/dS values are likely to be high. Another possibility is that as these are expressed only in one tissue and most likely not expressed constitutively, they could be under relaxed constraints relative to all other genes in the genome. For instance, we know that average expression levels of protein-coding genes are highly correlated with their rate of molecular evolution (Drummond et al., 2005). Moreover, there have clearly been genome rearrangements and/or insertion/deletions in the studied gene sequences between closely-related species (as you have nicely shown), thus again dN/dS values will naturally be high. Thus, high values of dN/dS are neither surprising nor do they directly imply positive selection in this case. If the authors really want to investigate this further, they can use the McDonald-Kreitman test (McDonald and Kreitman 1991) to ask if non-synonymous divergence is higher than expected. However, this test would require population-level data. Alternatively, the authors can simply discuss adaptation as a possibility along with the others suggested above. A discussion of alternative hypotheses is extremely important and must be clearly laid out.

      We agree with the reviewer’s point that rapid evolution is not the same as positive selection. We also agree with the reviewer’s point that the McDonald-Kreitman test (MK test) is more powerful than dN/dS analysis. We took advantage of a large population dataset from Rose et al. 2020. After filtering the data, we kept 454 genomes for MK tests. We found that the results for the two genes are marginally significant or non-significant (tweedledee p = 0.068; tweedledum p = 0.048), even though these are small genes with low Pn values. This suggests that it is likely the genes evolve under positive selection.
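
      For readers unfamiliar with the test, the sketch below shows the general shape of an MK test as a 2×2 contingency analysis: fixed versus polymorphic differences, split into nonsynonymous and synonymous sites, evaluated here with a two-sided Fisher's exact test. The counts are hypothetical and are not the tweedledee or tweedledum counts, and we do not claim that this is the exact procedure or software used for the analysis reported above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    total = comb(n, c1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, p-value) for an MK table.

    dn, ds: fixed nonsynonymous / synonymous differences between species.
    pn, ps: nonsynonymous / synonymous polymorphisms within the species.
    """
    neutrality_index = (pn / ps) / (dn / ds)  # < 1 suggests an excess of Dn
    p_value = fisher_exact_two_sided(dn, pn, ds, ps)
    return neutrality_index, p_value


# Hypothetical counts, for illustration only.
ni, p = mk_test(dn=7, ds=17, pn=2, ps=42)
print(f"neutrality index = {ni:.3f}, Fisher p = {p:.4f}")
```

      A neutrality index below 1 together with a small p-value indicates more nonsynonymous fixed differences than expected from the polymorphism data, which is the pattern usually interpreted as evidence of positive selection. With few polymorphic sites (low Pn), the test has limited power.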

      In line with the reviewer’s suggestion, we performed another analysis using a large amount of population data. We asked if the SNP frequencies of tweedledee and tweedledum are correlated with environmental variables. We found that when compared to a distribution of 10,000 simulated genes with randomly-sampled genetic variants, both tweedledee and tweedledum showed significant correlation to multiple ecological variables reflecting climate variability, such as mean diurnal range, temperature seasonality, and precipitation seasonality (p<0.05). These results are now incorporated into the manuscript in Figure 5 and Figure 5 – Figure supplement 1.
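
      The comparison against simulated genes amounts to an empirical p-value. The sketch below illustrates that general idea only: the null values are random numbers standing in for the statistic computed on each simulated gene, the observed value is hypothetical, and the add-one correction is a common convention that we assume for illustration rather than a description of the exact computation behind Figure 5.

```python
import random


def empirical_p(observed, null_values):
    """One-sided empirical p-value with the add-one correction.

    Counts how many null statistics are at least as large as the observed
    one; the +1 terms keep the estimate strictly above zero.
    """
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Stand-in null distribution: absolute correlations for 10,000 simulated genes.
rng = random.Random(42)
null_corrs = [abs(rng.gauss(0, 0.1)) for _ in range(10_000)]

observed_corr = 0.35  # hypothetical observed |correlation| for one gene
p_emp = empirical_p(observed_corr, null_corrs)
print(f"empirical p = {p_emp:.4f}")
```

      With 10,000 simulated genes, the smallest p-value this estimator can return is 1/10,001, so the size of the null set bounds the resolution of the test.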

      2) The authors show that the two genes under study are important for the retention of viable eggs. However, as these genes are close to two other conserved genes (scratch and peritrophin-like gene), it is unclear to me how it is possible to rule out the contribution of the conserved genes to the same phenotype. Is it possible that the CRISPR deletion leads to the disruption of expression of one of the other important genes nearby (i.e., in a scratch or peritrophin-like gene) as the deleted region could have included a promoter region for instance, which is causing the phenotype you observe? Since all of these genes are so close to each other, it is possible that they are co-regulated and that tweedledee and tweedledum are expressed and translated along with the scratch and peritrophin-like gene. Do we know whether their expression patterns diverge and that scratch and peritrophin-like genes do not play a role in the retention of viable eggs?

      This is a fair criticism; however, we think the chance that the phenotypes are caused by interrupting nearby genes is very low. First, peritrophin-like acts in the immune response, and scratch is a brain-biased transcription factor. Neither of the genes shows expression in the ovary before or after blood feeding (genes with TPM <1 or 2 are generally considered unexpressed, while scratch and peritrophin-like expression levels are overall lower than 0.1 TPM).

      This suggests that peritrophin-like and scratch are not likely to function in the ovary. Thus, although we cannot completely rule out that the gene knockout impacts the regulation of very distant genes, it is unlikely. Given the mounting evidence we show in this manuscript that tweedledee and tweedledum are highly translated in the ovary after blood feeding, under the principle of parsimony we expect that the phenotypes came from knocking out the highly expressed and translated genes.

      Reviewer #2 (Public Review):

      This manuscript is overall quite convincing, presenting a well-thought-out approach to candidate gene detection and systemic follow-ups on two genes that meet their candidate gene criteria. There are several major claims made by the authors, and some have more compelling evidence than others, but in general, the conclusions are quite sound. My main issues stem from how the strategy to identify genes playing a role in egg retention success has led to very particular genes being examined, and so I question some of the elements of the discussion focusing on the rapid evolution and taxon-uniqueness of the identified genes. In short, while I believe the authors have demonstrated that tweedledee and tweedledum play an important role in egg retention, I'm not sure whether this study should be taken as evidence that taxon-specific or rapidly evolving genes, in general, are responsible for this adaptation, or simply play an important role in it.

      We have revised the paper to make it clearer that the focus is indeed on these two genes and not on the greater question of taxon-specific or rapidly-evolving genes.

      First, the authors present evidence that Aedes aegypti females can retain eggs when a source of fresh water is lacking, confirming that females are not attracted to human forearms while retaining eggs and that up to 70% of the retained eggs hatch after retaining them for nearly a week. This ability is likely an important adaptation that allows Aedes aegypti to thrive in a broad range of conditions. The data here seem fairly compelling.

      Based on this observation, the authors reason that genes responsible for the ability to retain eggs must: 1) be highly expressed in ovaries during retention, but not before or after. 2) be taxon-specific (as this behavior seems limited to Aedes aegypti). While this approach to enriching candidate genes has proven fruitful in this particular case, I'm not sure I agree with the authors' rationale. First, even genes at a low expression in the ovaries may be crucial to egg retention. Second, while egg-laying behavior is vastly varied in insects, I'm not sure focusing on taxon-restricted genes is necessary. It is entirely possible that many of the genes identified in Figure 2E play a crucial role in egg retention evolution. These are minor issues, but they are relevant to some later points made by the authors.

      We regret framing the discovery of tweedledee and tweedledum in the original submission using this somewhat artificial set of filtering criteria. The reality is that the genes caught our attention for their novel sequence, tight genetic linkage, and interesting expression profile. That really is the focus of the paper, not these other peripheral questions that have been the focus of attention of the reviews. We really do apologize for all of the confusion about what this paper is about.

      Nonetheless, the authors provide very compelling evidence that the two genes meeting their criteria - tweedledee and tweedledum, play an important role in egg retention. The genes seem to be expressed primarily in ovaries during egg retention (some observed expression in brain/testes is expected for any gene), and the proteins they code seem to be found in elevated quantities in both ovaries and hemolymph during and immediately after egg retention. RNA for the genes is detected in follicles within the ovary, and CRISPR knockouts of both the genes lead to a large decrease in egg viability post retention.

      My earlier qualms about their search strategy relay into some issues with Figure 4, which describes how the two genes are 1) taxon-restricted and 2) have evolved very rapidly. Neither of the two statements is unexpected given the authors' search strategy. Of course, the genes examined precisely for their lack of homologs do not have any homologs. Similarly, by limiting themselves to genes that show a lack of homology (i.e. low sequence similarity) to other genes as well as genes with high expression levels in the ovaries, a higher rate of evolution is almost inevitable to infer (as ovary expressed genes tend to evolve more rapidly in mosquitoes). I agree with the authors that inferences of the evolutionary history of these genes are quite difficult because of their uniqueness, and I especially appreciate their attempts to identify homologs (although I really dislike the term "conceptualog").

      We have removed our term “conceptualog” and replaced it with the more conventional “putative ortholog”.

      This leads to my main (fairly minor) issue of the paper - the discussion on the evolutionary history of these genes and its implications (sections "Taxon-restricted genes underlie tailored adaptations in a diverse world" and "Evolutionary histories and catering to different natural histories"). As noted, inferring this history is very difficult because the authors have focused on two rapidly evolving, taxon-restricted genes. The analyses they have performed here definitely demonstrate that the genes play an important role in egg retention, however, they do not show that taxon-restricted genes play a disproportionate role in egg retention evolution. Indeed, the only data relevant to this point would be the proportion of genes in Figure 2E that are taxon-restricted (3/9), but I'm not sure what the null expectation for this proportion for highly expressed ovary genes is to begin with. Furthermore, the extremely rapid evolution of this gene makes it hard to judge how truly taxon-restricted it is. My own search of tweedle homologs identified multiple as previously having been predicted to be "Knr4/Smi1-like", and while no similar genes are located in a similar location in melanogaster, there is generally little synteny conservation in Drosophila (for instance Bhutkar et al 2008), so I'm unsure what can really be said about their evolutionary origins/lack of homologs in Drosophila.

      In short - the manuscript makes clear that tweedledee and tweedledum play an important role in egg retention in A. aegypti, nonetheless, it is not clear that this is a demonstration of how important taxon-restricted genes are to understanding the evolution of life-history strategies.

      Again, we should have never framed the paper the way we did in the original version. We make no claims whatsoever that taxon-restricted genes in general should play a role in this biology, only that the two candidate genes under study influence egg viability after extended retention. We hope that the framing is clearer in this revision.

    1. Author Response:

      Reviewer #1 (Public Review):

      This study sought to systematically identify the components and driving forces of transcriptome evolution in fungi that exhibit complex multicellularity (CM). The authors examined a series of parameters or expression signatures (i.e. natural antisense transcripts, allele-specific expression, RNA-editing) concluding that the best predictor of a gene behavior in the CM transcriptome was evolutionary age.

      Thus, the transcriptomes of fruiting bodies showed a distinct gene-age-related stratification, where it was possible to sort out genes related to general sexual processes from those likely linked to morphogenetic aspects of the CM fruiting bodies. Notably, their results did not support a developmental hourglass, which is the rather predominant hypothesis in metazoans, including some analysis in fungi.

      The studies involved analyses of new transcriptomic datasets for different developmental stages (and tissue types in some cases) of Pleurotus ostreatus and Pterula gracilis, as well as the analyses of existing datasets for other fungi.

      There are diverse interesting observations such as ones regarding Allele Specific Expression (ASE), suggesting that in P. ostreatus ASE mainly occurs due to cis-regulatory allele divergence, possibly in fast evolving genes that are not under strong selection constraints, such as ones grouped in youngest gene ages categories. In addition, a large number of conserved unannotated genes among CM-specific orthogroups highlights the rather cryptic nature of CM in fungi and raises as an important area for future research.

      Some of the key aspects of the analyses would need to be better exemplified such as:

      – Providing a better description of the developmentally expressed TFs only in CM species

      – Providing clear examples of the promoter divergence that could be the underlying mechanism behind ASE. In particular, for some cases, there may be enough information in the literature/databases to predict the appearance or disappearance of relevant cis-elements in the promoters showing the highest divergence in genes depicting the highest levels of ASE.

      We appreciate the constructive comments of the Reviewer and have revised the ms in accordance with the suggestions. In particular, we now link different parts of the ms better to each other and provide a more detailed discussion of developmentally expressed TFs (lines 615-621). We also provide case studies of ASE genes with cis-regulatory divergence (Figure 5 and see below), although we note that these analyses are based on inferred rather than directly determined motifs, so they should be considered preliminary.

      We had considered using TF binding motifs previously, and have now tried analyzing potential transcription factor binding sites in divergent promoters. We find that there are no P. ostreatus transcription factors for which motifs based on direct evidence are available; rather, all P. ostreatus motifs are based on extrapolations from experimentally determined motifs (typically in Neurospora crassa). Therefore, to avoid overly general motifs, we used only those in which at least 5 nucleotides show at least 80% expected frequency in the PWMs. This left us with 158 motifs (126 excluded). A high motif binding score (>=4) and self-rate (>=0.9) were also required to exclude false positive hits. Differences in binding ability and lack of binding in one of the parental genomes were counted for each promoter. We found that genes with allele-specific expression (ASE S2 and S4) show significantly higher differences in motif binding (lacking motifs, or different binding ability) than non-ASE genes (Fig. A1). These observations show that not only promoter divergence but also differential predicted TF binding ability is more common among ASE genes than among non-ASE genes. This supports our conjecture that ASE arises from cis-regulatory divergence.
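
      As a minimal sketch of the motif-specificity filter described above (not the pipeline actually used), the rule "at least 5 nucleotides with at least 80% expected frequency" can be read as: keep a PWM only if at least 5 of its positions have a most frequent nucleotide with frequency >= 0.8. This reading, the list-of-rows PWM representation, and the names `informative_positions`, `filter_motifs` and `toy_motifs` are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical toy representation: a PWM is a list of positions, and each
# position holds the expected frequencies of A, C, G and T.
Pwm = List[Sequence[float]]


def informative_positions(pwm: Pwm, min_freq: float = 0.8) -> int:
    """Count positions whose most frequent nucleotide reaches min_freq."""
    return sum(1 for position in pwm if max(position) >= min_freq)


def filter_motifs(
    motifs: Dict[str, Pwm], min_positions: int = 5, min_freq: float = 0.8
) -> Dict[str, Pwm]:
    """Keep only motifs with at least min_positions informative positions."""
    return {
        name: pwm
        for name, pwm in motifs.items()
        if informative_positions(pwm, min_freq) >= min_positions
    }


# Invented example motifs (not real transcription factor motifs).
toy_motifs = {
    "specific_motif": [[0.9, 0.05, 0.03, 0.02]] * 6,    # 6 informative positions
    "degenerate_motif": [[0.4, 0.3, 0.2, 0.1]] * 8,     # no informative position
    "short_motif": [[0.05, 0.85, 0.05, 0.05]] * 4,      # only 4 informative positions
}

kept = filter_motifs(toy_motifs)
print(sorted(kept))
```

      The binding-score (>=4) and self-rate (>=0.9) thresholds apply to individual hits rather than to the motifs themselves, so they are deliberately left out of this sketch.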

      Fig A1: The left plot below shows the number of cases when the promoter of one allele of an allele pair in the two parent genomes has, but the other lacks a motif. The right plot shows the same in terms of difference in binding score.

      We could find examples, such as the allele specific expression of PleosPC15_2_1031042, a Hemerythrin-like (IPR012312) protein, which might be regulated by the conserved c2h2 transcription factor, which contains a zinc finger domain of the C2H2 type (Fig. A2). Targeted gene inactivation has already shown c2h2 to be important during the initiation of primordia formation (Ohm et al 2011, https://pubmed.ncbi.nlm.nih.gov/21815946/). A binding site of c2h2 was detected in the upstream region of PleosPC15_2_1031042. There is a mismatch in the inferred binding motif, which causes a reduced binding score in PC15 (Fig. A2/c). Consistent with this, the PC9 nuclei contribute more to the total expression of this gene.

      Despite this and other (not shown) examples that we have found, we were not convinced of the reliability of this approach. The analysis rests on many assumptions: the position weight matrices (PWMs) that we used are all based on indirect evidence; these PWMs identify a high number of loci; the position of the binding site relative to the transcriptional start site is uncertain; and the relation between differences in binding motifs and expression changes is unclear. We consider that these factors may contribute too much noise for the analyses to be robust; therefore, we are hesitant to include these results in the ms.

      Fig A2: An example of promoter divergence. a) Expression of the c2h2 transcription factor (TF) in P. ostreatus. b) Allele-specific expression pattern of PleosPC15_2_1031042 from the two parental genomes. c) Inferred binding motif of the c2h2 TF and a detected potential binding site in the upstream region of the PleosPC15_2_1031042 gene.

      Reviewer #2 (Public Review):

      The evolution of complex multicellularity represents a major developmental reprogramming, and comparing related species which differ in multicellular structures may shed light on the mechanisms involved. Here, the authors compare species of Basidiomycete fungi and focus on analyzing developmental transcriptomes to identify patterns across species. Deep RNA-Seq data is generated for two species, P. ostreatus and Pt. gracilis, sampling different developmental stages. The authors report conflicting evidence for a "developmental hourglass" using a weighted transcription index vs gene age categories. There is substantial allele-specific expression in P. ostreatus, and these genes tend to have a more recent origin, have more divergent upstream regions and coding sequences, and are enriched for developmentally regulated transcripts. Antisense transcripts have low overlap with coding regions and low conservation, and a subset show a positive or negative correlation with the overlapping gene. Comparison to a species without complex multicellular development is used to further classify the developmental program.

      Overall the new transcriptional data and extensive analysis provide a thorough view of the types of transcripts that appear differentially regulated, their age, and associated gene function enrichment. The gene sets identified from this analysis as well as the potential to re-analyze this data will be useful to the community studying multicellularity in fungi. The primary insights drawn in this study relate to the dating of the developmental transcriptome, however some patterns observed with young genes and noncoding transcripts are primarily reflective of expected patterns of evolutionary time.

      We appreciate the Reviewer’s kind words on our ms; we think the revised version is substantially improved in many of the aspects listed above.

      Reviewer #3 (Public Review):

      Fungi are unique in forming complex 3D multicellular reproductive structures from 2D mycelium filaments, a property used in this paper to study the genetic changes associated with the evolution of complex 3D multicellularity. The manuscript by Merenyi et al. investigates the evolution of gene expression and genome regulation during the formation of reproductive structures (fruiting bodies) in the Agaricomycetes lineage of Fungi. Transcriptome and multicellularity evolution are very exciting fundamental questions in biology that only become accessible with recent technological developments and the appropriate analysis framework. Important perspectives include understanding how genes acquire new functions and what role plays transcriptional regulation in adaptation. The study gathers a very useful dataset to this end, and relies on generally relevant hypotheses-driven analyses.

      Analysis of fruiting body transcriptome in nine species revealed that prediction from the development hourglass model (that young genes are expressed in early and late but not intermediate phases of development) verified only in a few species, including Pleurotus ostreatus. An allele-specific expression (ASE) analysis in P. ostreatus showed that young genes frequently show ASE during fruiting body development. A comparative analysis with C. neoformans, which reproduces sexually without forming fruiting body, indicates that young and old (but not intermediate) genes are likely involved specifically in fruiting body morphogenesis. A number of underlying hypothesis could be presented better, and importantly the connection between the various analyses did not appear obvious to me. Some hypotheses and reasoning therefore need clarification. Some important results from the analyses are not provided and not commented, although they are required to fully meet the manuscript's objectives.

      We appreciate the Reviewer’s suggestions and have revised the ms as explained below.

      1. I do not clearly see the connection between the developmental hourglass model studied in the first part of the ms, and the allele-specific expression patterns in the second half of the ms. Which "phase" of the hourglass is expected to contain true CM-related genes (by contrast to general sexual processes)? Was P. ostreatus chosen for the ASE analysis because evidence for a developmental hourglass pattern was detected in this species? The conclusion that "evolutionary age predicts, to a large extent, the behaviour of a gene in the CM transcriptome" was established thanks to ASE in P. ostreatus, which was also found to be rather an exception for conforming to the hourglass model of developmental evolution. To what extent is this conclusion transferable to other Agaricomycete/fungal species?

      We chose P. ostreatus because this is the only species for which the genomes of both parental strains (PC9 and PC15) are available. Although the hourglass concept is indeed a central hypothesis in animal developmental biology (though see some recent critiques; Piasecka et al 2013), our results suggest that it simply does not generally apply to fungal development. This may be due to the unique developmental mechanisms of fungi, or the independent origin(s) of CM in fungi. Our ms might have been misleading in this respect; in the revision we clarify that the ASE and hourglass analyses are independent of each other. Our interpretation of the hourglass results is that this model is not, or hardly, applicable to fungal development, and the fact that P. ostreatus was the only species that in fact showed an hourglass did not drive our selection of this species. We inserted a note on this in the ms.

      2. The authors acknowledge that fruiting body-expressed genes may relate either to CM or to more general sexual functions, and that disentangling these functions is a major challenge in their study. An overview of which gene was assigned to which function is not explicit in the ms (proposed to be described in a separate publication). Do these functional gene classes show distinct transcriptome evolution patterns (hourglass model, ASE...)?

      We have made the complete list of CM-related genes and genes with more general sexual functions accessible in Table S2/b-c. Due to length restrictions, we do not discuss each of these genes here, but provide gene ontology-based overviews (Fig 8/c-d, from line 631). To answer the question of whether CM-specific vs shared genes show distinct transcriptomic patterns, we analyzed ASE, NATs and the hourglass model separately for CM-specific and shared genes, as follows:

      -hourglass: We calculated and visualised the TAI for the CM-specific and Shared gene sets of P. ostreatus separately. The average TAI decreased considerably in Shared genes, possibly due to the overrepresentation of ancient genes in this set, but the patterns remained similar to the original, which implies that the patterns are not driven by only one or the other gene set (Fig A3).

      Fig A3: Transcriptome Age Index for CM-specific and Shared gene sets of P. ostreatus separately

      -ASE: As we detailed in the ms, allele specific expression occurs mainly in young genes. Indeed, only 13.1% of ASE genes belong to the conserved gene sets (CM-specific: 200 and Shared: 144). Although there are more ASE genes (>2FC) among CM-specific genes, they are still underrepresented compared to young genes that are neither shared, nor CM-specific. This indicates that ASE is generally a feature of non-conserved genes and is not particularly characteristic for either conserved or CM-specific genes.

      -NAT: We found that 17.3% of CM-specific (141 genes) and 18.3% of Shared genes (165 genes) overlap with antisense transcripts. Since these numbers do not differ substantially from 17.6%, the proportion of all protein coding genes that overlap with NATs, this implies that NATs occur independently of these gene conservation groups.
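
      For readers unfamiliar with the Transcriptome Age Index (TAI) used in the hourglass comparison above, the following is a minimal sketch, assuming the standard definition of TAI as the expression-weighted mean of gene ages (phylostratum ranks) at a given developmental stage, restricted to a gene set. The gene names, ages, expression values and set memberships are invented for illustration and are not taken from the P. ostreatus data.

```python
from typing import Dict, Iterable


def transcriptome_age_index(
    age: Dict[str, int], expr: Dict[str, float], genes: Iterable[str]
) -> float:
    """Expression-weighted mean gene age over the given genes at one stage.

    age  : gene -> phylostratum rank (1 = oldest)
    expr : gene -> expression level at the stage of interest
    """
    genes = list(genes)
    total_expr = sum(expr[g] for g in genes)
    if total_expr <= 0:
        raise ValueError("total expression of the gene set must be positive")
    return sum(age[g] * expr[g] for g in genes) / total_expr


# Hypothetical toy data for a single developmental stage.
age = {"g1": 1, "g2": 1, "g3": 5, "g4": 8}
expr = {"g1": 10.0, "g2": 10.0, "g3": 20.0, "g4": 0.0}
shared = ["g1", "g2"]        # stand-in for the Shared gene set
cm_specific = ["g3", "g4"]   # stand-in for the CM-specific gene set

tai_all = transcriptome_age_index(age, expr, age)  # iterating a dict yields its keys
tai_shared = transcriptome_age_index(age, expr, shared)
tai_cm = transcriptome_age_index(age, expr, cm_specific)
print(tai_all, tai_shared, tai_cm)
```

      In this toy example the set dominated by old genes has a lower TAI than the full set, which mirrors the drop in average TAI reported for Shared genes; repeating the calculation per stage would give the profile whose shape is compared between gene sets.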

      3. As far as I understand, major functions of the fruiting body transcriptome are either CM or general sexual functions. Could these genes, notably those showing ASE, play a role in general processes other than sexual development (hyphal growth, environment sensing, cell homeostasis, pathogenicity)?

      Certainly, ASE might also occur in genes related to these processes. However, the processes mentioned by the Reviewer are likely associated with very conserved genes (except pathogenicity, which we can’t examine here) and our results suggest that ASE is more typical of young genes that are under weak selection. We detected ASE in 931/343 genes (S2/S4) expressed in the vegetative mycelium stage of P. ostreatus. We also note that, by the definition of developmentally regulated genes, we do not expect very basic fungal processes, like hyphal growth, to be among the functions of the genes we identified. Genes related to such basic (housekeeping) processes usually (exceptions exist) show flat expression profiles (because they are equally important in mycelia and all fruiting body stages) and will not be picked up by our pipelines for identifying shared developmentally regulated genes.

      4. As stated by the authors, "the goal of this study was to systematically tease apart the components and driving forces of transcriptome evolution in CM fungi". What drives the interesting ASE pattern discovered however remains an open question at the end of the ms. The authors appropriately discuss that these patterns could be either adaptive or neutral but there is no direct evidence for any scenario in P. ostreatus. Is the expression of (some of) the young genes showing ASE required for CM? one or two case studies would allow providing support for such scenarios.

      We respectfully disagree. We provide evidence that the driving force of ASE is promoter divergence (and consequently differential transcription factor binding) in genes in which it is tolerated (see conclusions, lines 708-712). Our results suggest that ASE is mostly a neutrally arising phenomenon. To get to the mechanistic bases of how promoter divergence can cause ASE (following the suggestion of Reviewer 1), we analysed putative, inferred transcription factor binding motifs in P. ostreatus and found that ASE genes had more divergence in putative TF binding sites. However, it is important to emphasise that all TF motifs we analyzed are inferred motifs and therefore these results are indicative at best.

      Reviewer #4 (Public Review):

      This work develops a comparative framework to test genes which support complex morphological structures and complex multicelluarity. This expands beyond simple gene sharing and phylogenomics by incorporating comparison of gene expression profiling of development of multicellular structures during sexual reproduction. This approach tests the hypothesis that genes underlying sexual reproductive structure formation are homologous and the molecular evolutionary processes that control transcriptome evolution which underlie complex multicellularity.

      The approaches used are appropriate and employ modern comparative and transcriptome analyses to example allele specific expression, and evaluate an age of the evolutionary ages of genes. This work produced additional new RNAseq to examine developmental processes and combined it with existing published data to contrast fungi with either complex morphologies or yeast forms.

      The strengths of work are well selected comparison organisms and efforts to have developmental stages which are appropriate comparisons.

      We appreciate the Reviewer’s positive comments.

      Weakness could be pointed to in how the NAT descriptions are interesting it isn't clear how they link directly to morphology variation or development. I am unclear if these are arising from new de novo promotors, are ferried by transposable elements, or if any other understanding of their genesis indicates they are more than very recent gains in a species for the most part and not part of any conserved developmental process (outside a few exemplars).

      Originally, we assayed natural antisense transcripts (NAT) based on the assumption that they regulate developmental processes (e.g. Kim et al 2018 https://doi.org/10.1128/mBio.01292-18). Our analyses showed that although NATs are abundant in CM transcriptomes of fungi, they show no homology across species and so are unlikely to drive conserved developmental processes, which we are after in this ms. Rather, our data are compatible with most (but likely not all) NATs being transcriptional noise, arising from novel or random promoters. We therefore shortened this section and moved much of it to Appendix 1.

      The impact of this work will reside in how gene age intersects with variability and relative importance in CM. it will be interesting to see future work examine the functions of these genes and test how allele specific expression and specific alleles are contributing to the formation of these tissues and growth forms. I am still not sure if molecular mechanisms of how high variability in gene expression is still producing relatively uniform morphologies, or if it isn't quantification of morphological variation would be nice to link to whether ASE underlie that.

      We agree that allele specific expression could influence morphologies significantly, but investigating that is beyond the scope of the current work (it would require a population genomics project). More direct evidence on allelic differences can be seen in monokaryon phenotypes, which only express one of the parental alleles. Phenotypic differences are obvious in the mycelium of the two parental monokaryons: the mycelium of PC9 is fluffier and grows faster than that of PC15. This was reported recently by Lee et al 2021 (https://doi.org/10.1093/g3journal/jkaa008). We agree with the Reviewer that this is a very exciting future research direction.

      To my read of the work, the authors achieved their goals and confirmed hypothesis about the age of genes and the variability of gene expression. I still feel there is some clarity lost in whether the findings across the large number of species compared here help inform predictions or classifications of types of genes which either have ASE or are implicated in CM. This is really work for the future as the authors have provided a detailed analysis and approach that can fuel further direction in this research area.

      To address this issue we reworked the ms to make connections between ASE and CM clearer. Because our results suggest that ASE (mostly) arises neutrally, predictions for other species are expected to be hard. On the other hand, we think we can make confident predictions on what types of genes are implicated in CM in other species, at least for conserved aspects of fruiting body development.

    1. Author Response

      Reviewer #1 (Public Review):

      In this study, the authors set out to clarify the relationship between brain oscillations and different levels of speech (syllables, words, phrases) using MEG. They presented word lists and sentences and used task instructions to attempt to focus listeners' attention on different levels of linguistic analysis (syllables, words, phrases).

      1) I came away with mixed feelings about the task design: following each stimulus (sentence or word list), participants were asked to (a) press a button (i.e. nothing related to what they heard), (b) indicate which of two syllables was heard, (c) indicate which of two words was heard, (d) indicate which pair of words was present in the correct order. This task is the critical manipulation in the study, as it is intended to encourage (or in the authors' words, "require") participants to focus on different timescales of speech (syllable, word, and phrase, respectively). I very much like the idea of keeping the physical stimuli unchanged, and manipulating attention through task demands - an elegant and effective approach. At the same time, I have reservations about the degree to which these task instructions altered attention during listening. My intuition is that, if I were a participant, I would just listen attentively, and then answer the question about the specific level. For example, I don't know that knowing I would be doing a "word pair" task, I would be attending at a slower rate than a "word" task, as in both cases I would be motivated to understand all of the words in the sentence. I fully acknowledge my introspection (n=1) may be flawed here, but nevertheless, any additional support validating the effect of these instructions would help the interpretation of the MEG results.

      The reviewer points out that, to do any task on sentences (such as a word task or a syllable task), participants’ strategy could be to understand the full meaning of the sentence and infer the lower-level properties from that understanding. We fully share this introspection, which would suggest that extracting sentence meaning is partly automatic (or at least a default mode of processing) and independent of behavioral relevance. While the reviewer sees this as a downside of the design, this is part of what our study tried to disentangle (automatic versus task-dependent processing at lower-frequency time-scales). If, as the reviewer points out, all processing of sentences were automatic, we should not find any effect of task (as the task should not affect the tracking response at all). We found that overall the tracking response is robust to task-induced manipulation of attention: the main effect that MI to phrases is higher for sentences than for word lists is robust across passive and task conditions. But that is not the whole story at the source level, where we do find some task effects, which indicates that task instructions do matter. This means that participants changed their strategy depending on the instructions, but that overall, tracking of linguistic structures such as phrases is automatic. We show that in the IFG, phrasal time scales are tracked more strongly (higher MI) during the phrase task than during the other tasks. This is also reflected in stronger STG-IFG connectivity during the phrasal versus the passive task. These results speak against the interpretation that the task instructions did not alter attention during listening. While there are these subtle task differences, especially in IFG, overall our findings do speak for an automatic tracking of phrasal-rate structure in sentences independent of task.
      We therefore concluded that “automatic understanding of linguistic information, and all the processing that this entails, cannot be countered to substantially change the consequences for neural readout, even when explicitly instructing participants to pay attention to particular time-scales” (lines 548-549).

      The analysis steps generally seem sensible and well-suited to answering the main claims of the study. Controlling for power differences between conditions through matching was a nice feature.

      2) I had a concern about accuracy differences (as seen in Figure 1) across stimulus materials and tasks. In particular, for the phrase task, participants were more accurate for sentence stimuli than word list stimuli. I think this makes a lot of sense, as a coherent sentence will be easier to remember in order than a list of words. But, I did not see accuracy taken into account in any of the analyses. These behavioral differences raise the possibility that the MEG results related to the sentence > word list contrast in phrases (which seems one of the most interesting findings in IFG) simply reflect differences in accuracy.

      With the caveat of the concern regarding accuracy differences, the research goals were clear and the conclusions were generally supported by the analyses.

      Thank you for pointing this out. We have now taken accuracy into account in our analysis. It did not change any of our main findings or conclusions, and strengthened the argument that tracking of phrases in sentences vs. word lists is stronger. The influence of task difficulty is a relevant point to investigate (also see point 1 of reviewer 2 and point 4 of reviewer 3). To do so we added accuracy (per participant per condition) as a factor in the mixed model (as well as all interactions with task and condition) for the MI, power, and connectivity analyses at the phrasal rate/delta band. Note that because there is no accuracy measure for the passive task, we removed the passive task from these analyses. We could also only run models with random intercepts (not random slopes), due to the reduced number of degrees of freedom when adding the factor accuracy to the models.
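
      For concreteness, the random-intercept model described above can be sketched as follows. This is an illustrative formulation under the usual linear mixed model assumptions, not a transcription of the fitted model: the symbols are introduced here, with y the MEG measure (MI, power or WPLI) for participant p in task t and condition c, a the corresponding accuracy, and u_p a per-participant random intercept.

```latex
y_{p,t,c} = \beta_0
          + \beta_{\mathrm{task}(t)}
          + \beta_{\mathrm{cond}(c)}
          + \beta_{\mathrm{acc}}\, a_{p,t,c}
          + (\text{all two- and three-way interactions of task, condition and accuracy})
          + u_p + \varepsilon_{p,t,c},
\qquad
u_p \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{p,t,c} \sim \mathcal{N}(0, \sigma^2)
```

      Random slopes would add further per-participant terms (for example a participant-specific effect of condition), which is what the reduced degrees of freedom did not allow.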

      For the MI analysis we only found an effect in MTG. Specifically, there was a three-way interaction between task, condition and accuracy (F(2, 91.9) = 3.4591, p = 0.036). To follow up on this three-way interaction we split the data per task. The condition*accuracy interaction was only (uncorrected) significant for the word combination task (F(1,24.8) = 5.296, p = 0.03 (uncorrected)) and not for any other task (p>0.1). In the word combination task, we found that the difference between sentences and word lists was the strongest at high accuracies (see the predicted values of the model in the figure below). One way to interpret this finding is that stronger phrasal-rate MI tracking in MTG promotes phrasal-rate processing (as indicated by accuracy) more in sentences than in word lists.

      MEG – behavioral performance relation. A) Predicted values for the phrasal band MI in the MTG for the word combination task separately for the two conditions. B) Predicted values for the delta band WPLI in the STG-MTG connection separately for the two conditions. Error bars indicate the 95% confidence interval of the fit. Colored lines at the bottom indicate individual datapoints.

      For power we did not find any effect of accuracy. For the connectivity analysis we found a significant condition*accuracy interaction in the STG-MTG connectivity (F(1, 80.23)=5.19, p = 0.025). The condition*accuracy interaction showed that lower accuracies were generally associated with stronger differences between the sentences and word lists (see figure; the opposite of the MI analysis). Thus, functional connections in the delta band are stronger during sentence processing when participants have difficulty with the task (independent of the task performed). This could indicate that low-frequency connections are more relevant for the sentence than the word list condition (as the reviewer also indicated in point 1).

      After correcting for accuracy there was also a significant task*condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was a stronger WPLI for the sentence compared to the word list condition, the interaction seemed to indicate that this was especially the case during the word task (p = 0.005 corrected), but not for the other tasks (p>0.1).

      We added the results of the accuracy analyses to the main manuscript and added a dedicated section to our discussion (pages 21-22). Adding accuracy did not remove any of the effects we report in the original analyses. Therefore, none of these findings changes the interpretation of the results, as the task still had an influence on the MI responses of MTG and IFG. The effect of accuracy in the MTG refined the results, showing that the effect was strongest there for participants with high accuracies. This relationship suggests a functional role of tracking through phase alignment for understanding phrasal structure.

      The methods now read: “MEG-behavioural performance analysis: To investigate the relation between the MEG measures and the behavioural performance we repeated the analyses (MI, power, and connectivity) but added accuracy as a factor (together with the interactions with the task and condition factor). As there is no accuracy for the passive task, we removed this task from the analysis. We then followed the same analysis steps as before. Since we reduced our degrees of freedom, we could however only create random intercept and not random slope models”.

      The results now read: “MEG-behavioural performance relation. We found for the MI analysis a significant effect of accuracy only in the MTG. Here, we found a three-way interaction between accuracy*task*condition (F(2, 91.9) = 3.459, p = 0.036). Splitting up for the three different tasks we found only an uncorrected significant effect for the condition*accuracy interaction for the phrasal task (F(1, 24.8) = 5.296, p = 0.03) and not for the other two tasks (p>0.1). In the phrasal task, we found that when accuracy was high, there was a stronger difference between the sentence and the word list condition compared to when accuracy was low, with stronger accuracy for the sentence condition (Figure 7A).

      No relation between accuracy and power was found. For the connectivity analysis we found a significant condition*accuracy interaction for the STG-MTG connection (F(1,80.23) = 5.19, p = 0.025; Figure 7B). Independent of task, when accuracy was low the difference between sentence and word lists was stronger with higher WPLI fits for the sentence condition. After correcting for accuracy there was also a significant task*condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was a stronger WPLI for the sentence compared to the word list condition, the interaction seemed to indicate that this was especially the case during the word task (p = 0.005), but not for the other tasks (p>0.1).”

      The discussion now reads: “We found that across participants both the MI and the connectivity in temporal cortex influenced behavioural performance. Specifically, MTG-STG connections were, independent of task, related to accuracy. There was higher connectivity between MTG and STG for sentences compared to word lists at low accuracies. At high accuracies, we found stronger MTG tracking at phrasal rates (measured with MI) for sentences compared to word lists during the word combination task. These results suggest that indeed tracking of phrasal structure in MTG is relevant to understand sentences compared to word lists. This was reflected in a general increase in delta connectivity differences when the task was difficult (Figure 7B). Participants might compensate for the difficulty using phrasal structure present in the sentence condition. When phrasal structure in sentences is accurately tracked (as measured with MI) performance is better when these rates are relevant (Figure 7A). These results point to a role for phrasal tracking for accurately understanding the higher order linguistic structure in sentences even though more research is needed to verify this. It is evident that the connectivity and tracking correlations to behaviour do not explain all variation in the behavioural performance (compare Figure 1 with 3). Plainly, temporal tracking does not explain everything in language processing. Besides tracking there are many other components important for our designated tasks, such as memory load and semantic context which are not captured by our current analyses.”

      Reviewer #2 (Public Review):

      In a MEG study, the authors investigate as their main question whether neural tracking at the phrasal time scale reflects linguistic structure building (testing different conditions: sentences vs. word-lists) or an attentional focus on the phrasal time scale (testing different tasks, passive listening, syllable task, word task, word combination/phrasal scale task). They perform the following analyses at brain areas (ROIs: STG, IFG, MTG) of the language network: (1) Mutual information (MI) between the acoustic envelope and the delta band neuronal signals is analyzed. (2) Power in the delta band is analyzed. (3) Connectivity is analyzed using debiased WPLI. For all analyses, linear mixed-models are separately conducted for each ROI. The main finding is that the sentence compared to the word-list condition is more strongly tracked at the phrasal scale (MI). In STG the effect was task-independent; in MTG the effect only occurred for active tasks; and in IFG additionally, the word-combining/phrasal scale task resulted in higher tracking compared to all other tasks. The authors conclude that phrasal scale neural tracking reflects linguistic processing which takes place automatically, while task-related attention contributes additionally at IFG (interpreted as combinatorial hub involved in language and non-language processing). The findings are stable when power differences are controlled. The connectivity analysis showed increased connectivity in the delta band (phrasal time scale) between IFG-STG in the phrasal-scale compared to the passive task (adding to the IFG MI findings). (Additionally, they separately analyze neural tracking at the syllabic and word time scale, which however is not in the main focus).

      Major strength/weaknesses of the methods and results:

      1) A major strength of the results is that part of them replicate the authors' earlier findings (i.e. higher tracking at the phrasal time scale for sentences compared to word-lists; Kaufeld et al., 2020), while they complement this earlier work by showing that the effects are due to linguistic processing and not to an attentional focus on the phrasal time scale due to the task (at least in STG and MTG; while the task plays a role for the IFG tracking). Another strength is that a power control analysis is applied, which allows excluding spurious results due to condition differences in power. A weakness of the method is that analyses were applied separately per ROI, and combined across correct/incorrect trials (if I understood correctly), no trial-based analysis was conducted (which is related to how MI is computed). Furthermore, several methodological details could be clarified in the manuscript.

      The authors achieved their aims by providing evidence that neuronal tracking at the phrasal time scale in STG and MTG depends on the presence of linguistic information at this scale rather than indicating an attentional focus on this time scale due to a specific task. Their results support the conclusion. Results would be strengthened by showing that these effects are not impacted by different amounts of correct/incorrect trials across conditions (if I understood that correctly).

      We thank the reviewer for her comments. It is correct that we collapsed across the correct and incorrect trials. This had various reasons (also see points 2 and 9 of reviewer 1 and point 4 of reviewer 3). First, our tasks function solely to direct participants’ attention to the various linguistic representations (syllables, words, phrases) and the timescales that they occur on. The three tasks are in a sense more control tasks to study the tracking response, and manipulate attention as tracking during spoken language comprehension occurs, rather than a case where the neural response to the tasks is itself to be studied. For example, in a typical working memory paradigm, it is only during correct trials that the relevant cognitive process occurs. In contrast, in our paradigm, it is likely that spoken stimuli are heard and processed, in other words, sentence comprehension and word list perception occur, even during incorrect trials in the syllable condition. As such, we do not expect MI tracking responses to explain the behavioral data. However, we agree it is crucially important to show that MI differences are not a function of task performance differences.

      Second, there are clear differences in difficulty level of the trials within conditions. For example, if the target question was related to the last part of the audio fragment, the task was much easier than when it was at the beginning of the audio fragment. In the syllable task, if syllables also were (by chance) a part-word, the trial was also much easier. If we were to split the trials into correct and incorrect, we would not isolate processes due to accurately processing the speech fragments, but would also confound the analysis with the individual difficulty level of the trials.

      To acknowledge this, we added this limitation to the methods. The methods now read: “Note that different trials within a task were not matched for task difficulty. For example, in the syllable task syllables that make a word are much easier to recognize than syllables that do not make a word. Additionally, trials pertaining to the beginning of the sentence are more difficult than ones related to the end of the sentence due to recency effects.”

      To still investigate if overall accuracy influenced the results we did add accuracy (across participants) to the mixed models. Note that as for the passive task there is no accuracy, we removed the passive task from the analyses. We could also only run models with random intercepts (not random slopes), due to the reduced number of degrees of freedom when adding the factor accuracy to the models.

      For the MI analysis we only found an effect in MTG. Specifically, there was a three-way interaction between task, condition and accuracy (F(2, 91.9) = 3.4591, p = 0.036). To follow up on this three-way interaction we split the data per task. The condition*accuracy interaction was only (uncorrected) significant for the word combination task (F(1,24.8) = 5.296, p = 0.03 (uncorrected)) and not for any other task (p>0.1). In the word combination task, we found that the difference between sentences and word lists was the strongest at high accuracies (see the attached figure for the predicted values of the model). One way to interpret this finding is that stronger phrasal-rate MI tracking in MTG promotes phrasal-rate processing (as indicated by accuracy) more in sentences than in word lists.

      For power we did not find any effect of accuracy. For the connectivity analysis we found in the STG-MTG connectivity a significant condition*accuracy interaction (F(1, 80.23)=5.19, p = 0.025). The condition*accuracy interaction showed that lower accuracies were generally associated with stronger differences between the sentences and word lists (see figure below; the opposite of the MI analysis). Thus, functional connections in the delta band are stronger during sentence processing when participants have difficulty with the task (independent of the task performed). This could indicate that low-frequency connections are more relevant for the sentence than the word list condition.

      MEG – behavioral performance relation. A) Predicted values for the phrasal band MI in the MTG for the word combination task separately for the two conditions. B) Predicted values for the delta band WPLI in the STG-MTG connection separately for the two conditions. Error bars indicate the 95% confidence interval of the fit. Colored lines at the bottom indicate individual datapoints.

      After correcting for accuracy there was also a significant task*condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was a stronger WPLI for the sentence compared to the word list condition, the interaction seemed to indicate that this was especially the case during the word task (p = 0.005 corrected), but not for the other tasks (p>0.1).

      We added the results of the accuracy analyses to the main manuscript and added a dedicated section to our discussion (pages 21-22). Adding accuracy did not remove any of the effects we report in the original analyses. Therefore, none of these findings changes the interpretation of the results, as the task still had an influence on the MI responses of MTG and IFG. The effect of accuracy in the MTG refined the results, showing that the effect was strongest there for participants with high accuracies. This relationship suggests a functional role of tracking through phase alignment for understanding phrasal structure.

      The methods now read: “MEG-behavioural performance analysis: To investigate the relation between the MEG measures and the behavioural performance we repeated the analyses (MI, power, and connectivity) but added accuracy as a factor (together with the interactions with the task and condition factor). As there is no accuracy for the passive task, we removed this task from the analysis. We then followed the same analysis steps as before. Since we reduced our degrees of freedom, we could however only create random intercept and not random slope models”.

      The results now read: “MEG-behavioural performance relation. We found for the MI analysis a significant effect of accuracy only in the MTG. Here, we found a three-way interaction between accuracy*task*condition (F(2, 91.9) = 3.459, p = 0.036). Splitting up for the three different tasks we found only an uncorrected significant effect for the condition*accuracy interaction for the phrasal task (F(1, 24.8) = 5.296, p = 0.03) and not for the other two tasks (p>0.1). In the phrasal task, we found that when accuracy was high, there was a stronger difference between the sentence and the word list condition compared to when accuracy was low, with stronger accuracy for the sentence condition (Figure 7A).

      No relation between accuracy and power was found. For the connectivity analysis we found a significant condition*accuracy interaction for the STG-MTG connection (F(1,80.23) = 5.19, p = 0.025; Figure 7B). Independent of task, when accuracy was low the difference between sentence and word lists was stronger with higher WPLI fits for the sentence condition. After correcting for accuracy there was also a significant task*condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was a stronger WPLI for the sentence compared to the word list condition, the interaction seemed to indicate that this was especially the case during the word task (p = 0.005), but not for the other tasks (p>0.1).”

      The discussion now reads: “We found that across participants both the MI and the connectivity in temporal cortex influenced behavioural performance. Specifically, MTG-STG connections were, independent of task, related to accuracy. There was higher connectivity between MTG and STG for sentences compared to word lists at low accuracies. At high accuracies, we found stronger MTG tracking at phrasal rates (measured with MI) for sentences compared to word lists during the word combination task. These results suggest that indeed tracking of phrasal structure in MTG is relevant to understand sentences compared to word lists. This was reflected in a general increase in delta connectivity differences when the task was difficult (Figure 7B). Participants might compensate for the difficulty using phrasal structure present in the sentence condition. When phrasal structure in sentences is accurately tracked (as measured with MI) performance is better when these rates are relevant (Figure 7A). These results point to a role for phrasal tracking for accurately understanding the higher order linguistic structure in sentences even though more research is needed to verify this. It is evident that the connectivity and tracking correlations to behaviour do not explain all variation in the behavioural performance (compare Figure 1 with 3). Plainly, temporal tracking does not explain everything in language processing. Besides tracking there are many other components important for our designated tasks, such as memory load and semantic context which are not captured by our current analyses.”

      The findings are an important contribution to the ongoing debate in the field whether neuronal tracking at the phrasal time scale indicates linguistic structure processing or more general processes (e.g. chunking).

      Reviewer #3 (Public Review):

      This manuscript presents a MEG study aiming to investigate whether neural tracking of phrasal timescales depends on automatic language processing or specific tasks related to temporal attention. The authors collected MEG data of 20 participants as they listened to naturally spoken sentences or word lists during four different tasks (passive listening vs. syllable task vs. word tasks vs. phrase task). Based on mutual information and Connectivity analysis, the authors found that (1) neural tracking at the phrasal band (0.8-1.1 Hz) was significantly stronger for the sentence condition compared to the word list condition across the classical language network, i.e., STG, MTG, and IFG; (2) neural tracking at the phrasal band was (at least tend significantly) stronger for phrase task than other tasks in the IFG; (3) the IFG-STG connectivity was increased in the delta-band for the phrase task. Ultimately, the authors concluded that neural tracking of phrasal timescales relied on both automatic language processing and specific tasks.

      Overall, this study is trying to tackle an interesting question related to the contributing factors for neural tracking of linguistic structures. The study procedure and analyses are well executed, and the conclusions of this paper are mostly well supported by data. However, I do have several major concerns.

      1. The title of the manuscript uses the description "tracking of hierarchical linguistic structure". In general, hierarchical linguistic structures involve multiple linguistic units with different timescales, such as syllables, words, phrases, and sentences. In this study, however, the main analysis only focused on the phrasal band (0.8-1.1 Hz). It seemed that there was no significant stimulus- or task-effect on the word band or syllabic band (supplementary figures). Therefore, it is highly recommended that the authors modify the related descriptions, or explain why neural tracking of phrases can represent neural tracking of hierarchical linguistic structures in the current study.

      We thank the reviewer for this comment. We meant to refer to the task manipulation directing attention to different levels of representation across the linguistic hierarchy. We have changed the title to “Neural tracking of phrases during spoken language comprehension is automatic and task-dependent.” We hope this resolves any inadvertent confusion we created. Furthermore, throughout the manuscript we now refer to effects occurring for phrasal tracking in low-frequency bands and not across any hierarchical linguistic structure. We agree that our findings cannot speak to any task-dependent effects along the hierarchy, only that at the phrasal level there is a difference between sentences and word lists.

      2. In Methods, the authors employed MI analyses on three frequency bands: 0.8-1.1 Hz for the phrasal band, 1.9-2.8 Hz for the word band, and 3.5-5.0 Hz for the syllabic band (line 191-192). As the timescales of linguistic units are various and overlapped in natural speech, I wonder how the authors define the boundaries of these frequency bands, and whether these bands are proper for the naturally spoken stimuli in the current study. These important details should be clarified.

      The frequency bands of the MI analysis were based on the stimuli, or in other words, are data driven. They reflect the syllabic, word, and phrasal rates in our stimulus set (calculated in Kaufeld et al., 2020). They were calculated by annotating the sentences by syllables, words, and phrases and converting the rate of the linguistic units to frequency ranges. The information has been added to the manuscript. We acknowledge that, unlike in our stimulus set, in natural speech the boundaries of these bands can overlap and now also state this (“While in our stimulus set the boundaries of the linguistic levels did not overlap, in natural speech the brain has an even more difficult task as there is no one-to-one match between band and linguistic unit [26]”, lines 211-213).
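      As a rough illustration of converting annotated unit rates into a frequency range, consider the sketch below. It is not the procedure of Kaufeld et al. (2020): the helper names, the min-max summary, and the annotation values are all assumptions made for this example and are not taken from the stimulus set.

```python
def unit_rate_hz(n_units, duration_s):
    """Average rate (Hz) of a linguistic unit within one stimulus."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_units / duration_s


def band_from_rates(rates):
    """Frequency band spanning the observed rates (min-max summary is an assumption)."""
    return (min(rates), max(rates))


# Hypothetical annotations: (number of phrases, stimulus duration in seconds).
phrase_annotations = [(2, 2.5), (3, 3.0), (2, 2.0)]
phrase_rates = [unit_rate_hz(n, d) for n, d in phrase_annotations]
phrase_band = band_from_rates(phrase_rates)
```

      The same two helpers could be applied to syllable and word annotations to obtain the other two bands; whether the resulting bands overlap then depends entirely on the stimulus material.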

      3. What is missing in the manuscript are the explanations of the correlation between behavioral performance and neural tracking. In Results, the behavioral performance shows significant differences across the active tasks (Figure 1), but the MI differences across the tasks are relatively weak in IFG (Figure 3). In addition, the behavioral performance only shows significant differences between the sentence and word list conditions during the phrasal task, but the MI differences between the conditions are significant in MTG during the syllabic, word, and phrasal tasks. Explanations for these inconsistent results are expected.

      We answer this point together with point 4 below where we analyze the behavioral performance and the MEG responses.

      4. Since the behavioral performance of these active tasks is likely related to the temporal attention to relevant timescales of different linguistic units, I wonder whether there exist underlying neural correlates of behavioral performance (e.g., significant correlation between performance and mutual information). If so, it may be interesting and bring a new bright spot for the current study.

      The influence of task difficulty is a relevant point to investigate (also see point 1 of reviewer 2 and point 4 of reviewer 3). To do so we added accuracy (per participant per condition) as a factor in the mixed model (as well as all interactions with task and condition) for the MI, power, and connectivity analyses at the phrasal rate/delta band. Note that as for the passive task there is no accuracy, we removed the passive task from the analyses. We could also only run models with random intercepts (not random slopes), due to the reduced number of degrees of freedom when adding the factor accuracy to the models.

      For the MI analysis we only found an effect in MTG. Specifically, there was a three-way interaction between task, condition and accuracy (F(2, 91.9) = 3.4591, p = 0.036). To follow up on this three-way interaction we split the data per task. The condition*accuracy interaction was only (uncorrected) significant for the word combination task (F(1,24.8) = 5.296, p = 0.03 (uncorrected)) and not for any other task (p>0.1). In the word combination task, we found that the difference between sentences and word lists was the strongest at high accuracies (see the figure below for the predicted values of the model). One way to interpret this finding is that stronger phrasal-rate MI tracking in MTG promotes phrasal-rate processing (as indicated by accuracy) more in sentences than in word lists.

      MEG – behavioral performance relation. A) Predicted values for the phrasal band MI in the MTG for the word combination task separately for the two conditions. B) Predicted values for the delta band WPLI in the STG-MTG connection separately for the two conditions. Error bars indicate the 95% confidence interval of the fit. Colored lines at the bottom indicate individual datapoints.

      For power we did not find any effect of accuracy. For the connectivity analysis we found in the STG-MTG connectivity a significant condition*accuracy interaction (F(1, 80.23)=5.19, p = 0.025). The condition*accuracy interaction showed that lower accuracies were generally associated with stronger differences between the sentences and word lists (see figure attached; the opposite of the MI analysis). Thus, functional connections in the delta band are stronger during sentence processing when participants have difficulty with the task (independent of the task performed). This could indicate that low-frequency connections are more relevant for the sentence than the word list condition.

      After correcting for accuracy there was also a significant task*condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was a stronger WPLI for the sentence compared to the word list condition, the interaction seemed to indicate that this was especially the case during the word task (p = 0.005 corrected), but not for the other tasks (p>0.1).

      We added the results of the accuracy analyses to the main manuscript and added a dedicated section to our discussion (pages 21-22). Adding accuracy did not remove any of the effects we report in the original analyses. Therefore, none of these findings changes the interpretation of the results, as the task still had an influence on the MI responses of MTG and IFG. The effect of accuracy in the MTG refined the results, showing that the effect was strongest there for participants with high accuracies. This relationship suggests a functional role of tracking through phase alignment for understanding phrasal structure.

      While the findings can explain some behavioral effects, we agree with the reviewer that the behavioral results and the MI results don’t align. We note that our use of tasks to guide attention to different timescales and linguistic representations differs from the use of, for example, a working memory task where only the correct trials contain the relevant cognitive process. In working memory type paradigms, the MEG data should indeed explain the behavioral response. Our study was designed to test for effects of task demands on the neural tracking response to speech and language. As we are only using the tasks to control attention, we do not attempt to explain behavior through the MEG data or differences in MI.

      Thus, the phrasal tracking cannot explain all of the behavioral results (point 3). It is at this point unclear what could have caused this effect, but it is quite likely that neural sources outside the speech and language ROIs we selected are in play. We discuss this now.

      The methods now read: “MEG-behavioural performance analysis: To investigate the relation between the MEG measures and the behavioural performance we repeated the analyses (MI, power, and connectivity) but added accuracy as a factor (together with the interactions with the task and condition factor). As there is no accuracy for the passive task, we removed this task from the analysis. We then followed the same analysis steps as before. Since we reduced our degrees of freedom, we could however only create random intercept and not random slope models”.

      The results now read: “MEG-behavioural performance relation. We found for the MI analysis a significant effect of accuracy only in the MTG. Here, we found a three-way interaction between accuracy*task*condition (F(2, 91.9) = 3.459, p = 0.036). Splitting up for the three different tasks we found only an uncorrected significant effect for the condition*accuracy interaction for the phrasal task (F(1, 24.8) = 5.296, p = 0.03) and not for the other two tasks (p>0.1). In the phrasal task, we found that when accuracy was high, there was a stronger difference between the sentence and the word list condition compared to when accuracy was low, with stronger accuracy for the sentence condition (Figure 7A).

      No relation between accuracy and power was found. For the connectivity analysis we found a significant condition*accuracy interaction for the STG-MTG connection (F(1,80.23) = 5.19, p = 0.025; Figure 7B). Independent of task, when accuracy was low the difference between sentence and word lists was stronger with higher WPLI fits for the sentence condition. After correcting for accuracy there was also a significant task*condition interaction (F(2,80.01) = 3.348, p = 0.040) and a main effect of condition (F(1,80.361) = 5.809, p = 0.018). While overall there was a stronger WPLI for the sentence compared to the word list condition, the interaction seemed to indicate that this was especially the case during the word task (p = 0.005), but not for the other tasks (p>0.1).”

      The discussion now reads: “We found that across participants both the MI and the connectivity in temporal cortex influenced behavioural performance. Specifically, MTG-STG connections were, independent of task, related to accuracy. There was higher connectivity between MTG and STG for sentences compared to word lists at low accuracies. At high accuracies, we found stronger MTG tracking at phrasal rates (measured with MI) for sentences compared to word lists during the word combination task. These results suggest that indeed tracking of phrasal structure in MTG is relevant to understand sentences compared to word lists. This was reflected in a general increase in delta connectivity differences when the task was difficult (Figure 7B). Participants might compensate for the difficulty using phrasal structure present in the sentence condition. When phrasal structure in sentences is accurately tracked (as measured with MI) performance is better when these rates are relevant (Figure 7A). These results point to a role for phrasal tracking for accurately understanding the higher order linguistic structure in sentences even though more research is needed to verify this. It is evident that the connectivity and tracking correlations to behaviour do not explain all variation in the behavioural performance (compare Figure 1 with 3). Plainly, temporal tracking does not explain everything in language processing. Besides tracking there are many other components important for our designated tasks, such as memory load and semantic context which are not captured by our current analyses.”

    1. Author Response:

      Reviewer #1 (Public Review):

      [...] While the study is addressing an interesting topic, I also felt this manuscript was limited in novel findings to take away. Certainly the study clearly shows that substitution saturation is achieved at synonymous CpG sites. However, subsequent main analyses do not really show anything new: the depletion of segregating sites in functional versus neutral categories (Fig 2) has been extensively shown in the literature and polymorphism saturation is not a necessary condition for observing this pattern.

      We agree with the reviewer that many of the points raised were appreciated previously and did not mean to convey another impression. Our aim was instead to highlight some unique opportunities provided by being at or very near saturation for mCpG transitions. In that regard, we note that although depletion of variation in functional categories is to be expected at any sample size, the selection strength that this depletion reflects is very different in samples that are far from saturated, where invariant sites span the entire spectrum from neutral to lethal. Consider the depletion per functional category relative to synonymous sites in the adjoining plot in a sample of 100k: ~40% of mCpG LOF sites do not have T mutations. From our Fig. 4 and b, it can be seen that these sites are associated with a much broader range of hs values than sites invariant at 780k, so that information about selection at an individual site is quite limited (indeed, in our p-value formulation, these sites would be assigned p≤0.35, see Fig. 1). Thus, only now can we really start to tease apart weakly deleterious mutations from strongly deleterious or even embryonic lethal mutations. This allows us to identify individual sites that are most likely to underlie pathogenic mutations and functional categories that harbor deleterious variation at the extreme end of the spectrum of possible selection coefficients. More generally, saturation is useful because it allows one to learn about selection with many fewer untested assumptions than previously feasible.

      Similarly, the diminishing returns on sampling new variable sites has been shown in previous studies, for example the first "large" human datasets ca. 2012 (e.g. Fig 2 in Nelson et al. 2012, Science) have similar depictions as Figure 3B although with smaller sample sizes and different approaches (projection vs simulation in this study).

      We agree completely: diminishing returns is expected on first principles from coalescent theory, which is why we cited a classic theory paper when making that point in the previous version of the manuscript. Nonetheless, the degree of saturation is an empirical question, since it depends on the unknown underlying demography of the recent past. In that regard, we note that Nelson et al. predict that at sample sizes of 400K chromosomes in Europeans, approximately 20% of all synonymous sites will be segregating at least one of three possible alleles, when the observed number is 29%. Regardless, not citing Nelson et al. 2012 was a clear oversight on our part, for which we apologize; we now cite it in that context and in mentioning the multiple merger coalescent.

      There are some simulations presented in Fig 4, but this is more of a hypothetical representation of the site-specific DFE under simulation conditions roughly approximating human demography than formal inference on single sites. Again, these all describe the state of the field quite well, but I was disappointed by the lack of a novel finding derived from exploiting the mutation saturation properties at methylated CpG sites.

      As noted above, in our view, the novelty of our results lies in their leveraging saturation in order to identify sites under extremely strong selection and make inferences about selection without the need to rely on strong, untested assumptions.

      However, we note that Fig 4 is not simply a hypothetical representation, in that it shows the inferred DFE for single mCpG sites for a fixed mutation rate and given a plausible demographic model, given data summarized in terms of three ranges of allele frequency (i.e., = 0, between 1 and 10 copies, or above 10 copies). One could estimate a DFE across all sites from those summaries of the data (i.e., from the proportion of mCpG sites in each of the three frequency categories), by weighting the three densities in Fig 4 by those proportions. That is, in fact, what is done in a recent preprint by Dukler et al. (2021, BioRxiv): they infer the DFE from two summaries of the allele frequency spectrum (in bins of sites), the proportion of invariant sites and the proportion of alleles at 1-70 copies, in a sample of 70K chromosomes.

      To illustrate how something similar could be done with Fig. 4 based on individual sites, we obtain an estimate of the DFE for LOF mutations (shown in Panel B and D for two different prior distributions on hs) by weighting the posterior densities in Panel A by the fraction of LOF mutations that are segregating (73% at 780K; 9% at 15K) and invariant (27% and 91% respectively); in panel C, we show the same for a different choice of prior. For the smaller sample size considered, the posterior distribution recapitulates the prior, because there is little information about selection in whether a site is observed to be segregating or invariant, and particularly about strong selection. In the sample of 780K, there is much more information about selection in a site being invariant and therefore, there is a shift towards stronger selection coefficients for LOF mutations regardless of the prior.
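
      The weighting step described above amounts to a two-component mixture of the per-site posteriors. A minimal sketch of that arithmetic follows; the four-bin grid and both posterior densities are made-up placeholders (not values read from Fig. 4), and the helper names are hypothetical. Only the 73%/27% split at 780K comes from the text.

```python
# Placeholder inputs: the bins and densities are illustrative only,
# NOT the posteriors shown in Fig. 4.

def normalize(density):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(density)
    return [d / total for d in density]

def weighted_dfe(post_segregating, post_invariant, frac_segregating):
    """Mix two per-site posteriors by the observed fractions of sites."""
    frac_invariant = 1.0 - frac_segregating
    return [
        frac_segregating * s + frac_invariant * i
        for s, i in zip(post_segregating, post_invariant)
    ]

# Hypothetical posteriors over four hs bins, ordered weak -> strong selection.
post_seg = normalize([4.0, 3.0, 2.0, 1.0])  # segregating: mass on weak hs
post_inv = normalize([1.0, 2.0, 3.0, 4.0])  # invariant: mass on strong hs

# Fraction of LOF mutations segregating at 780K, as quoted in the text.
dfe_780k = weighted_dfe(post_seg, post_inv, frac_segregating=0.73)
print([round(x, 3) for x in dfe_780k])
```

      For the 15K sample, the same function would be applied with that sample's own posteriors and the 9%/91% split; reusing the 780K posteriors there would not be appropriate, since the posteriors themselves depend on sample size.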

      Our goal was to highlight these points rather than infer a DFE using these two summaries, which throw out much of the information in the data (i.e., the allele frequency differences among segregating sites). In that regard, we note that the DFE inference would be improved by using the allele frequency at each of 1.1 million individual mCpG sites in the exome. We outline this next step in the Discussion but believe it is beyond the scope of our paper, as it is a project in itself – in particular it would require careful attention to robustness with regard to the demographic model (and its impact on multiple hits), biased gene conversion, and variability in mutation rates among mCpG sites. We now make these points explicitly in the Outlook.

      Similarly, I felt the authors posed a very important point about limitations of DFE inference methods in the Introduction but ended up not really providing any new insights into this problem. The authors argue (rightly so) that currently available DFE estimates are limited by both the sparsity of polymorphisms and limited flexibility in parametric forms of the DFE. However, the nonsynonymous human DFE estimates in the literature appear to be surprisingly robust to sample size: older estimates (Eyre-Walker et al. 2006 Genetics, Boyko et al. 2008 PLOS Genetics) seem to at least be somewhat consistent with newer estimates (assuming the same mutation rate) from samples that are orders of magnitude larger (Kim et al. 2017 Genetics).

      We are not quite sure what the reviewer has in mind by “somewhat consistent,” as Boyko et al. estimate that 35% of non-synonymous mutations have s>10^-2 while Kim et al. find that proportion to be “0.38–0.84 fold lower” than the Boyko et al. estimate (see, e.g., Fig. 4 in Kim et al., 2017). Moreover, the preprint by Dukler et al. mentioned above, which infers the DFE based on ~70K chromosomes, finds estimates inconsistent with those of Kim et al. (see SOM Table 2 and SOM Figure S5 in Dukler et al., 2021).

      More generally, given that even 70K chromosomes carry little information about much of the distribution of selection coefficients (see our Fig. 4), we expect that studies based on relatively small sample sizes will basically recover something close to their prior; therefore, they should agree when they use the same or similar parametric forms for the distribution of selection coefficients and disagree otherwise. The dependence on that choice is nicely illustrated in Kim et al., who consider different choices and then perform inference on the same data set and with the same fixed mutation rate for exomes; depending on their choice anywhere between 5%-28% of non-synonymous changes are inferred to be under strong selection with s>=10^-2 (see their Table S4).

      Whether a DFE inferred under polymorphism saturation conditions with different methods is different, and how it is different, is an issue of broad and immediate relevance to all those conducting population genomic simulations involving purifying selection. The analyses presented as Fig 4A and 4B kind of show this, but they are more a demonstration of what information one might have at 1M+ sample sizes rather than an analysis of whether genome-wide nonsynonymous DFE estimates are accurate. In other words, this manuscript makes it clear that a problem exists, that it is a fundamental and important problem in population genetics, and that with modern datasets we are now poised to start addressing this problem with some types of sites, but all of this is already very well-appreciated except for perhaps the last point.

      At least a crude analysis to directly compare the nonsynonymous genome-wide DFE from smaller samples to the 780K sample would be helpful, but it should be noted that these kinds of analyses could be well beyond the scope of the current manuscript. For example, if methylated nonsynonymous CpG sites are under a different level of constraint than other nonsynonymous sites (Fig. S14) then comparing results to a genome-wide nonsynonymous DFE might not make sense and any new analysis would have to try and infer a DFE independently from synonymous/nonsynonymous methylated CpG sites.

      We are not sure what would be learned from this comparison, given that Figure 4 shows that, at least with an uninformative prior, there is little information about the true DFE in samples, even of tens of thousands of individuals. Thus, if some of the genome-wide nonsynonymous DFE estimates based on small sample sizes turn out to be accurate, it will be because the guess about the parametric shape of the DFE was an inspired one. In our view, that is certainly possible but not likely, given that the shape of the DFE is precisely what the field has been aiming to learn and, we would argue, what we are now finally in a position to do for CpG mutations in humans.

      Reviewer #2 (Public Review):

      This manuscript presents a simple and elegant argument that neutrally evolving CpG sites are now mutationally saturated, with each having a 99% probability of containing variation in modern datasets containing hundreds of thousands of exomes. The authors make a compelling argument that for CpG sites where mutations would create genic stop codons or impair DNA binding, about 20% of such mutations are strongly deleterious (likely impairing fitness by 5% or more). Although it is not especially novel to make such statements about the selective constraint acting on large classes of sites, the more novel aspect of this work is the strong site-by-site prediction it makes that most individual sites without variation in UK Biobank are likely to be under strong selection.

      The authors rightly point out that since 99% of neutrally evolving CpG sites contain variation in the data they are looking at, a CpG site without variation is likely evolving under constraint with a p value significance of 0.01. However, a weakness of their argument is that they do not discuss the associated multiple testing problem: in other words, how likely is it that a given non-synonymous CpG site is devoid of variation but actually not under strong selection? Since one of the most novel and useful deliverables of this paper is single-base-pair-resolution predictions about which sites are under selection, such a multiple testing correction would provide important "error bars" for evaluating how likely it is that an individual CpG site is actually constrained, not just the proportion of constrained sites within a particular functional category.

      We thank the reviewer for pointing this out. One way to think about this problem might be in terms of false discovery rates, in which case the FDR would be 16% across all non-synonymous mCpG sites that are invariant in current samples, and ~4% for the subset of those sites where mutations lead to loss-of-function of genes.

      Another way to address this issue, which we had included but not emphasized previously, is by examining how one’s beliefs about selection should be updated after observing a site to be invariant (i.e., using Bayes odds). At current sample sizes and assuming our uninformative prior, for a non-synonymous mCpG site that does not have a C>T mutation, the Bayes odds are 15:1 in favor of hs>0.5x10^-3; thus the chance that such a site is not under strong selection is 1/16, given our prior and demographic model. These two approaches (FDR and Bayes odds) are based on somewhat distinct assumptions.
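
      The step from odds of 15:1 to a probability of 1/16 is the usual odds-to-probability conversion, sketched below. The function name is hypothetical, and the sketch follows the text in reading the stated odds directly as the odds for versus against strong selection.

```python
from fractions import Fraction

def prob_against(odds_for, odds_against=1):
    """Probability of the disfavoured hypothesis given odds_for:odds_against."""
    return Fraction(odds_against, odds_for + odds_against)

# Odds of 15:1 in favour of hs > 0.5x10^-3, as quoted in the text.
p_not_strong = prob_against(15)
print(p_not_strong, float(p_not_strong))  # 1/16 0.0625
```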

      We have now added and/or emphasized these two points in the main text.

      The paper provides a comparison of their functional predictions to CADD scores, an older machine-learning-based attempt at identifying site by site constraint at single base pair resolution. While this section is useful and informative, I would have liked to see a discussion of the degree to which the comparison might be circular due to CADD's reliance on information about which sites are and are not variable. I had trouble assessing this for myself given that CADD appears to have used genetic variation data available a few years ago, but obviously did not use the biobank scale datasets that were not available when that work was published.

      We apologize for the lack of clarity in the presentation. We meant to emphasize that de novo mutation rates vary across CADD deciles when considering all CpG sites (Fig. 2-figure supplement 5c), which confounds CADD precisely because it is based in part on which sites are variable. We have edited the manuscript to clarify this.

      Reading this paper left me excited about the possibility of examining individual invariant CpG sites and deducing how many of them are already associated with known disease phenotypes. I believe the paper does not mention how many of these invariant sites appear in Clinvar or in databases of patients with known developmental disorders, and I wondered how close to saturation disease gene databases might be given that individuals with developmental disorders are much more likely to have their exomes sequenced compared to healthy individuals. One could imagine some such analyses being relatively low hanging fruit that could strengthen the current paper, but the authors also make several references to a companion paper in preparation that deals more directly with the problem of assessing clinical variant significance. This is a reasonable strategy, but it does give the discussion section of the paper somewhat of a "to be continued" feel.

      We apologize for the confusion that arose from our references to a second manuscript in prep. The companion paper is not a continuation of the current manuscript: it contains an analysis of fitness and pathogenic effects of loss-of-function variation in human exomes.

      Following the reviewer’s suggestion to address the clinical significance of our results, we have now examined the relationship of mCpG sites invariant in current samples with Clinvar variants. We find that of the approximately 59,000 non-synonymous mCpG sites that are invariant, only ~3.6% overlap with C>T variants associated with at least one disease and classified as likely pathogenic in Clinvar (~5.8% if we include those classified as uncertain or with conflicting evidence as pathogenic). Approximately 2% of invariant mCpGs have C>T mutations in what is, to our knowledge, the largest collection of de novo variants ascertained in ~35,000 individuals with developmental disorders (DDD, Kaplanis et al. 2020). At the level of genes, of the 10k genes that have at least one invariant non-synonymous mCpG, only 8% (11% including uncertain variants) have any non-synonymous hits in Clinvar, and ~8% in DDD. We think it highly unlikely that the large number of remaining invariant sites are not seen with mutations in these databases because such mutations are lethal; rather it seems to us to be the case that these disease databases are far from saturation as they contain variants from a relatively small number of individuals, are subject to various ascertainment biases both at the variant level and at the individual level, and only contain data for a small subset of existing severe diseases.

      With a view to assessing clinical relevance however, we can ask a related question, namely how informative being invariant in a sample of 780k is about pathogenicity in Clinvar. Although the relationship between selection and pathogenicity is far from straightforward, being an invariant non-synonymous mCpG in current samples not only substantially increases (15-10fold) the odds of hs > 0.5x10^-3 (see Fig. 4b), it also increases the odds of being classified as pathogenic vs. benign in Clinvar 8-51 fold. In the DDD sample, we don’t know which variants are pathogenic; however, if we consider non-synonymous mutations that occur in consensus DDD genes as pathogenic (a standard diagnostic criterion), being invariant increases the odds of being classified as pathogenic 6-fold. We caution that both Clinvar classifications and the identification of consensus genes in DDD rely in part on whether a site is segregating in datasets like ExAC, so this exercise is somewhat circular. Nonetheless it illustrates that there is some information about clinical importance in mCpG sites that are invariant in current samples, and that the degree of enrichment (6 to 51-fold) is very roughly on par with the Bayes odds that we estimate of strong selection conditional on a site being invariant. We have added these findings to the main text and added the plot as Supplementary Figure 13.

      Reviewer #3 (Public Review):

      [...] The authors emphasize several times how important an accurate demographic model is. While we may be close to a solid demographic model for humans, this is certainly not the case for many other organisms. Yet we are not far off from sufficient sample sizes in a number of species to begin to reach saturation. I found myself wondering how different the results/inference would be under a different model of human demographic history. Though likely the results would be supplemental, it would be nice in the main text to be able to say something about whether results are qualitatively different under a somewhat different published model.

      We had previously examined the effect of a few demographic scenarios with large increases in population size towards the present on the average length of the genealogy of a sample (and hence the expected number of mutations at a site) in Figure 3-figure supplement 1b, but without quantifying the effect on our selection inference. Following this suggestion, we now consider a widely used model of human demography inferred from a relatively small sample, and therefore not powered to detect the huge increase in population size towards the present (Tennessen et al. 2012). Using this model, we find a poor fit to the proportion of segregating CpG sites (the observed fraction is 99% in 780k exomes, when the model predicts 49%). Also, as expected, inferences about selection depend on the accuracy of the demographic model (as can be seen by comparing panel B to Fig 4B in the main text).

      On a similar note, while a fixed hs simplifies much of the analysis, I wondered how results would differ for 1) completely recessive mutations and 2) under a distribution of dominance coefficients, especially one in which the most deleterious alleles were more recessive. Again, though I think it would strengthen the manuscript by no means do I feel this is a necessary addition, though some discussion of variation in dominance would be an easy and helpful add.

      There's some discussion of population structure, but I also found myself wondering about GxE. That is, another reason a variant might be segregating is that it's conditionally neutral in some populations and only deleterious in a subset. I think no analysis to be done here, but perhaps some discussion?

      We agree that our analysis ignores the possibilities of complete recessivity in fitness (h=0) as well as more complicated selection scenarios, such as spatially-varying selection (of the type that might be induced by GxE). We note however that so long as there are any fitness effects in heterozygotes, the allele dynamics will be primarily governed by hs; one might also imagine that under some conditions, the mean selection effect across environments would predict allele dynamics reasonably well even in the presence of GxE. Also worth exploring in our view is the standard assumption that hs remains fixed even as Ne changes dramatically. We now mention these points in the Outlook.

      Maybe I missed it, but I don't think the acronym DNM is explained anywhere. While it was fairly self-explanatory, I did have a moment of wondering whether it was methylation or mutation and can't hurt to be explicit.

      We apologize for the oversight and have updated the text accordingly.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors evaluate the involvement of the hippocampus in a fast-paced time-to-contact estimation task. They find that the hippocampus is sensitive to feedback received about accuracy on each trial and has activity that tracks behavioral improvement from trial to trial. Its activity is also related to a tendency for time estimation behavior to regress to the mean. This is a novel paradigm to explore hippocampal activity and the results are thus novel and important, but the framing as well as discussion about the meaning of the findings obscures the details of the results or stretches beyond them in many places, as detailed below.

      We thank the reviewer for their constructive feedback and were happy to read that s/he considered our approach and results as novel and important. The comments led us to conduct new fMRI analyses, to clarify various unclear phrasings regarding our methods, and to carefully assess our framing of the interpretation and scope of our results. Please find our responses to the individual points below.

      1) Some of the results appear in the posterior hippocampus and others in the anterior hippocampus. The authors do not motivate predictions for anterior vs. posterior hippocampus, and they do not discuss differences found between these areas in the Discussion. The hippocampus is treated as a unitary structure carrying out learning and updating in this task, but the distinct areas involved motivate a more nuanced picture that acknowledges that the same populations of cells may not be carrying out the various discussed functions.

      We thank the reviewer for pointing this out. We split the hippocampus into anterior and posterior sections because prior work suggested a different whole-brain connectivity and function of the two. This was mentioned in the methods section (page 15) in the initial submission but unfortunately not in the main text. Moreover, when discussing the results, we did indeed refer mostly to the hippocampus as a unitary structure for simplicity and readability, and because statements about subcomponents are true for the whole. However, we agree with the reviewer that the differences between anterior and posterior sections are very interesting, and that describing these effects in more detail might help to guide future work more precisely.

      In response to the reviewer's comment, we therefore clarified at various locations throughout the manuscript whether the respective results were observed in the posterior or anterior section of the hippocampus, and we extended our discussion to reflect the idea that different functions may be carried out by distinct populations of hippocampal cells. In addition, we also now motivate the split into the different sections better in the main text. We made the following changes.

      Page 3: “Second, we demonstrate that anterior hippocampal fMRI activity and functional connectivity tracks the behavioral feedback participants received in each trial, revealing a link between hippocampal processing and timing-task performance.”

      Page 3: “Fourth, we show that these updating signals in the posterior hippocampus were independent of the specific interval that was tested and activity in the anterior hippocampus reflected the magnitude of the behavioral regression effect in each trial.”

      Page 5: “We performed both whole-brain voxel-wise analyses as well as regions-of-interest (ROI) analysis for anterior and posterior hippocampus separately, for which prior work suggested functional differences with respect to their contributions to memory-guided behavior (Poppenk et al., 2013, Strange et al. 2014).”

      Page 9: “Because anterior and posterior sections of the hippocampus differ in whole-brain connectivity as well as in their contributions to memory-guided behavior (Strange et al. 2014), we analyzed the two sections separately.”

      Page 9: “We found that anterior hippocampal activity as well as functional connectivity reflected the feedback participants received during this task, and its activity followed the performance improvements in a temporal-context-dependent manner. Its activity reflected trial-wise behavioral biases towards the mean of the sampled intervals, and activity in the posterior hippocampus signaled sensorimotor updating independent of the specific intervals tested.”

      Page 10: “Intriguingly, the mechanisms at play may build on similar temporal coding principles as those discussed for motor timing (Yin & Troger, 2011; Eichenbaum, 2014; Howard, 2017; Palombo & Verfaellie, 2017; Nobre & van Ede, 2018; Paton & Buonomano, 2018; Bellmund et al., 2020, 2021; Shikano et al., 2021; Shimbo et al., 2021), with differential contributions of the anterior and posterior hippocampus. Note that our observation of distinct activity modulations in the anterior and posterior hippocampus suggests that the functions and coding principles discussed here may be mediated by at least partially distinct populations of hippocampal cells.”

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence [...]

      2) Hippocampal activity is stronger for smaller errors, which makes the interpretation more complex than the authors acknowledge. If the hippocampus is updating sensorimotor representations, why would its activity be lower when more updating is needed?

      Indeed, we found that absolute (univariate) activity of the hippocampus scaled with feedback valence, the inverse of error (Fig. 2A). We see multiple possibilities for why this might be the case, and we discussed some of them in a dedicated discussion section (“The role of feedback in timed motor actions”). For example, prior work showed that hippocampal activity reflects behavioral feedback also in other tasks, which has been linked to learning (e.g. Schönberg et al., 2007; Cohen & Ranganath, 2007; Shohamy & Wagner, 2008; Foerde & Shohamy, 2011; Wimmer et al., 2012). In our understanding, sensorimotor updating is a form of ‘learning’ in an immediate and behaviorally adaptive manner, and we therefore consider our results well consistent with this earlier work. We agree with the reviewer that in principle activity should be stronger if there was stronger sensorimotor updating, but we acknowledge that this intuition builds on an assumption about the relationship between hippocampal neural activity and the BOLD signal, which is not entirely clear. For example, prior work revealed spatially informative negative BOLD responses in the hippocampus as a function of visual stimulation (e.g. Szinte & Knapen 2020), and the effects of inhibitory activity - a leading motif in the hippocampal circuitry - on fMRI data are not fully understood. This raises the possibility that the feedback modulation we observed might also involve negative BOLD responses, which would then translate to the observed negative correlation between feedback valence and the hippocampal fMRI signal, even if the magnitude of the underlying updating mechanism was positively correlated with error. This complicates the interpretation of the direction of the effect, which is why we chose to avoid making strong conclusions about it in our manuscript. Instead, we tried discussing our results in a way that was agnostic to the direction of the feedback modulation. Importantly, hippocampal connectivity with other regions did scale positively with error (Fig. 2B), which we again discussed in the dedicated discussion section.

      In response to the reviewer’s comment, we revisited this section of our manuscript and felt the latter result deserved a better discussion. We therefore took this opportunity to extend our discussion of the connectivity results (including their relationship to the univariate-activity results as well as the direction of these effects), all while still avoiding strong conclusions about directionality. The following changes were made to the manuscript.

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence, unlike its absolute activity, which scaled positively with feedback valence (Fig. 2A,B), suggesting that the two measures may be sensitive to related but distinct processes.

      Page 11: Such network-wide receptive-field re-scaling likely builds on a re-weighting of functional connections between neurons and regions, which may explain why anterior hippocampal connectivity correlated negatively with feedback valence in our data. Larger errors may have led to stronger re-scaling, which may be grounded in a corresponding change in functional connectivity.

      3) Some tests were one-tailed without justification, which reduces confidence in the robustness of the results.

      We thank the reviewer for pointing us to the fact that our choice of statistical tests was not always clear in the manuscript. In the analysis the reviewer is referring to, we predicted that stronger sensorimotor updating should lead to stronger activity as well as larger behavioral improvements across the respective trials. This is because a stronger update should translate to a more accurate “internal model” of the task and therefore to a better performance. We tested this one-sided hypothesis using the appropriate test statistic (contrasting trials in which behavioral performance did improve versus trials in which it did not improve), but we did not motivate our reasoning well enough in the manuscript. The revised manuscript therefore includes the two new statements shown below to motivate our choice of test statistic more clearly.

      Page 7: [...] we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse (see methods for details). Because stronger sensorimotor updating should lead to larger performance improvements, we predicted to find stronger activity for improvements vs. no improvements in these tests (one-tailed hypothesis).

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively. Because we predicted to find stronger activity for improvements vs. no improvements in behavioral performance, we here performed one-tailed statistical tests, consistent with the direction of this hypothesis. Improvement in performance was defined as receiving feedback of higher valence than in the corresponding previous trial.
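
      The improvement definition quoted above can be sketched as a simple per-trial labelling. This is one possible reading, not the authors' code: it assumes that "corresponding previous trial" means the immediately preceding trial for the target-TTC-independent test and the most recent trial with the same target TTC for the target-TTC-specific test. All function and variable names are hypothetical.

```python
def improved_vs_previous(valence):
    """Target-independent: compare each trial with the one right before it.

    The first trial has no predecessor and is labelled None.
    """
    return [None] + [cur > prev for prev, cur in zip(valence, valence[1:])]

def improved_vs_previous_same_target(valence, target):
    """Target-specific: compare with the most recent trial of the same target."""
    last_seen = {}
    flags = []
    for v, t in zip(valence, target):
        flags.append(v > last_seen[t] if t in last_seen else None)
        last_seen[t] = v
    return flags

# Toy data: feedback valence per trial (higher = better) and target label.
valence = [1, 3, 2, 3, 1]
target = ["a", "b", "a", "b", "a"]

independent = improved_vs_previous(valence)
specific = improved_vs_previous_same_target(valence, target)
```

      Trials labelled True would enter the "improved" condition and trials labelled False the "not improved or worse" condition; trials labelled None have no reference trial and would be left out of the contrast.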

      4) The introduction motivates the novelty of this study based on the idea that the hippocampus has traditionally been thought to be involved in memory at the scale of days and weeks. However, as is partially acknowledged later in the Discussion, there is an enormous literature on hippocampal involvement in memory at a much shorter timescale (on the order of seconds). The novelty of this study is not in the timescale as much as in the sensorimotor nature of the task.

      We thank the reviewer for this helpful suggestion. We agree that a key part of the novelty of this study is the use of the task that is typically used to study sensorimotor integration and timing rather than hippocampal processing, along with the new insights this task enabled about the role of the hippocampus in sensorimotor updating. As mentioned in the discussion, we also agree with the reviewer that there is prior literature linking hippocampal activity to mnemonic processing on short time scales. We therefore rephrased the corresponding section in the introduction to put more weight on the sensorimotor nature of our task instead of the time scales.

      Note that the new statement still includes the time scale of the effects, but that it is no longer at the center of the argument. We chose to keep it in because we do think that the majority of studies on hippocampal-dependent memory functions focus on longer time scales than our study does, and we expect that many readers will be surprised about the immediacy of how hippocampal activity relates to ongoing behavioral performance (on ultrashort time scales).

      We changed the introduction to the following.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions. Importantly, the hippocampus is not traditionally thought to support sensorimotor functions, and its contributions to memory formation are typically discussed for longer time scales (hours, days, weeks). Here, however, we characterize in detail the relationship between hippocampal activity and real-time behavioral performance in a fast-paced timing task, which is traditionally believed to be hippocampal-independent. We propose that the capacity of the hippocampus to encode statistical regularities of our environment (Doeller et al. 2005, Shapiro et al. 2017, Behrens et al., 2018; Momennejad, 2020; Whittington et al., 2020) situates it at the core of a brain-wide network balancing specificity vs. regularization in real time as the relevant behavior is performed.

      5) The authors used three different regressors for the three feedback levels, as opposed to a parametric regressor indexing the level of feedback. The predictions are parametric, so a parametric regressor would be a better match, and would allow for the use of all the medium-accuracy data.

      The reviewer raises a good point that overlaps with question 3 by reviewer 2. In the current analysis, we model the three feedback levels with three independent regressors (high, medium, low accuracy). We then contrast high vs. low accuracy feedback, obtaining the results shown in Fig. 2AB. The beta estimates obtained for medium-accuracy feedback are ignored in this contrast. Following the reviewer’s feedback, we therefore re-ran the model, this time modeling all three feedback levels in one parametric regressor. All other regressors in the model stayed the same. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor.

      The results we observed were highly consistent across the two analyses, and all conclusions presented in the initial manuscript remain unchanged. While the exact t-scores differ slightly, we replicated the effects for all clusters on the voxel-wise map (on whole-brain FWE-corrected levels) as well as for the regions-of-interest analysis for anterior and posterior hippocampus. These results are presented in a new Supplementary Figure 3C.

      Note that the new Supplementary Figure 3B shows another related new analysis we conducted in response to question 4 of reviewer 2. Here, we re-ran the initial analysis with three feedback regressors, but without modeling the inter-trial interval (ITI) and the inter-session interval (ISI, i.e. the breaks participants took) to avoid model over-specification. Again, we replicated the results for all clusters and the ROI analysis, showing that the initial results we presented are robust.

      The following additions were made to the manuscript.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d = -0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d = -0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d = -0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d = -0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. 2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.
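      To make the difference between the two feedback models concrete, the following is a minimal sketch of how three indicator regressors and a single parametric regressor could be coded per trial. The trial sequence, the 0/1/2 level coding, and the mean-centering step are illustrative assumptions; the sketch omits the HRF convolution and does not reproduce the SPM12 implementation used in the study.

```python
import numpy as np

# Hypothetical feedback level per trial: 0 = low, 1 = medium, 2 = high accuracy.
feedback = np.array([2, 0, 1, 2, 1, 0, 2, 1])

# Three independent regressors: one indicator column per feedback level.
# A high-vs-low contrast on these columns leaves the medium column unused.
indicators = np.stack(
    [(feedback == level).astype(float) for level in (0, 1, 2)], axis=1
)

# One parametric regressor: its height scales with the feedback level, so
# medium-accuracy trials contribute to the estimate as well. Mean-centering
# (an assumption here) keeps it orthogonal to a constant term.
parametric = feedback - feedback.mean()

print("trials per level:", indicators.sum(axis=0))
print("parametric regressor:", parametric)
```

      In the indicator coding, the medium-accuracy trials only enter the statistics if a contrast includes their column, whereas the parametric coding uses every trial in a single beta estimate.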

      6) The authors claim that the results support the idea that the hippocampus is finding an "optimal trade-off between specificity and regularization". This seems overly speculative given the results presented.

      We understand the reviewer's skepticism about this statement and agree that the manuscript does not show that the hippocampus is finding the trade-off between specificity and regularization. However, this is also not exactly what the manuscript claims. Instead, it suggests that the hippocampus “may contribute” to solving this trade-off (page 3) as part of a “brain-wide network” (pages 2, 3, 9, 12). We also state that “Our [...] results suggest that this trade-off [...] is governed by many regions, updating different types of task information in parallel” (Page 11). To us, these phrasings are not equivalent, because we do not think that the role of the hippocampus in sensorimotor updating (or in any process really) can be understood independently from the rest of the brain. We do however think that our results are in line with the idea that the hippocampus contributes to solving this trade-off, and that this is exciting and surprising given the sensorimotor nature of our task, the ultrashort time scale of the underlying process, and the relationship to behavioral performance. We tried to express that some of the points discussed remain speculation, but it seems that we were not always successful in doing so in the initial submission. We apologize for the misunderstanding, have adapted the corresponding statements in the manuscript, and now express even more carefully that these ideas are speculation.

      The following changes were made to the introduction and discussion.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      Page 12: This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 13: This is in line with the notion that the hippocampus [...] supports finding an optimal trade off between specificity and regularization along with other regions. [...] Our results show that the hippocampus supports rapid and feedback-dependent updating of sensorimotor representations, suggesting that it is a central component of a brain-wide network balancing task specificity vs. regularization for flexible behavior in humans.

      Note that in response to comment 1 by reviewer 2, the revised manuscript now reports the results of additional behavioral analyses that support the notion that participants find an optimal trade-off between specificity and regularization over time (independent of whether the hippocampus was involved or not).

      7) The authors find that hippocampal activity is related to behavioral improvement from the prior trial. This seems to be a simple learning effect (participants can learn plenty about this task from a prior trial that does not have the exact same timing as the current trial) but is interpreted as sensitivity to temporal context. The temporal context framing seems too far removed from the analyses performed.

      We agree with the reviewer that our observation that hippocampal activity reflects TTC-independent behavioral improvements across trials could have multiple explanations. Critically, i) one of them is that the hippocampus encodes temporal context, ii) it is only one of multiple observations that we build our interpretation on, and iii) our interpretation builds on multiple earlier reports.

      Interval estimates regress toward the mean of the sampled intervals, an effect that is often referred to as the “regression effect”. This effect, which we observed in our data too (Fig. 1B), has been proposed to reflect the encoding of temporal context (e.g. Jazayeri & Shadlen 2010). Moreover, there is a large body of literature on how the hippocampus may support the encoding of spatial and temporal context (e.g. see Bellmund, Polti & Doeller 2020 for review).

      Because both hippocampal activity and the regression effect were linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. If so, one would expect that hippocampal activity should reflect behavioral improvements independently from TTC, it should reflect the magnitude of the regression effect, and it should generally reflect feedback, because it is the feedback that informs the updating of the internal prior.

      All three observations may indeed have independent explanations, but they are all also in line with the idea that the hippocampus does encode temporal context and that this explains the relationship between hippocampal activity and the regression effect. In our opinion, it is therefore a parsimonious and reasonable explanation, even though it necessarily remains an interpretation. Of course, we want to be clear on what our results are and what our interpretations are.

      In response to the reviewer’s comment, we therefore toned down two of the statements that mention temporal context in the manuscript, and we removed an overly speculative statement from the result section. In addition, the discussion now describes more clearly how our results are in line with this interpretation.

      Abstract: This is in line with the idea that the hippocampus supports the rapid encoding of temporal context even on short time scales in a behavior-dependent manner.

      Page 13: This is in line with the notion that the hippocampus encodes temporal context in a behavior-dependent manner, and that it supports finding an optimal trade off between specificity and regularization along with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      The following statement was removed, overlapping with comment 2 by Reviewer 3:

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      8) I am not sure the term "extraction of statistical regularities" is appropriate. The term is typically used for more complex forms of statistical relationships.

      We agree with the reviewer that this expression may be interpreted differently by different readers and are grateful to have this pointed out. We therefore removed it and instead added the following (hopefully less ambiguous) statement to the manuscript.

      Page 9: This study investigated how the human brain flexibly updates sensorimotor representations in a feedback-dependent manner in the service of timing behavior.

      Reviewer #2 (Public Review):

      The authors conducted a study involving functional magnetic resonance imaging and a time-to-contact estimation paradigm to investigate the contribution of the human hippocampus (HPC) to sensorimotor timing, with a particular focus on the involvement of this structure in specific vs. generalized learning. Suggestive of the former, it was found that HPC activity reflected time interval-specific improvements in performance while in support of the latter, HPC activity was also found to signal improvements in performance, which were not specific to the individual time intervals tested. Based on these findings, the authors suggest that the human HPC plays a key role in the statistical learning of temporal information as required in sensorimotor behaviour.

      By considering two established functions of the HPC (i.e., temporal memory and generalization) in the context of a domain that is not typically associated with this structure (i.e., sensorimotor timing), this study is potentially important, offering novel insight into the involvement of the HPC in everyday behaviour. There is much to like about this submission: the manuscript is clearly written and well-crafted, the paradigm and analyses are well thought out and creative, the methodology is generally sound, and the reported findings push us to consider HPC function from a fresh perspective. A relative weakness of the paper is that it is not entirely clear to what extent the data, at least as currently reported, reflects the involvement of the HPC in specific and generalized learning. Since the authors' conclusions centre around this observation, clarifying this issue is, in my opinion, of primary importance.

      We thank the reviewer for these positive and extremely helpful comments, which we will address in detail below. In response to these comments, the revised manuscript clarifies why the observed performance improvements are not at odds with the idea that an optimal trade-off between specificity and regularization is found, and how the time course of learning relates to those reported in previous literature. In addition, we conducted two new fMRI analyses, ensuring that our conclusions remain unchanged even if feedback is modeled with one parametric regressor, and if the number of nuisance regressors is reduced to control for overparameterization of the model. Please find our responses underneath each individual point below.

      1) Throughout the manuscript, the authors discuss the trade-off between specific and generalized learning, and point towards Figure S1D as evidence for this (i.e., participants with higher TTC accuracy exhibited a weaker regression effect). What appears to be slightly at odds with this, however, is the observation that the deviation from true TTC decreased with time (Fig S1F) as the regression line slope approached 0.5 (Fig S1E) - one would have perhaps expected the opposite i.e., for deviation from true TTC to increase as generalization increases. To gain further insight into this, it would be helpful to see the deviation from true TTC plotted for each of the four TTC intervals separately and as a signed percentage of the target TTC interval (i.e., (+) or (-) deviation) rather than the absolute value.

      We thank the reviewer for raising this important question and for the opportunity to elaborate on the relationship between the TTC error and the magnitude of the regression effect in behavior. Indeed, we see that the regression slopes approach 0.5 and that the TTC error decreases over the course of the experiment. We do not think that these two observations are at odds with each other for the following reasons:

      First, while the reviewer is correct in pointing out that the deviation from the TTC should increase as “generalization increases”, that is not what we found. It was not the magnitude of the regularization per se that increased over time, but the overall task performance became more optimal in the face of both objectives: specificity and generalization. This optimum is at a regression-line slope of 0.5. Generalization (or regularization, as we refer to it in the present manuscript) therefore did not increase per se on the group level.

      Second, the regression slopes approached 0.5 on the group-level, but the individual participants approached this level from different directions: Some of them started with a slope value close to 1 (high accuracy), whereas others started with a slope value close to 0 (near full regression to the mean). Irrespective of which slope value they started with, over time, they got closer to 0.5 (Rebuttal Figure 1A). This can also be seen in the fact that the group-level standard deviation in regression slopes becomes smaller over the course of the experiment (Rebuttal Figure 1B, SFig 1G). It is therefore not generally the case that the regression effect becomes stronger over time, but that it becomes more optimal for longer-term behavioral performance, which is then also reflected in an overall decrease in TTC error. Please see our response to the reviewer’s second comment for more discussion on this.
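      The interpretation of the slope values can be illustrated with a minimal sketch. It assumes that each estimate is a weighted mix of the true TTC and the mean of the sampled TTCs, and that the slope is obtained with an ordinary least-squares line fit; the target values, the weight `w`, and the function name are hypothetical and are not taken from the study.

```python
import numpy as np

def regression_slope(true_ttc, estimated_ttc):
    """Least-squares slope of estimated TTC regressed on true TTC."""
    slope, _intercept = np.polyfit(true_ttc, estimated_ttc, deg=1)
    return slope

# Hypothetical target TTCs in seconds, each repeated over ten trials.
true_ttc = np.tile([0.5, 0.8, 1.1, 1.4], 10)

def simulate_estimates(w):
    """Noise-free estimates: w = 1 is veridical, w = 0 is full regression to the mean."""
    return w * true_ttc + (1.0 - w) * true_ttc.mean()

for w in (1.0, 0.5, 0.0):
    print(w, regression_slope(true_ttc, simulate_estimates(w)))
```

      Under these assumptions the fitted slope recovers the weight `w`, so a slope near 1 corresponds to estimates on the diagonal and a slope near 0 to estimates collapsed onto the mean.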

      Third, the development of task performance is a function of two behavioral factors: a) the accuracy and b) the precision in TTC estimation. Accuracy describes how similar the participant’s TTC estimates were to the true TTC, whereas precision describes how similar the participant’s TTC estimates were relative to each other (across trials). Our results reflect the fact that, on average, participants became both more accurate and more precise over time. To demonstrate this point visually, we have now plotted the Precision and the Accuracy for the 8 task segments below (Rebuttal Figure 1C, SFig 1H), showing that both measures increased as time progressed and more trials were performed. This was the case for all target durations.

      In response to the reviewer’s comment, we clarified in the main text that these findings are not at odds with each other. Furthermore, we made clear that regularization per se did not increase over time on group level. We added additional supporting figures to the supplementary material to make this point. Note that in our view, these new analyses and changes more directly address the overall question the reviewer raised than the figure that was suggested, which is why we prioritized those in the manuscript.

      However, we appreciated the suggestion a lot and added the corresponding figure for the sake of completeness.

      The following additions were made.

      Page 5: In support of this, participants' regression slopes converged over time towards the optimal value of 0.5, i.e. the slope value between veridical performance and the grand mean (Fig. S1F; linear mixed-effects model with task segment as a predictor and participants as the error term, F(1) = 8.172, p = 0.005, ε² = 0.08, CI: [0.01, 0.18]), and participants' slope values became more similar (Fig. S1G; linear regression with task segment as predictor, F(1) = 6.283, p = 0.046, ε² = 0.43, CI: [0, 1]). Consequently, this also led to an improvement in task performance over time on group level (i.e. task accuracy and precision increased (Fig. S1I), and the relationship between accuracy and precision became stronger (Fig. S1H); linear mixed-effects model results for accuracy: F(1) = 15.127, p = 1.3 × 10⁻⁴, ε² = 0.06, CI: [0.02, 0.11]; precision: F(1) = 20.189, p = 6.1 × 10⁻⁵, ε² = 0.32, CI: [0.13, 1]; accuracy-precision relationship: F(1) = 8.288, p = 0.036, ε² = 0.56, CI: [0, 1]; see methods for model details).

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 15: We also corroborated this effect by measuring the dispersion of slope values between participants across task segments using a linear regression model with task segment as a predictor and the standard deviation of slope values across participants as the dependent variable (Fig. S1G). As a measure of behavioral performance, we computed two variables for each target-TTC level: sensorimotor timing accuracy, defined as the absolute difference in estimated and true TTC, and sensorimotor timing precision, defined as coefficient of variation (standard deviation of estimated TTCs divided by the average estimated TTC). To study the interaction between these two variables for each target TTC over time, we first normalized accuracy by the average estimated TTC in order to make both variables comparable. We then used a linear mixed-effects model with precision as the dependent variable, task segment and normalized accuracy as predictors and target TTC as the error term. In addition, we tested whether accuracy and precision increased over the course of the experiment using separate linear mixed-effects models with task segment as predictor and participants as the error term.
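      The two performance measures defined above can be sketched as follows. The function names, the averaging of absolute errors across trials, the use of the sample standard deviation (`ddof=1`), and the example values are assumptions made for illustration and may differ from the analysis code used in the study.

```python
import numpy as np

def timing_accuracy(estimates, true_ttc):
    """Mean absolute difference between estimated and true TTC (lower = better)."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean(np.abs(estimates - true_ttc))

def timing_precision(estimates):
    """Coefficient of variation: SD of the estimates divided by their mean."""
    estimates = np.asarray(estimates, dtype=float)
    return estimates.std(ddof=1) / estimates.mean()

def normalized_accuracy(estimates, true_ttc):
    """Accuracy divided by the mean estimate, to make it comparable to precision."""
    return timing_accuracy(estimates, true_ttc) / np.mean(estimates)

# Hypothetical estimates (in seconds) for one target TTC of 1.0 s.
estimates = [0.9, 1.1, 1.0, 1.2]
print(timing_accuracy(estimates, 1.0))
print(timing_precision(estimates))
print(normalized_accuracy(estimates, 1.0))
```

      Both measures are error-like quantities in this form, so smaller values indicate more accurate and more precise timing.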

      2) Generalization relies on prior experience and can be relatively slow to develop as is the case with statistical learning. In Jazayeri and Shadlen (2010), for instance, learning a prior distribution of 11-time intervals demarcated by two briefly flashed cues (compared to 4 intervals associated with 24 possible movement trajectories in the current study) required ~500 trials. I find it somewhat surprising, therefore, that the regression line slope was already relatively close to 0.5 in the very first segment of the task. To what extent did the participants have exposure to the task and the target intervals prior to entering the scanner?

      We thank the reviewer for raising the important question about the time course of learning in our task and how our results relate to prior work on this issue. Addressing the specific reviewer question first, participants practiced the task for 2-3 minutes prior to scanning. During the practice, they were not specifically instructed to perform the task as well as they could nor to encode the intervals, but rather to familiarize themselves with the general experimental setup and to ask potential questions outside the MRI machine. While they might have indeed started encoding the prior distribution of intervals during the practice already, we have no way of knowing, and we expect the contribution of this practice on the time course of learning during scanning to be negligible (for the reasons outlined above).

      However, in addition to the specific question the reviewer asked, we feel that the comment raises two more general points: 1) How long does it take to learn the prior distribution of a set of intervals as a function of the number of intervals tested, and 2) Why are the learning slopes we report quite shallow already in the beginning of the scan?

      Regarding (1), we are not aware of published reports that answer this question directly, and we expect that this will depend on the task that is used. Regarding the comparison to Jazayeri & Shadlen (2010), we believe the learning time course is difficult to compare between our study and theirs. As the reviewer mentioned, our study featured only 4 intervals compared to 11 in their work, based on which we would expect much faster learning in our task than in theirs. We did indeed sample 24 movement directions, but these were irrelevant in terms of learning the interval distribution. Moreover, unlike Jazayeri & Shadlen (2010), our task featured moving stimuli, which may have added additional sensory, motor and proprioceptive information in our study which the participants of the prior study could not rely on.

      Regarding (2), and overlapping with the reviewer’s previous comment, the average learning slope in our study is indeed close to 0.5 already in the first task segment, but we would like to highlight that this is a group-level measure. The learning slopes of some subjects were closer to 1 (i.e. the diagonal in Fig 1B), and those of others were closer to 0 (i.e. the mean) in the beginning of the experiment. The median slope was close to 0.65. Importantly, the slopes of most participants still approached 0.5 in the course of the experiment, and so did even the group-level slope the reviewer is referring to. This also means that participants’ slopes became more similar in the course of the experiment, and they approached 0.5, which we think reflects the optimal trade-off between regressing towards the mean and regressing towards the diagonal (in the data shown in Fig. 1B). This convergence onto the optimal trade-off value can be seen in many measures, including the mean slope (Rebuttal Figure 1A, SFig 1F), the standard deviation in slopes (Rebuttal Figure 1B, SFig 1G) as well as the Precision vs. Accuracy tradeoff (Rebuttal Figure 1C, SFig 1H). We therefore think that our results are well in line with prior literature, even though a direct comparison remains difficult due to differences in the task.

      In response to the reviewer’s comment, and related to their first comment, we made the following addition to the discussion section.

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is well in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      3) I am curious to know whether differences between high-accuracy and medium-accuracy feedback as well as between medium-accuracy and low-accuracy feedback predicted hippocampal activity in the first GLM analysis (middle page 5). Currently, the authors only present the findings for the contrast between high-accuracy and low-accuracy feedback. Examining all feedback levels may provide additional insight into the nature of hippocampal involvement and is perhaps more consistent with the subsequent GLM analysis (bottom page 6) in which, according to my understanding, all improvements across subsequent trials were considered (i.e., from low-accuracy to medium-accuracy; medium-accuracy to high-accuracy; as well as low-accuracy to high-accuracy).

      We thank the reviewer for this thoughtful question, which relates to question 5 by reviewer 1. The reviewer is correct that the contrast shown in Fig 2 does not consider the medium-accuracy feedback levels, and that the model in itself is slightly different from the one used in the subsequent analysis presented in Fig. 3. To reply to this comment as well as to a related one by reviewer 1 together, we therefore repeated the full analysis while modeling the three feedback levels in one parametric regressor, which includes the medium-accuracy feedback trials, and is consistent with the analysis shown in Fig. 3. The results of this new analysis are presented in the new Supplementary Fig. 3C.

      In short, the model included one parametric regressor with three levels reflecting the three types of feedback, and all nuisance regressors remained unchanged. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor. We found that our results presented initially were very robust: Both the observed clusters in the voxel-wise analysis (on whole-brain FWE-corrected levels) as well as the ROI results replicated across the two analyses, and our conclusions therefore remain unchanged.

      We made multiple textual additions to the manuscript to include this new analysis, and we present the results of the analysis including a direct comparison to our initial results in the new Supplementary Fig. 3. The following textual additions were made.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d = -0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d = -0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d = -0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d = -0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. S2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      4) The authors modeled the inter-trial intervals and periods of rest in their univariate GLMs. This approach of modelling all 'down time' can lead to model over-specification and inaccurate parameter estimation (e.g. Pernet, 2014). A comment on this approach as well as consideration of not modelling the inter-trial intervals would be useful.

      This is an important issue that we did not address in our initial manuscript. We are aware of and agree with the reviewer’s general concern about model over-specification, which can be a big problem in regression as it leads to biased estimates. We did examine whether our model was overspecified before running it, but we did not report a formal test of it in the manuscript. We are grateful to be given the opportunity to do so now.

      In response to the reviewer’s comment, we repeated the full analysis shown in Fig. 2 while excluding the nuisance regressors for inter-trial intervals (ITI) and breaks (or inter-session intervals, ISI). All other regressors and analysis steps stayed unchanged relative to the one reported in Fig. 2. The new results are presented in a new Supplementary Figure 3B.

      As in our previous analysis, we again see that the results we initially presented were extremely robust even on whole-brain FWE-corrected levels, as well as on the ROI level. Our conclusions therefore remain unchanged, and the results we presented initially are not affected by potential model overspecification. In addition to the new Supplementary Figure 3B, we made multiple textual changes to the manuscript to describe this new analysis and its implications. Note that we used the same nuisance regressors in all other GLM analyses too, meaning that it is also very unlikely that model overspecification affects any of the other results presented. We thank the reviewer for suggesting this analysis, and we feel including it in the manuscript has further strengthened the points we initially made.
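      The general concern about modeling all 'down time' can be illustrated with a deliberately extreme toy example. It assumes unconvolved boxcar regressors that together cover every scan, in which case they sum to the constant term and the design matrix loses rank. The timeline and the helper name are hypothetical, and this is neither the model used in the study nor the check that was performed on it.

```python
import numpy as np

# Hypothetical 12-scan timeline in which every scan belongs to exactly one state.
task = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0], dtype=float)
iti = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0], dtype=float)
rest = 1.0 - task - iti            # whatever is neither task nor ITI
constant = np.ones_like(task)      # intercept column

# Full model: all states plus the constant. The three state columns sum to
# the constant, so one column is redundant.
full = np.column_stack([task, iti, rest, constant])

# Reduced model: the down-time regressors are dropped.
reduced = np.column_stack([task, constant])

def is_rank_deficient(design):
    """True if the columns of the design matrix are linearly dependent."""
    return bool(np.linalg.matrix_rank(design) < design.shape[1])

print("full model rank-deficient:", is_rank_deficient(full))
print("reduced model rank-deficient:", is_rank_deficient(reduced))
```

      In practice, regressors are convolved with a hemodynamic response function and rarely tile the session perfectly, so the dependence is usually partial rather than exact, which is why re-running the analysis without the ITI and ISI regressors is an informative robustness check.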

      The following additions were made to the manuscript.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI) [...]

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. All other regressors including the main feedback regressors of interest remained unchanged, and we repeated both the voxel-wise and ROI-wise statistical tests as described above (Fig. S2B).

      Page 17: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9 × 10⁻⁴, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressor (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Reviewer #3 (Public Review):

      This paper reports the results of an interesting fMRI study examining the neural correlates of time estimation with an elegant design and a sensorimotor timing task. Results show that hippocampal activity and connectivity are modulated by performance on the task as well as the valence of the feedback provided. This study addresses a very important question in the field which relates to the function of the hippocampus in sensorimotor timing. However, a lack of clarity in the description of the MRI results (and associated methods) currently prevents the evaluation of the results and the interpretations made by the authors. Specifically, the model testing for timing-specific/timing-independent effects is questionable and needs to be clarified. In the current form, several conclusions appear to not be fully supported by the data.

      We thank the reviewer for pointing us to many methodological points that needed clarification. We apologize for the confusion about our methods, which we clarify in the revised manuscript. Please find our responses to the individual points below.

      Major points

      Some methodological points lack clarity which makes it difficult to evaluate the results and the interpretation of the data.

      We really appreciate the many constructive comments below. We feel that clarifying these points improved our manuscript immensely.

      1) It is unclear how the 3 levels of accuracy and feedback (high, medium, and low performance) were computed. Please provide the performance range used for this classification. Was this adjusted to the participants' performance?

      The formula that describes how the response window was computed for the different speed levels was reported in the methods section of the original manuscript on page 13. It reads as follows:

      “The following formula was used to scale the response window width: d ± ((k ∗ d)/2) where d is the target TTC and k is a constant proportional to 0.3 and 0.6 for high and medium accuracy, respectively.”
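The quoted formula can be illustrated with a minimal sketch. It assumes that k takes the values 0.3 and 0.6 directly and that any response outside the medium window counts as low accuracy; neither detail is confirmed by the response letter, and the target TTC of 1.0 s as well as the names `response_window` and `classify_feedback` are hypothetical.

```python
# Sketch of the response-window formula d ± (k * d) / 2.
# Assumption: k is exactly 0.3 (high) and 0.6 (medium); values outside the
# medium window are treated as low accuracy. Function names are hypothetical.

K_BY_LEVEL = {"high": 0.3, "medium": 0.6}


def response_window(target_ttc, k):
    """Return the (lower, upper) bounds of the window d ± (k * d) / 2."""
    half_width = (k * target_ttc) / 2
    return target_ttc - half_width, target_ttc + half_width


def classify_feedback(response_time, target_ttc):
    """Return 'high', 'medium' or 'low' for a response time in seconds."""
    for level in ("high", "medium"):  # check the narrowest window first
        lower, upper = response_window(target_ttc, K_BY_LEVEL[level])
        if lower <= response_time <= upper:
            return level
    return "low"


if __name__ == "__main__":
    target = 1.0  # hypothetical target TTC in seconds
    for level, k in K_BY_LEVEL.items():
        print(level, response_window(target, k))
    print(classify_feedback(1.25, target))
```

Because the window width is proportional to d, the same relative timing error yields the same feedback level at every target duration, which is the calibration the authors describe.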

      In response to the reviewer’s comment, we now additionally report the exact ranges of the different response windows in a new Supplementary Table 1 and refer to it in the Methods section as follows.

      Page 10: To calibrate performance feedback across different TTC durations, the precise response window widths of each feedback level scaled with the speed of the fixation target (Table S1).

      2) The description of the MRI results lacks details. It is not always clear in the results section which models were used and whether parametric modulators were included or not in the model. This makes the results section difficult to follow. For example,

      a) Figure 2: According to the description in the text, it appears that panels A and B report the results of a model with 3 regressors, ie one for each accuracy/feedback level (high, medium, low) without parametric modulators included. However, the figure legend for panel B mentions a parametric modulator suggesting that feedback was modelled for each trial as a parametric modulator. The distinction between these 2 models must be clarified in the result section.

      We thank the reviewer very much for spotting this discrepancy. Indeed, Figure 2 shows the results obtained for a GLM in which we modeled the three feedback levels with separate regressors, not with one parametric regressor. Instead, the latter was the case for Figure 3. We apologize for the confusion and corrected the description in the figure caption, which now reads as follows. The description in the main text and the methods remain unchanged.

      Caption Fig. 2: We plot the beta estimates obtained for the contrast between high vs. low feedback.

      Moreover, note that in response to comment 5 by reviewer 1 and comment 3 by reviewer 2, the revised manuscript now additionally reports the results obtained for the parametric regressor in the new Supplementary Figure 3C. All conclusions remain unchanged.

      Additionally, it is unclear how Figure 2A supports the following statement: "Moreover, the voxel-wise analysis revealed similar feedback-related activity in the thalamus and the striatum (Fig. 2A), and in the hippocampus when the feedback of the current trial was modeled (Fig. S3)." This is confusing as Figure 2A reports an opposite pattern of results between the striatum/thalamus and the hippocampus. It appears that the statement highlighted above is supported by results from a model including current trial feedback as a parametric modulator (reported in Figure S3).

      We agree with the reviewer that our result description was confusing and changed it. It now reads as follows.

      Page 5: Moreover, the voxel-wise analysis revealed feedback-related activity also in the thalamus and the striatum (Fig. 2A) [...]

      Also, note that it is unclear from Figure 2A what is the direction of the contrast highlighting the hippocampal cluster (high vs. low according to the text but the figure shows negative values in the hippocampus and positive values in the thalamus). These discrepancies need to be addressed and the models used to support the statements made in the results sections need to be explicitly described.

      The description of the contrast is correct. Negative values indicate smaller errors and therefore better feedback, which is mentioned in the caption of Fig. 2 as follows:

      “Negative values indicate that smaller errors, and higher-accuracy feedback, led to stronger activity.”

      Note that the timing error determined the feedback, and that we predicted stronger updating and therefore stronger activity for larger errors (similar to a prediction error). We found the opposite. We mention the reasoning behind this analysis at various locations in the manuscript e.g. when talking about the connectivity analysis:

      “We reasoned that larger timing errors and therefore low-accuracy feedback would result in stronger updating compared to smaller timing errors and high-accuracy feedback”

      In response to the reviewer’s remark, we clarified this further by adding the following statement to the result section.

      Page 5: “Using a mass-univariate general linear model (GLM), we modeled the three feedback levels with one regressor each plus additional nuisance regressors (see methods for details). The three feedback levels (high, medium and low accuracy) corresponded to small, medium and large timing errors, respectively. We then contrasted the beta weights estimated for high-accuracy vs. low-accuracy feedback and examined the effects on group-level averaged across runs.”

      b) Connectivity analyses: It is also unclear here which model was used in the PPI analyses presented in Figure 2. As it appears that the seed region was extracted from a high vs. low contrast (without modulators), the PPI should be built using the same model. I assume this was the case as the authors mentioned "These co-fluctuations were stronger when participants performed poorly in the previous trial and therefore when they received low-accuracy feedback." if this refers to low vs. high contrast. Please clarify.

      Yes, the PPI model was built using the same model. We clarified this in the methods section by adding the following statement to the PPI description.

      Page 17: “The PPI model was built using the same model that revealed the main effects used to define the HPC sphere.”

      Yes, the reviewer is correct in thinking that the contrast shows the difference between low vs. high-accuracy feedback. We clarified this in the main text as well as in the caption of Fig. 2.

      Caption Fig 2: [...] We plot results of a psychophysiological interactions (PPI) analysis conducted using the hippocampal peak effects in (A) as a seed for low vs. high-accuracy feedback. [...]

      Page 17: The estimated beta weight corresponding to the interaction term was then tested against zero on the group-level using a t-test implemented in SPM12 (Fig. 2C). The contrast reflects the difference between low vs. high-accuracy feedback. This revealed brain areas whose activity was co-varying with the hippocampus seed ROI as a function of past-trial performance (n-1).
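The logic of the interaction term can be sketched on synthetic data. This is a simplified illustration, not the authors' pipeline: SPM12 forms the PPI term at the neural level after deconvolving the seed signal, a step skipped here. The coding of low-accuracy feedback as +1 and high-accuracy feedback as -1, the block length, and all variable names are assumptions.

```python
import numpy as np

# Simplified PPI sketch on synthetic data (no HRF deconvolution).
rng = np.random.default_rng(0)
n_scans = 200

# Psychological variable: +1 after low-accuracy, -1 after high-accuracy
# feedback (hypothetical alternating blocks of 20 scans).
psych = np.where(np.arange(n_scans) % 40 < 20, 1.0, -1.0)

# Physiological variable: mean-centered seed time course.
seed = rng.standard_normal(n_scans)
seed -= seed.mean()

# Interaction term: seed coupling that differs between conditions.
ppi = seed * psych

# Synthetic target voxel with a known interaction effect of 0.5.
true_ppi_beta = 0.5
target = 0.3 * seed + true_ppi_beta * ppi + 0.1 * rng.standard_normal(n_scans)

# GLM with both main effects, the interaction and a constant.
design = np.column_stack([psych, seed, ppi, np.ones(n_scans)])
betas, *_ = np.linalg.lstsq(design, target, rcond=None)
ppi_beta = betas[2]

print(f"estimated PPI beta: {ppi_beta:.3f}")
```

In a group analysis, one such interaction beta per participant would be tested against zero. A positive value under this coding would indicate stronger coupling with the seed after low-accuracy feedback.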

      c) It is unclear why the model testing TTC-specific / TTC-independent effects (results presented in Figure 3) used 2 parametric modulators (as opposed to building two separate models with a different modulator each). I wonder how the authors dealt with the orthogonalization between parametric modulators with such a model. In SPM, the orthogonalization of parametric modulators is based on the order of the modulators in the design matrix. In this case, parametric modulator #2 would be orthogonalized to the preceding modulator so that a contrast focusing on the parametric modulator #2 would highlight any modulation that is above and beyond that explained by modulator #1. In this case, modulation of brain activity that is TTC-specific would have to be above and beyond a modulation that is TTC-independent to be highlighted. I am unsure that this is what the authors wanted to test here (or whether this is how the MRI design was built). Importantly, this might bias the interpretation of their results as - by design - it is less likely to observe TTC-specific modulations in the hippocampus as there is significant TTC-independent modulation. In other words, switching the order of the modulators in the model (or building two separate models) might yield different results. This is an important point to address as this might challenge the TTC-specific/TTC-independent results described in the manuscript.

      We thank the reviewer for raising this important issue. When running the respective analysis, we made sure that the regressors were not collinear and we therefore did not expect substantial overlap in shared variance between them. However, we agree with the reviewer that orthogonalizing one regressor with respect to the other could still affect the results. To make sure that our expectations were indeed met, we therefore repeated the main analysis twice: 1) switching the order of the modulators and 2) turning orthogonalization off (which is possible in SPM12 unlike in previous versions). In all cases, our key results and conclusions remained unchanged, including the central results of the hippocampus analyses.
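The effect of modulator order under serial orthogonalization can be illustrated with a toy least-squares example. It is a sketch on synthetic data, not a reproduction of SPM12's design-matrix construction; the modulators `m1` and `m2`, their correlation, and the helper names are hypothetical.

```python
import numpy as np

# Toy demonstration of serial orthogonalization of two correlated modulators.
rng = np.random.default_rng(1)
n_trials = 300
m1 = rng.standard_normal(n_trials)
m2 = 0.6 * m1 + 0.8 * rng.standard_normal(n_trials)  # correlated with m1
y = 1.0 * m1 + 1.0 * m2 + 0.5 * rng.standard_normal(n_trials)


def project_out(later, earlier):
    """Mean-center both vectors and remove from `later` what `earlier` explains."""
    later = later - later.mean()
    earlier = earlier - earlier.mean()
    return later - (later @ earlier) / (earlier @ earlier) * earlier


def fit(columns):
    """Least-squares betas for the given columns plus a constant term."""
    design = np.column_stack(columns + [np.ones(n_trials)])
    betas = np.linalg.lstsq(design, y, rcond=None)[0]
    return betas[: len(columns)]


b_none = fit([m1, m2])                     # no orthogonalization
b_order_12 = fit([m1, project_out(m2, m1)])  # m2 orthogonalized to m1
b_order_21 = fit([m2, project_out(m1, m2)])  # m1 orthogonalized to m2

print("no orthogonalization:", b_none)
print("order m1 then m2:    ", b_order_12)
print("order m2 then m1:    ", b_order_21)
```

In this toy case the estimate for the later, orthogonalized modulator is identical to its estimate without orthogonalization, whereas the earlier modulator absorbs the shared variance. Order therefore matters mainly for whichever modulator is entered first, and turning orthogonalization off gives each modulator only its unique contribution.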

      Anterior (ant.) / Posterior (post.) Hippocampus ROI analysis with A) original order of modulators, B) switching the order of the modulators and C) turning orthogonalization of modulators off. ABC) Orange color corresponds to the TTC-independent condition whereas light-blue color corresponds to the TTC-specific condition. Statistics reflect p<0.05 at Bonferroni-corrected levels (*) obtained using a group-level one-tailed one-sample t-test against zero; A) pfwe = 0.017, B) pfwe = 0.039, C) pfwe = 0.039.

      Because orthogonalization did not affect the conclusions, the new manuscript simply reports the analysis for which it was turned off. Note that these new figures are extremely similar to the original figures we presented, which can be seen in the exemplary figure below showing our key results at a liberal threshold for transparency. In addition, we clarified that orthogonalization was turned off in the methods section as follows.

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively, and they were not orthogonalized to each other.

      Comparison of old & new results: also see Fig. 3 and Fig. S5 in manuscript

      d) It is also unclear how the behavioral improvement was coded/classified "we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse"- It appears that improvement computation was based on the change of feedback valence (between high, medium and low). It is unclear why performance wasn't used instead? This would provide a finer-grained modulation?

      We thank the reviewer for the opportunity to clarify this important point. First, we chose to model feedback because it is the feedback that determines whether participants update their “internal model” or not. Without feedback, they would not know how well they performed, and we would not expect to find activity related to sensorimotor updating. Second, behavioral performance and received feedback are tightly correlated, because the former determines the latter. We therefore do not expect to see major differences in results obtained between the two. Third, we did in fact model both feedback and performance in two independent GLMs, even though the way the results were reported in the initial submission made it difficult to compare the two.

      Figure 4 shows the results obtained when modeling behavioral performance in the current trial as an F-contrast, and Supplementary Fig 4 shows the results when modeling the feedback received in the current trial as a t-contrast. While the voxel-wise t-maps/F-maps are also quite similar, we now additionally report the t-contrast for the behavioral-performance GLM in a new Supplementary Figure 4C. The t-maps obtained for these two different analyses are extremely similar, confirming that the direction of the effects as well as their interpretation remain independent of whether feedback or performance is modeled.

      The revised manuscript refers to the new Supplementary Figure 4C as follows.

      Page 17: In two independent GLMs, we analyzed the time courses of all voxels in the brain as a function of behavioral performance (i.e. TTC error) in each trial, and as a function of feedback received at the end of each trial. The models included one mean-centered parametric regressor per run, modeling either the TTC error or the three feedback levels in each trial, respectively. Note that the feedback itself was a function of TTC error in each trial [...] We estimated weights for all regressors and conducted a t-test against zero using SPM12 for our feedback and performance regressors of interest on the group level (Fig. S4A). [...]

      Page 17: In addition to the voxel-wise whole-brain analyses described above, we conducted independent ROI analyses for the anterior and posterior sections of the hippocampus (Fig. S2A). Here, we tested the beta estimates obtained in our first-level analysis for the feedback and performance regressors of interest (Fig. S4B; two-tailed one-sample t tests: anterior HPC, t(33) = -5.92, p = 1.2 × 10⁻⁶, pfwe = 2.4 × 10⁻⁶, d=-1.02, CI: [-1.45, -0.6]; posterior HPC, t(33) = -4.07, p = 2.7 × 10⁻⁴, pfwe = 5.4 × 10⁻⁴, d=-0.7, CI: [-1.09, -0.32]). See section "Regions of interest definition and analysis" for more details.
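The reported statistics for this ROI analysis can be cross-checked with a short sketch. It rests on two assumptions that the response letter does not state: that Cohen's d for a one-sample t test was computed as t divided by the square root of the sample size (n = df + 1 = 34), and that pfwe is a Bonferroni correction over the two hippocampal ROIs. The helper names are hypothetical.

```python
import math

# Consistency check of reported ROI statistics under stated assumptions:
# d = t / sqrt(n) for a one-sample t test, and p_fwe = Bonferroni over 2 ROIs.


def cohens_d_from_t(t_value, df):
    """One-sample Cohen's d recovered from t and its degrees of freedom."""
    n = df + 1
    return t_value / math.sqrt(n)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Values copied from the quoted manuscript passage (Fig. S4B).
reported = {
    "anterior HPC": {"t": -5.92, "df": 33, "p": 1.2e-6},
    "posterior HPC": {"t": -4.07, "df": 33, "p": 2.7e-4},
}

for roi, stats in reported.items():
    d = cohens_d_from_t(stats["t"], stats["df"])
    p_fwe = bonferroni(stats["p"], n_tests=2)
    print(f"{roi}: d = {d:.2f}, p_fwe = {p_fwe:.1e}")
```

Under these assumptions the recomputed values agree with the reported d and pfwe for both ROIs to the stated precision.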

      If the feedback valence was used to classify trials as improved or not, how was this modelled (one regressor for improved, one for no improvement? As opposed to a parametric modulator with performance improvement?).

      We apologize for the lack of clarity regarding our regressor design. In response to this comment, we adapted the corresponding paragraph in the methods to express more clearly that improvement trials and no-improvement trials were modeled with two separate parametric regressors - in line with the reviewer’s understanding. The new paragraph reads as follows.

      Page 18: One regressor modeled the main effect of the trial and two parametric regressors modeled the following contrasts: Parametric regressor 1: trials in which behavioral performance improved vs. parametric regressor 2: trials in which behavioral performance did not improve or got worse relative to the previous trial.

      Last, it is also unclear how ITI was modelled as a regressor. Did the authors mean a parametric modulator here? Some clarification on the events modelled would also be helpful. What was the onset of a trial in the MRI design? The start of the trial? Then end? The onset of the prediction time?

      The inter-trial intervals (ITIs) were modeled as a boxcar regressor convolved with the hemodynamic response function. They describe the time between the feedback-phase offset and the subsequent trial onset. Moreover, the start of the trial was the moment when the visual-tracking target started moving after the ITI, whereas the trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The onset of the “prediction time” was the moment in which the visual-tracking target stopped moving, prompting participants to estimate the time-to-contact. We now explain this more clearly in the methods as shown below.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI), which were all convolved with the canonical hemodynamic response function of SPM12. The start of the trial was considered as the trial onsets for modeling (i.e. the time when the visual-tracking target started moving). The trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The ITI was the time between the offset of the feedback-phase and the subsequent trial onset.
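A boxcar regressor convolved with a hemodynamic response function can be sketched as follows. The double-gamma shape used here (peak near 5 s, undershoot scaled by 1/6) is a common approximation and not SPM12's exact canonical implementation; the ITI onsets, durations, sampling step and function names are hypothetical.

```python
import math

import numpy as np

# Sketch: ITI boxcar regressor convolved with an approximate double-gamma HRF.


def gamma_pdf(t, shape):
    """Gamma density with unit scale, zero for t <= 0."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    positive = t > 0
    out[positive] = t[positive] ** (shape - 1) * np.exp(-t[positive]) / math.gamma(shape)
    return out


def double_gamma_hrf(dt, duration=32.0):
    """Approximate HRF sampled every `dt` seconds, normalized to unit sum."""
    t = np.arange(0, duration, dt)
    hrf = gamma_pdf(t, 6.0) - gamma_pdf(t, 16.0) / 6.0
    return hrf / hrf.sum()


def boxcar(onsets, durations, n_samples, dt):
    """Return a 0/1 vector that is 1 during each [onset, onset + duration)."""
    x = np.zeros(n_samples)
    for onset, duration in zip(onsets, durations):
        start = int(round(onset / dt))
        stop = int(round((onset + duration) / dt))
        x[start:stop] = 1.0
    return x


dt = 0.1                        # seconds per sample
n_samples = 600                 # 60 s of simulated time
iti_onsets = [5.0, 25.0, 45.0]  # feedback-phase offsets (hypothetical)
iti_durations = [2.0, 2.0, 2.0]  # time until the next trial onset

iti_box = boxcar(iti_onsets, iti_durations, n_samples, dt)
iti_regressor = np.convolve(iti_box, double_gamma_hrf(dt))[:n_samples]

print("peak of ITI regressor at", np.argmax(iti_regressor) * dt, "s")
```

The same construction applies to the feedback, button-press and ISI regressors; only the onsets and durations change. The convolved regressor would then be downsampled to the scan times before entering the GLM.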

      On a related note, in response to question 4 by reviewer 2, we now repeated one of the main analyses (Fig. 2) without modeling the ITI (as well as the Inter-session interval, ISI). We found that our key results and conclusions are independent of whether or not these time points were modeled. These new results are presented in the new Supplementary Figure 3B.

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. [...]

      1. Perhaps as a result of a lack of clarity in the result section and the MRI methods, it appears that some conclusions presented in the result section are not supported by the data. E.g. "Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time." The data show that hippocampal activity is higher during and after an accurate trial. This pattern of results could be attributed to various processes such as e.g. reward or learning etc. I would recommend not providing such interpretations in the result section and addressing these points in the discussion.

      Similar to above, statements like "These results suggest that the hippocampus updates information that is independent of the target TTC". The data show that higher hippocampal activity is linked to greater improvement across trials independent of the timing of the trial. The point about updating is rather speculative and should be presented in the discussion instead of the result section.

      The reviewer is referring to two statements in the results section that reflect our interpretation rather than a description of the results. In response to the reviewer’s comment, we therefore removed the following statement from the results.

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      In addition, we replaced the remaining statement by the following. We feel this new statement makes clear why we conducted the analysis that is described without offering an interpretation of the results that were presented before.

      Page 8: We reasoned that updating TTC-independent information may support generalization performance by means of regularizing the encoded intervals based on the temporal context in which they were encoded.

    1. Author Response:

      Reviewer #1 (Public Review):

      The manuscript provides very high quality single-cell physiology combined with population physiology to reveal distinct roles for two anatomically different LN populations in the cockroach antennal lobe. The conclusion that non-spiking LNs with graded responses show glomerular-restricted responses to odorants and spiking LNs show similar responses across glomeruli is generally supported with strong and clean data, although the possibility of selective interglomerular inhibition has not been ruled out. On balance, the single-cell biophysics and physiology provides foundational information useful for well-grounded mechanistic understanding of how information is processed in insect antennal lobes, and how each LN class contributes to odor perception and behavior.

      Thank you for this positive feedback.

      Reviewer #2 (Public Review):

      The manuscript "Task-specific roles of local interneurons for inter- and intraglomerular signaling in the insect antennal lobe" evaluates the spatial distribution of calcium signals evoked by odors in two major classes of olfactory local neurons (LNs) in the cockroach P. americana, which are defined by their physiological and morphological properties. Spiking type I LNs have a patchy innervation pattern of a subset of glomeruli, whereas non-spiking type II LNs innervate almost all glomeruli. The authors' overall conclusion is that odors evoke calcium signals globally and relatively uniformly across glomeruli in type I spiking LNs, and LN neurites in each glomerulus are broadly tuned to odor. In contrast, the authors conclude that they observe odor-specific patterns of calcium signals in type II nonspiking LNs, and LN neurites in different glomeruli display distinct local odor tuning. Blockade of action potentials in type I LNs eliminates global calcium signaling and decorrelates glomerular tuning curves, converting their response profile to be more similar to that of type II LNs. From these conclusions, the authors infer a primary role of type I LNs in interglomerular signaling and type II LNs in intraglomerular signaling.

      The question investigated by this study - to understand the computational significance of different types of LNs in olfactory circuits - is an important and significant problem. The design of the study is straightforward, but methodological and conceptual gaps raise some concerns about the authors' interpretation of their results. These can be broadly grouped into three main areas.

      1) The comparison of the spatial (glomerular) pattern of odor-evoked calcium signals in type I versus type II LNs may not necessarily be a true apples-to-apples comparison. Odor-evoked calcium signals are an order of magnitude larger in type I versus type II cells, which will lead to a higher apparent correlation in type I cells. In type IIb cells, and type I cells with sodium channel blockade, odor-evoked calcium signals are much smaller, and the method of quantification of odor tuning (normalized area under the curve) is noisy. Compare, for instance, ROI 4 & 15 (Figure 4) or ROI 16 & 23 (Figure 5) which are pairs of ROIs that their quantification concludes have dramatically different odor tuning, but which visual inspection shows to be less convincing. The fact that glomerular tuning looks more correlated in type IIa cells, which have larger, more reliable responses compared to type IIb cells, also supports this concern.

      We agree with the reviewer that "the comparison of the spatial (glomerular) pattern of odor-evoked calcium signals is not necessarily a true apples-to-apples comparison". Type I and type II LNs are different neuron types. Given their different physiology and morphology, this is not even close to a "true apples-to-apples comparison" - and a key point of the manuscript is to show just that.

      As we have emphasized in response to Essential Revision 1, the differences in Ca2+ signals are not an experimental shortcoming but a physiologically relevant finding per se. These data, especially when combined with the electrophysiological data, contribute to a better understanding of these neurons’ physiological and computational properties.

      It is physiologically determined that the Ca2+ signals during odorant stimulation in the type II LNs are smaller than in type I LNs. And yes, the signals are small because small postsynaptic Ca2+ currents predominantly cause the signals. Regardless of the imaging method, this naturally reduces the signal-to-noise ratio, making it more challenging to detect signals. To address this issue, we used a well-defined and reproducible method for analyzing these signals. In this context, we do not agree with the very general criticism of the method. The reviewer questions whether the signals are odorant-induced or just noise (see also minor point 12). If we had recorded only noise, we would expect all tuning curves (for each odorant and glomerulus) to be the same. In this context, we disagree with the reviewer's statement that the tuning curves do not represent the Ca2+ signals in Figure 4 (ROI 4 and 15) and Figure 5 (ROI 16 and 23). This debate reflects precisely the kind of 'visual inspection bias' that our clearly defined analysis aims to avoid. On close inspection, the differences in Ca2+ signals can indeed be seen. Figure II (of this letter) shows the signals from the glomeruli in question at higher magnification. The sections of the recordings that were used for the tuning curves are marked in red.

      Figure II: Ca2+ signals of selected glomeruli that were questioned by the reviewer.

      2) An additional methodological issue that compounds the first concern is that calcium signals are imaged with wide-field imaging, and signals from each ROI likely reflect out of plane signals. Out of plane artifacts will be larger for larger calcium signals, which may also make it impossible to resolve any glomerular-specific signals in the type I LNs.

      Thank you for allowing us to clarify this point. The reviewer comment implies that the different amplitudes of the Ca2+ signals indicate some technical-methodological deficiency (poorly chosen odor concentration). But in fact, this is a key finding of this study that is physiologically relevant and crucial for understanding the function of the neurons studied. These very differences in the Ca2+ signals are evidence of the different roles these neurons play in AL. The different signal amplitudes directly show the distinct physiology and Ca2+ sources that dominate the Ca2+ signals in type I and type II LNs. Accordingly, it is impractical to equalize the magnitude of Ca2+ signals under physiological conditions by adjusting the concentration of odor stimuli.

      In the following, we address these issues in more detail: 1) Imaging Method 2) Odorant stimulation 3) Cell type-specific Ca2+ signals

      1) Imaging Method:

      Of course, we agree with the reviewer comment that out-of-focus and out-of-glomerulus fluorescence can potentially affect measurements, especially in widefield optical imaging in thick tissue. This issue was carefully addressed in initial experiments. In type I LNs, which innervate a subset of glomeruli, we detected fluorescence signals, which matched the spike pattern of the electrophysiological recordings 1:1, only in the innervated glomeruli. In the non-innervated ROIs (glomeruli), we detected no or comparatively very little fluorescence, even in glomeruli directly adjacent to innervated glomeruli.

      To illustrate this, Figure I (of this response letter) shows measurements from an AL in which a uniglomerular projection neuron was investigated in a set of experiments that were not directly related to the current study. In this experiment, a train of action potentials was induced by depolarizing current. The traces show the action potential-induced fluorescent signals from the innervated glomerulus (glomerulus #1) and the directly adjacent glomeruli.

      These results do not entirely exclude that the large Ca2+ signals from the innervated LN glomeruli may include out-of-focus and out-of-glomerulus fluorescence, but they do show that the bulk of the signal is generated from the recorded neuron in the respective glomeruli.

      Figure I: Simultaneous electrophysiological and optophysiological recordings of a uniglomerular projection neuron using the ratiometric Ca2+ indicator fura-2. The projection neuron has its arborization in glomerulus 1. The train of action potentials was induced with a depolarizing current pulse (grey bar).

      2) Odorant Stimulation: It is important to note that the odorant concentration cannot be varied freely. For these experiments, the odorant concentrations have to be within a 'physiologically meaningful' range, which means: On the one hand, they have to be high enough to induce a clear response in the projection neurons (the antennal lobe output). On the other hand, however, the concentration was not allowed to be so high that the ORNs were stimulated nonspecifically. These criteria were met with the used concentrations since they induced clear and odorant-specific activity in projection neurons.

      3) Cell type-specific Ca2+ signals:

      The differences in Ca2+ signals are described and discussed in some detail throughout the text (e.g., page 6, lines 119-136; page 9, lines 193-198; page 10-11, lines 226-235; page 14-15, line 309-333). Briefly: In spiking type I LNs, the observed large Ca2+ signals are mediated mainly by voltage-dependent Ca2+ channels activated by the Na+-driven action potential's strong depolarization. These large Ca2+ signals mask smaller signals that originate, for example, from excitatory synaptic input (i.e., evoked by ligand-activated Ca2+ conductances). Preventing the firing of action potentials can unmask the ligand-activated signals, as shown in Figure 4 (see also minor comments 8. and 10.). In nonspiking type II LNs, the action potential-generated Ca2+ signals are absent; accordingly, the Ca2+ signals are much smaller. In our model, the comparatively small Ca2+ signals in type II LNs are mediated mainly by (synaptic) ligand-gated Ca2+ conductances, possibly with contributions from voltage-gated Ca2+ channels activated by the comparatively small depolarization (compared with type I LNs).

      Accordingly, our main conclusion, that spiking LNs play a primary role in interglomerular signaling, while nonspiking LNs play an essential role in intraglomerular signaling, can be DIRECTLY inferred from the differences in odorant induced Ca2+ signals alone.

      a) Type I LN: The large, simultaneous, and uniform Ca2+ signals in the innervated glomeruli of an individual type I LN clearly show that they are triggered in each glomerulus by the propagated action potentials, which conclusively shows lateral interglomerular signal propagation.

      b) Type II LNs: In the type II LNs, we observed relatively small Ca2+ signals in single glomeruli or a small fraction of glomeruli of a given neuron. Importantly, the time course and amplitude of the Ca2+ signals varied between different glomeruli and different odors. Considering that type II LNs, in principle, can generate large voltage-activated Ca2+ currents (larger than type I LNs; page 4, lines 82-86, Husch et al. 2009a,b; Fusca and Kloppenburg 2021), these data suggest that in type II LNs electrical or Ca2+ signals spread only within the same glomerulus; and laterally only to glomeruli that are electrotonically close to the odorant stimulated glomerulus.

      Taken together, this means that our conclusions regarding inter- and intraglomerular signaling can be derived from the simultaneously recorded amplitudes and the dynamics of the membrane potential and Ca2+ signals alone. This also means that although the correlation analyses support this conclusion nicely, the actual conclusion does not ultimately depend on the correlation analysis. We had tried to express this with the wording, “Quantitatively, this is reflected in the glomerulus-specific odorant responses and the diverse correlation coefficients across…” (page 10, lines 216-217) and “…This is also reflected in the highly correlated tuning curves in type I LNs and low correlations between tuning curves in type II LNs” (page 13, lines 293-295).

      3) Apart from the above methodological concerns, the authors' interpretation of these data as supporting inter- versus intra-glomerular signaling are not well supported. The odors used in the study are general odors that presumably excite feedforward input to many glomeruli. Since the glomerular source of excitation is not determined, it's not possible to assign the signals in type II LNs as arising locally - selective interglomerular signal propagation is entirely possible. Likewise, the study design does not allow the authors to rule out the possibility that significant intraglomerular inhibition may be mediated by type I LNs.

      The reviewer addresses an important point. However, from the comment, we get the impression that he/she has not taken into account the entire data set and the DISCUSSION. In fact, this topic has already been discussed in some detail in the original version (page 12, lines 268-271; page 15-16; lines 358-374). This section even has a respective heading: "Inter- and intraglomerular signaling via nonspiking type II LNs" (page 15, line 338). We apologize if our explanations regarding this point were unclear, but we also feel that the reviewer is arguing against statements that we did not make in this way.

      a) In 11 out of 18 type II LNs we found 'relatively uncorrelated' (r=0.43±0.16, N=11) glomerular tuning curves. These experiments argue strongly for a 'local excitation' with restricted signal propagation and do not provide support for interglomerular signal propagation. Thus, these results support our interpretation of intraglomerular signaling in this set of neurons.

      b) In 7 out of 18 experiments, we observed 'higher correlated' glomerular tuning curves (r=0.78±0.07, N=7). We agree with the reviewer that this could be caused by various mechanisms, including simultaneous input to several glomeruli or by interglomerular signaling. Both possibilities were mentioned and discussed in the original version of the manuscript (page 12, lines 268-271; page 15-16; lines 358-374). In the Discussion, we considered the latter possibility in particular (but not exclusively) for the type IIa1 neurons that generate spikelets. Their comparatively stronger active membrane properties may be particularly suitable for selective signal transduction between glomeruli.

      c) We have not ruled out that local signaling exists in type I LNs in addition to interglomerular signaling. The highly localized Ca2+ signals in type I LNs, which we observed when Na+-driven action potential generation was prevented, may support this interpretation. However, we would like to reiterate that the simultaneous electrophysiological and optophysiological recordings, which show highly correlated glomerular Ca2+ dynamics that match 1:1 with the simultaneously recorded action potential pattern, clearly suggest interglomerular signaling. We also want to emphasize that this interpretation is in agreement with previous models derived from electrophysiological studies (Assisi et al., 2011; Fujiwara et al., 2014; Hong and Wilson, 2015; Nagel and Wilson, 2016; Olsen and Wilson, 2008; Sachse and Galizia, 2002; Wilson, 2013).

      In light of the reviewer's comment(s), we have modified the text to clarify these points (page 14, lines 317-319).
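
      As an illustration of the kind of per-neuron summary quoted in a) and b) above, the following is a minimal sketch of a mean pairwise Pearson correlation between glomerular tuning curves. It is not the analysis code used in the study: the function name, the array layout (one row per glomerulus, one column per odor), and the toy numbers are assumptions made only for this example.

```python
import numpy as np

def mean_pairwise_tuning_correlation(responses):
    """Mean Pearson r over all pairs of glomerular tuning curves.

    responses: array of shape (n_glomeruli, n_odors); hypothetical layout,
    each row is the response amplitude of one glomerulus to each odor.
    """
    responses = np.asarray(responses, dtype=float)
    r = np.corrcoef(responses)              # (n_glomeruli, n_glomeruli) matrix
    upper = np.triu_indices_from(r, k=1)    # each pair once, diagonal excluded
    return float(r[upper].mean())

# Toy data (NOT from the study): 3 glomeruli x 9 odors.
base = np.array([1.0, 0.2, 0.8, 0.1, 0.5, 0.9, 0.3, 0.7, 0.4])
# Same tuning in every glomerulus, only scaled/offset (type I-like pattern).
uniform = np.vstack([base, 2.0 * base, 0.5 * base + 0.1])
# Independent tuning in every glomerulus (type II-like pattern).
rng = np.random.default_rng(0)
diverse = rng.random((3, 9))

r_uniform = mean_pairwise_tuning_correlation(uniform)   # close to 1.0
r_diverse = mean_pairwise_tuning_correlation(diverse)
print(r_uniform, r_diverse)
```

      In this toy case, tuning curves that differ only by scale and offset give a mean correlation of essentially 1, whereas independently drawn curves give a lower value.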

      Reviewer #3 (Public Review):

      To elucidate the role of the two types of LNs, the authors combined whole-cell patch clamp recordings with calcium imaging via single cell dye injection. This method enables to monitor calcium dynamics of the different axons and branches of single LNs in identified glomeruli of the antennal lobe, while the membrane potential can be recorded at the same time. The authors recorded in total from 23 spiking (type I LN) and 18 non-spiking (type II LN) neurons to a set of 9 odors and analyzed the firing pattern as well as calcium signals during odor stimulation for individual glomeruli. The recordings reveal on one side that odor-evoked calcium responses of type I LNs are odor-specific, but homogeneous across glomeruli and therefore highly correlated regarding the tuning curves. In contrast, odor-evoked responses of type II LNs show less correlated tuning patterns and rather specific odor-evoked calcium signals for each glomerulus. Moreover the authors demonstrate that both LN types exhibit distinct glomerular branching patterns, with type I innervating many, but not all glomeruli, while type II LNs branch in all glomeruli.

      From these results and further experiments using pharmacological manipulation, the authors conclude that type I LNs rather play a role regarding interglomerular inhibition in form of lateral inhibition between different glomeruli, while type II LNs are involved in intraglomerular signaling by developing microcircuits in individual glomeruli.

      In my opinion the methodological approach is quite challenging and all subsequent analyses have been carried out thoroughly. The obtained data are highly relevant, but provide rather an indirect proof regarding the distinct roles of the two LN types investigated. Nevertheless, the conclusions are convincing and the study generally represents a valuable and important contribution to our understanding of the neuronal mechanisms underlying odor processing in the insect antennal lobe. I think the authors should emphasize their take-home messages and resulting conclusions even stronger. They do a good job in explaining their results in their discussion, but need to improve and highlight the outcome and meaning of their individual experiments in their results section.

      Thank you for this positive feedback.

      References:

      Assisi, C., Stopfer, M., Bazhenov, M., 2011. Using the structure of inhibitory networks to unravel mechanisms of spatiotemporal patterning. Neuron 69, 373–386. https://doi.org/10.1016/j.neuron.2010.12.019

      Das, S., Trona, F., Khallaf, M.A., Schuh, E., Knaden, M., Hansson, B.S., Sachse, S., 2017. Electrical synapses mediate synergism between pheromone and food odors in Drosophila melanogaster. Proc Natl Acad Sci U S A 114, E9962–E9971. https://doi.org/10.1073/pnas.1712706114

      Fujiwara, T., Kazawa, T., Haupt, S.S., Kanzaki, R., 2014. Postsynaptic odorant concentration dependent inhibition controls temporal properties of spike responses of projection neurons in the moth antennal lobe. PLOS ONE 9, e89132. https://doi.org/10.1371/journal.pone.0089132

      Fusca, D., Husch, A., Baumann, A., Kloppenburg, P., 2013. Choline acetyltransferase-like immunoreactivity in a physiologically distinct subtype of olfactory nonspiking local interneurons in the cockroach (Periplaneta americana). J Comp Neurol 521, 3556–3569. https://doi.org/10.1002/cne.23371

      Fuscà, D., Kloppenburg, P., 2021. Odor processing in the cockroach antennal lobe-the network components. Cell Tissue Res.

      Hong, E.J., Wilson, R.I., 2015. Simultaneous encoding of odors by channels with diverse sensitivity to inhibition. Neuron 85, 573–589. https://doi.org/10.1016/j.neuron.2014.12.040

      Husch, A., Paehler, M., Fusca, D., Paeger, L., Kloppenburg, P., 2009a. Calcium current diversity in physiologically different local interneuron types of the antennal lobe. J Neurosci 29, 716–726. https://doi.org/10.1523/JNEUROSCI.3677-08.2009

      Husch, A., Paehler, M., Fusca, D., Paeger, L., Kloppenburg, P., 2009b. Distinct electrophysiological properties in subtypes of nonspiking olfactory local interneurons correlate with their cell type-specific Ca2+ current profiles. J Neurophysiol 102, 2834–2845. https://doi.org/10.1152/jn.00627.2009

      Nagel, K.I., Wilson, R.I., 2016. Mechanisms Underlying Population Response Dynamics in Inhibitory Interneurons of the Drosophila Antennal Lobe. J Neurosci 36, 4325–4338. https://doi.org/10.1523/JNEUROSCI.3887-15.2016

      Neupert, S., Fusca, D., Kloppenburg, P., Predel, R., 2018. Analysis of single neurons by perforated patch clamp recordings and MALDI-TOF mass spectrometry. ACS Chem Neurosci 9, 2089–2096.

      Olsen, S.R., Bhandawat, V., Wilson, R.I., 2007. Excitatory interactions between olfactory processing channels in the Drosophila antennal lobe. Neuron 54, 89–103. https://doi.org/10.1016/j.neuron.2007.03.010

      Olsen, S.R., Wilson, R.I., 2008. Lateral presynaptic inhibition mediates gain control in an olfactory circuit. Nature 452, 956–960. https://doi.org/10.1038/nature06864

      Sachse, S., Galizia, C., 2002. Role of inhibition for temporal and spatial odor representation in olfactory output neurons: a calcium imaging study. J Neurophysiol. 87, 1106–17.

      Shang, Y., Claridge-Chang, A., Sjulson, L., Pypaert, M., Miesenbock, G., 2007. Excitatory Local Circuits and Their Implications for Olfactory Processing in the Fly Antennal Lobe. Cell 128, 601–612.

      Wilson, R.I., 2013. Early olfactory processing in Drosophila: mechanisms and principles. Annu Rev Neurosci 36, 217–241. https://doi.org/10.1146/annurev-neuro-062111-150533

      Yaksi, E., Wilson, R.I., 2010. Electrical coupling between olfactory glomeruli. Neuron 67, 1034–1047. https://doi.org/10.1016/j.neuron.2010.08.041

    1. Author Response

      Reviewer #1 (Public Review):

      In computational modeling studies of behavioral data using reinforcement learning models, it has been implicitly assumed that parameter estimates generalize across tasks (generalizability) and that each parameter reflects a single cognitive function (interpretability). In this study, the authors examined the validity of these assumptions through a detailed analysis of experimental data across multiple tasks and age groups. The results showed that some parameters generalize across tasks, while others do not, and that interpretability is not sufficient for some parameters, suggesting that the interpretation of parameters needs to take into account the context of the task. Some researchers may have doubted the validity of these assumptions, but to my knowledge, no study has explicitly examined their validity. Therefore, I believe this research will make an important contribution to researchers who use computational modeling. In order to clarify the significance of this research, I would like the authors to consider the following points.

      1) Effects of model misspecification

      In general, model parameter estimates are influenced by model misspecification. Specifically, if components of the true process are not included in the model, the estimates of other parameters may be biased. The authors mentioned a little about model misspecification in the Discussion section, but they do not mention the possibility that the results of this study itself may be affected by it. I think this point should be discussed carefully.

      The authors stated that they used state-of-the-art RL models, but this does not necessarily mean that the models are correctly specified. For example, it is known that if there is history dependence in the choice itself and it is not modeled properly, the learning rates depending on valence of outcomes (alpha+, alpha-) are subject to biases (Katahira, 2018, J Math Psychol). In the authors' study, the effect of one previous choice was included in the model as choice persistence, p. However, it has been pointed out that not including the effect of a choice made more than two trials ago in the model can also cause bias (Katahira, 2018). The authors showed that the learning rate for positive RPE, alpha+, was inconsistent across tasks. But since choice persistence was included only in Task B, it is possible that the bias of alpha+ was different between tasks due to individual differences in choice persistence, and thus did not generalize.

      However, I do not believe that it is necessary to perform a new analysis using the model described above. As for extending the model, I don't think it is possible to include all combinations of possible components. As is often said, every model is wrong, and only to varying degrees. What I would like to encourage the authors to do is to discuss such issues and then consider their position on the use of the present model. Even if the estimation results of this model are affected by misspecification, it is a fact that such a model is used in practice, and I think it is worthwhile to discuss the nature of the parameter estimates.

      We thank the reviewer for this thoughtful question, and have added the following paragraph to the discussion section that aims to address it:

      “Another concern relates to potential model misspecification and its effects on model parameter estimates: If components of the true data-generating process are not included in a model (i.e., a model is misspecified), estimates of existing model parameters may be biased. For example, if choices have an outcome-independent history dependence that is not modeled properly, learning rate parameters have been shown to be biased [63]. Indeed, we found that learning rate parameters were inconsistent across the tasks in our study, and two of our models (A and C) did not model history dependence in choice, while the third (model B) only included the effect of one previous choice (persistence parameter), but no multi-trial dependencies. It is hence possible that the differences in learning rate parameters between tasks were caused by differences in the bias induced by misspecification of history dependence, rather than a lack of generalization. Though pressing, however, this issue is difficult to resolve in practice, because it is impossible to include all combinations of possible parameters in all computational models, i.e., to exhaustively search the space of possible models ("Every model is wrong, but to varying degrees"). Furthermore, even though our models were likely affected by some degree of misspecification, the research community is currently using models of this kind. Our study therefore sheds light on generalizability and interpretability in a realistic setting, which likely includes models with varying degrees of misspecification. Lastly, our models were fitted using robust computational tools and achieved good behavioral recovery (Fig. D.7), which also reduces the likelihood of model misspecification.”
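
      To make the history-dependence argument concrete, the following is a minimal sketch of a two-armed bandit learner with valence-dependent learning rates and a one-trial persistence bonus. It does not reproduce models A-C from the paper: the task, the function names, and the parameter names (`alpha_pos`, `alpha_neg`, `beta`, `persist`) are assumptions introduced for illustration only.

```python
import numpy as np

def simulate_agent(n_trials, alpha_pos, alpha_neg, beta, persist,
                   p_reward=(0.8, 0.2), seed=0):
    """Simulate a hypothetical 2-armed bandit RL agent.

    alpha_pos / alpha_neg: learning rates for positive / negative RPEs.
    beta: inverse temperature of the softmax choice rule.
    persist: outcome-independent bonus for repeating the previous choice.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                      # action values
    prev_choice = None
    choices = np.empty(n_trials, dtype=int)
    rewards = np.empty(n_trials, dtype=int)
    for t in range(n_trials):
        logits = beta * q
        if prev_choice is not None:
            logits[prev_choice] += persist   # one-trial history dependence
        logits -= logits.max()               # numerical stability
        p = np.exp(logits) / np.exp(logits).sum()
        c = rng.choice(2, p=p)
        r = int(rng.random() < p_reward[c])
        rpe = r - q[c]                       # reward prediction error
        q[c] += (alpha_pos if rpe > 0 else alpha_neg) * rpe
        choices[t], rewards[t] = c, r
        prev_choice = c
    return choices, rewards

def stay_probability(choices):
    """Fraction of trials on which the previous choice was repeated."""
    return float(np.mean(choices[1:] == choices[:-1]))

# With beta = 0, values do not influence choice, so any tendency to repeat
# choices comes from the persistence term alone.
c_no_persist, _ = simulate_agent(2000, 0.3, 0.1, beta=0.0, persist=0.0)
c_persist, _ = simulate_agent(2000, 0.3, 0.1, beta=0.0, persist=2.0)
print(stay_probability(c_no_persist), stay_probability(c_persist))
```

      A model without the `persist` term, fitted to data generated with it, would have to account for the extra repeated choices through its remaining parameters, which is the kind of misspecification-induced bias discussed in the paragraph above.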

      2) Issue of reliability of parameter estimates

      I think it is important to consider not only the bias in the parameter estimates, but also the issue of reliability, i.e., how stable the estimates will be when the same task is repeated with the same individual. For the task used in this study, has test-retest reliability been examined in previous studies? I think that parameters with low reliability will inevitably have low generalizability to other tasks. In this study, the use of three tasks seems to have addressed this issue without explicitly considering the reliability, but I would like the author to discuss this issue explicitly.

      We thank the reviewer for this useful comment, and have added the following paragraph to the discussion section to address it:

      “Furthermore, parameter generalizability is naturally bounded by parameter reliability, i.e., the stability of parameter estimates when participants perform the same task twice (test-retest reliability) or when estimating parameters from different subsets of the same dataset (split-half reliability). The reliability of RL models has recently become the focus of several parallel investigations [...], some employing very similar tasks to ours [...]. The investigations collectively suggest that excellent reliability can often be achieved with the right methods, most notably by using hierarchical model fitting. Reliability might still differ between tasks or models, potentially being lower for learning rates than other RL parameters [...], and differing between tasks (e.g., compare [...] to [...]). In this study, we used hierarchical fitting for tasks A and B and assessed a range of qualitative and quantitative measures of model fit for each task [...], boosting our confidence in high reliability of our parameter estimates, and the conclusion that the lack of between-task parameter correlations was not due to a lack of parameter reliability, but a lack of generalizability. This conclusion is further supported by the fact that larger between-task parameter correlations (r>0.5) than those observed in humans were attainable---using the same methods---in a simulated dataset with perfect generalization.”
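
      As a minimal sketch of the split-half reliability mentioned above, the example below correlates parameter estimates obtained from two halves of a dataset. The Spearman-Brown step-up correction is a standard psychometric formula added here for illustration; the quoted paragraph does not state that it was used. The function name and the toy data are likewise assumptions.

```python
import numpy as np

def split_half_reliability(est_half1, est_half2):
    """Pearson r between per-participant estimates from two data halves,
    plus the Spearman-Brown corrected value 2r / (1 + r)."""
    r = float(np.corrcoef(est_half1, est_half2)[0, 1])
    corrected = 2.0 * r / (1.0 + r)
    return r, corrected

# Toy data (NOT from the study): each participant has a "true" parameter,
# and each half yields a noisy estimate of it.
rng = np.random.default_rng(0)
true_param = rng.normal(size=200)
half1 = true_param + 0.5 * rng.normal(size=200)
half2 = true_param + 0.5 * rng.normal(size=200)

r_half, r_corrected = split_half_reliability(half1, half2)
print(r_half, r_corrected)
```

      Because a between-task correlation cannot be expected to exceed what the reliability of each estimate allows, a figure of this kind gives a rough upper bound against which between-task correlations can be read.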

      3) About PCA

      In this paper, principal component analysis (PCA) is used to extract common components from the parameter estimates and behavioral features across tasks. When performing PCA, were each parameter estimate and behavioral feature standardized so that the variance would be 1? There was no mention about this. It seems that otherwise the principal components would be loaded toward the features with larger variance. In addition, Moutoussis et al. (Neuron, 2021, 109 (12), 2025-2040) conducted a similar analysis of behavioral parameters of various decision-making tasks, but they used factor analysis instead of PCA. Although the authors briefly mentioned factor analysis, it would be better if they also mentioned the reason why they used PCA instead of factor analysis, which can consider unique variances.

      To answer the reviewer's first question: We indeed standardized all features before performing the PCA. We apologize for omitting this information and have now added a corresponding sentence to the methods section.
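
      The reviewer's concern about unstandardized features can be illustrated with a short sketch. This is not the study's pipeline: the `pca` helper, its SVD-based implementation, and the toy features are assumptions chosen only to show how a single high-variance feature can dominate the first component when features are not scaled to unit variance.

```python
import numpy as np

def pca(X, standardize=True):
    """PCA via SVD. Returns (components, explained_variance_ratio).

    Rows of `components` are the principal axes (feature loadings).
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                    # center every feature
    if standardize:
        X = X / X.std(axis=0, ddof=1)         # scale to unit variance
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt, explained

# Toy data (NOT from the study): feature 0 is unrelated to the others but
# has a much larger scale; features 1 and 2 share a common source.
rng = np.random.default_rng(1)
n = 500
shared = rng.normal(size=n)
X = np.column_stack([
    100.0 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
])

pcs_raw, _ = pca(X, standardize=False)
pcs_std, _ = pca(X, standardize=True)
print(np.abs(pcs_raw[0]))   # PC1 loads almost only on feature 0
print(np.abs(pcs_std[0]))   # PC1 loads mainly on the two correlated features
```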

      We also thank the reviewer for the mentioned reference, which is very relevant to our findings and can help explain the roles of different PCs. Like in our study, Moutoussis et al. found a first PC that captured variability in task performance, and subsequent PCs that captured task contrasts. We added the following paragraph to our manuscript:

      “PC1 therefore captured a range of "good", task-engaged behaviors, likely related to the construct of "decision acuity" [...]. Like our PC1, decision acuity was the first component of a factor analysis (variant of PCA) conducted on 32 decision-making measures on 830 young people, and separated good and bad performance indices. Decision acuity reflects generic decision-making ability, and predicted mental health factors, was reflected in resting-state functional connectivity, but was distinct from IQ [...].”

      To answer the reviewer's question about PCA versus FA, both approaches are relatively similar conceptually, and oftentimes share the majority of the analysis pipeline in practice. The main difference is that PCA breaks up the existing variance in a dataset in a new way (based on PCs rather than the original data features), whereas FA aims to identify an underlying model of latent factors that explain the observable features. This means that PCs are linear combinations of the original data features, whereas Factors are latent factors that give rise to the observable features of the dataset with some noise, i.e., including an additional error term.

      However, in practice, both methods share the majority of computation in the way they are implemented in most standard statistical packages: FA is usually performed by conducting a PCA and then rotating the resulting solution, most commonly using the Varimax rotation, which maximizes the variance between feature loadings on each factor in order to make the result more interpretable, thereby foregoing the optimal solution that has been achieved by the PCA (which lacks the error term). Maximum variance in feature loadings means that as many features as possible will have loadings close to 0 and 1 on each factor, reducing the number of features that need to be taken into account when interpreting this factor. Most relevant in our situation is that PCA is usually a special case of FA, with the only difference that the solution is not rotated for maximum interpretability. (Note that this rotation can be minor if feature loadings already show large variance in the PCA solution.)
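
      The rotation step described above can be sketched as follows. This is a commonly used SVD-based iteration for the Varimax criterion, written out here for illustration; it is not the code used in the study, statistical packages differ in their exact implementations, and the function names and toy loadings are assumptions.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.sum(np.var(np.asarray(loadings) ** 2, axis=0)))

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal Varimax rotation. Returns (rotated_loadings, rotation)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient-like target matrix of the Varimax criterion.
        target = Lr**3 - Lr @ np.diag(np.sum(Lr**2, axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        R = u @ vt                        # nearest orthogonal matrix
        d_new = s.sum()
        if d != 0 and d_new < d * (1 + tol):
            break
        d = d_new
    return L @ R, R

# Toy loadings (NOT from the study): a clean two-factor structure that has
# been rotated away from its interpretable orientation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.6], [0.0, 0.8], [0.0, 0.7]])
theta = 0.5  # radians
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
start = simple @ rot

rotated, R = varimax(start)
print(varimax_criterion(start), varimax_criterion(rotated))
```

      The rotation leaves each feature's communality (row sum of squared loadings) unchanged and only redistributes the loadings across factors, which is why the rotated and unrotated solutions describe the same subspace.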

      To determine how much our results would change in practice if we used FA instead of PCA, we repeated the analysis using FA. Both are shown side-by-side below, and the results are quite similar:

      We therefore conclude that our specific results are robust to the choice of method used, and that there is reason to believe that our PC1 is related to Moutoussis et al.’s F1 despite the differences in method.

      Reviewer #2 (Public Review):

      I am enthusiastic about the comprehensive approach, the thorough analysis, and the intriguing findings. This work makes a timely contribution to the field and warrants a wider discussion in the community about how computational methods are deployed and interpreted. The paper is also a great and rare example of how much can be learned from going beyond a meta-analytic approach to systematically collect data that assess commonly held assumptions in the field, in this case in a large data-driven study across multiple tasks. My only criticism is that at times, the paper misses opportunities to be more constructive in pinning down exactly why authors observe inconsistencies in parameter fits and interpretation. And the somewhat pessimistic outlook relies on some results that are, in my view at least, somewhat expected based on what we know about human RL. Below I summarize the major ways in which the paper's conclusions could be strengthened.

      One key point the authors make concerns the generalizability of absolute vs. relative parameter values. It seems that at least in the parameter space defined by +LRs and exploration/noise (which are known to be mathematically coupled), subjects clustered similarly for tasks A and C. In other words, as the authors state, "both learning rate and inverse temperature generalized in terms of the relationships they captured between participants". This struck me as a more positive and important result than it was made out to be in the paper, for several reasons:

      • As authors point out in the discussion, a large literature on variable LRs has shown that people adapt their learning rates trial-by-trial to the reward function of the environment; given this, and given that all models tested in this work have fixed learning rates, while the three tasks vary on the reward function, the comparison of absolute values seems a bit like a red-herring.

      We thank the reviewers for this recommendation and have reworked the paper substantially to address the issue. We have modified the highlights, abstract, introduction, discussion, conclusion, and relevant parts of the results section to provide equal weight to the successes and failures of generalization.

      Highlights:

      ● “RL decision noise/exploration parameters generalize in terms of between-participant variation, showing similar age trajectories across tasks.”

      ● “These findings are in accordance with previous claims about the developmental trajectory of decision noise/exploration parameters.”

      Abstract:

      ● “We found that some parameters (exploration / decision noise) showed significant generalization: they followed similar developmental trajectories, and were reciprocally predictive between tasks.”

      The introduction now introduces different potential outcomes of our study with more equal weight:

      “Computational modeling enables researchers to condense rich behavioral datasets into simple, falsifiable models (e.g., RL) and fitted model parameters (e.g., learning rate, decision temperature) [...]. These models and parameters are often interpreted as a reflection of ("window into") cognitive and/or neural processes, with the ability to dissect these processes into specific, unique components, and to measure participants' inherent characteristics along these components.

      For example, RL models have been praised for their ability to separate the decision making process into value updating and choice selection stages, allowing for the separate investigation of each dimension. Crucially, many current research practices are firmly based on these (often implicit) assumptions, which give rise to the expectation that parameters have a task- and model-independent interpretation and will seamlessly generalize between studies. However, there is growing---though indirect---evidence that these assumptions might not (or not always) be valid.

      The following section lays out existing evidence in favor and in opposition of model generalizability and interpretability. Building on our previous opinion piece, which---based on a review of published studies---argued that there is less evidence for model generalizability and interpretability than expected based on current research practices [...], this study seeks to directly address the matter empirically.”

      We now also provide more even evidence for both potential outcomes:

      “Many current research practices are implicitly based on the interpretability and generalizability of computational model parameters (despite the fact that many researchers explicitly distance themselves from these assumptions). For our purposes, we define a model variable (e.g., fitted parameter, reward-prediction error) as generalizable if it is consistent across uses, such that a person would be characterized with the same values independent of the specific model or task used to estimate the variable. Generalizability is a consequence of the assumption that parameters are intrinsic to participants rather than task dependent (e.g., a high learning rate is a personal characteristic that might reflect an individual's unique brain structure). One example of our implicit assumptions about generalizability is the fact that we often directly compare model parameters between studies---e.g., comparing our findings related to learning-rate parameters to a previous study's findings related to learning-rate parameters. Note that such a comparison is only valid if parameters capture the same underlying constructs across studies, tasks, and model variations, i.e., if parameters generalize. The literature has implicitly equated parameters in this way in review articles [...], meta-analyses [...], and also most empirical papers, by relating parameter-specific findings across studies. We also implicitly evoke parameter generalizability when we study task-independent empirical parameter priors [...], or task-independent parameter relationships (e.g., interplay between different kinds of learning rates [...]), because we presuppose that parameter settings are inherent to participants, rather than task specific.

      We define a model variable as interpretable if it isolates specific and unique cognitive elements, and/or is implemented in separable and unique neural substrates. Interpretability follows from the assumption that the decomposition of behavior into model parameters "carves cognition at its joints", and provides fundamental, meaningful, and factual components (e.g., separating value updating from decision making). We implicitly invoke interpretability when we tie model variables to neural substrates in a task-general way (e.g., reward prediction errors to dopamine function [...]), or when we use parameters as markers of psychiatric conditions (e.g., working-memory parameter and schizophrenia [...]). Interpretability is also required when we relate abstract parameters to aspects of real-world decision making [...], and generally, when we assume that model variables are particularly "theoretically meaningful" [...].

      However, amid the growing recognition of computational modeling, the focus has also shifted toward inconsistencies and apparent contradictions in the emerging literature, which are becoming apparent in cognitive [...], developmental [...], clinical [...], and neuroscience studies [...], and have recently become the focus of targeted investigations [...]. For example, some developmental studies have shown that learning rates increased with age [...], whereas others have shown that they decrease [...]. Yet others have reported U-shaped trajectories with either peaks [...] or troughs [...] during adolescence, or stability within this age range [...] (for a comprehensive review, see [...]; for specific examples, see [...]). This is just one striking example of inconsistencies in the cognitive modeling literature, and many more exist [...]. These inconsistencies could signify that computational modeling is fundamentally flawed or inappropriate to answer our research questions. Alternatively, inconsistencies could signify that the method is valid, but our current implementations are inappropriate [...]. However, we hypothesize that inconsistencies can also arise for a third reason: Even if both method and implementation are appropriate, inconsistencies like the ones above are expected---and not a sign of failure---if implicit assumptions of generalizability and interpretability are not always valid. For example, model parameters might be more context-dependent and less person-specific than we often appreciate [...]”

      In the results section, we now give more emphasis to findings that are compatible with generalization: “For α+, adding task as a predictor did not improve model fit, suggesting that α+ showed similar age trajectories across tasks (Table 2). Indeed, α+ showed a linear increase that tapered off with age in all tasks (linear increase: task A: β = 0.33, p < 0.001; task B: β = 0.052, p < 0.001; task C: β = 0.28, p < 0.001; quadratic modulation: task A: β = −0.007, p < 0.001; task B: β = −0.001, p < 0.001; task C: β = −0.006, p < 0.001). For noise/exploration and Forgetting parameters, adding task as a predictor also did not improve model fit (Table 2), suggesting similar age trajectories across tasks.”

      “For both α+ and noise/exploration parameters, task A predicted tasks B and C, and tasks B and C predicted task A, but tasks B and C did not predict each other (Table 4; Fig. 2D), reminiscent of the correlation results that suggested successful generalization (section 2.1.2).”

      “Noise/exploration and α+ showed similar age trajectories (Fig. 2C) in tasks that were sufficiently similar (Fig. 2D).” And with respect to our simulation analysis (for details, see next section):

      “These results show that our method reliably detected parameter generalization in a dataset that exhibited generalization.”

      We also now provide more nuance in our discussion of the findings:

      “Both generalizability [...] and interpretability (i.e., the inherent "meaningfulness" of parameters) [...] have been explicitly stated as advantages of computational modeling, and many implicit research practices (e.g., comparing parameter-specific findings between studies) showcase our conviction in them [...]. However, RL model generalizability and interpretability have so far eluded investigation, and growing inconsistencies in the literature potentially cast doubt on these assumptions. It is hence unclear whether, to what degree, and under which circumstances we should assume generalizability and interpretability. Our developmental, within-participant study revealed a nuanced picture: Generalizability and interpretability differed from each other, between parameters, and between tasks.”

      “Exploration/noise parameters showed considerable generalizability in the form of correlated variance and age trajectories. Furthermore, the decline in exploration/noise we observed between ages 8-17 was consistent with previous studies [13, 66, 67], revealing consistency across tasks, models, and research groups that supports the generalizability of exploration/noise parameters. However, for 2/3 pairs of tasks, the degree of generalization was significantly below the level expected under perfect generalization. Interpretability of exploration/noise parameters was mixed: Despite evidence for specificity in some cases (overlap in parameter variance between tasks), it was missing in others (lack of overlap), and crucially, parameters lacked distinctiveness (substantial overlap in variance with other parameters).”

      “Taken together, our study confirms the patterns of generalizable exploration/noise parameters and task-specific learning rate parameters that are emerging from the literature [13].”

      • Regarding the relative inferred values, it's unclear how high we really expect correlations between the same parameter across tasks to be. E.g., if we take Task A and make a second, hypothetical, Task B by varying one feature at a time (say, stochasticity in reward function), how correlated are the fitted LRs going to be? Given the different sources of noise in the generative model of each task and in participant behavior, it is hard to know whether a correlation coefficient of 0.2 is "good enough" generalizability.

      We thank the reviewer for this excellent suggestion, which we think helped answer a central question that our previous analyses had failed to address, and also provided answers to several other concerns raised by both reviewers in other sections. We have conducted these additional analyses as suggested: simulating artificial behavioral data for each task, fitting these data using the models used in humans, repeating the analyses performed on humans on the newly fitted parameters, and using bootstrapping to statistically compare humans to the resulting ceiling of generalization. We have added the following section to our paper, which describes the results in detail:

      “Our analyses so far suggest that some parameters did not generalize between tasks, given differences in age trajectories (section 2.1.3) and a lack of mutual prediction (section 2.1.4). However, the lack of correspondence could also arise due to other factors, including behavioral noise, noise in parameter fitting, and parameter trade-offs within tasks. To rule these out, we next established the ceiling of generalizability attainable using our method.

      We established the ceiling in the following way: We first created a dataset with perfect generalizability, simulating behavior from agents that use the same parameters across all tasks (suppl. Appendix E). We then fitted this dataset in the same way as the human dataset (e.g., using the same models), and performed the same analyses on the fitted parameters, including an assessment of age trajectories (suppl. Table E.8) and prediction between tasks (suppl. Tables E.9, E.10, and E.11). These results provide the practical ceiling of generalizability. We then compared the human results to this ceiling to ensure that the apparent lack of generalization was valid (significant difference between humans and ceiling), and not in accordance with generalization (lack of difference between humans and ceiling).

      Whereas humans had shown divergent trajectories for parameter alpha- (Fig. 2B; Table 1), the simulated agents did not show task differences for alpha- or any other parameter (suppl. Fig. E.8B; suppl. Table E.8), even when controlling for age (suppl. Tables E.9 and E.10), as expected from a dataset of generalizing agents. Furthermore, the same parameters were predictive between tasks in all cases (suppl. Table E.11). These results show that our method reliably detected parameter generalization in a dataset that exhibited generalization.

      Lastly, we established whether the degree of generalization in humans was significantly different from agents. To this aim, we calculated the Spearman correlations between each pair of tasks for each parameter, for both humans (section 2.1.2; suppl. Fig. H.9) and agents, and compared both using bootstrapped confidence intervals (suppl. Appendix E). Human parameter correlations were significantly below the ceiling for all parameters except alpha+ (A vs B) and epsilon / 1/beta (A vs C; suppl. Fig. E.8C). This suggests that humans were within the range of maximally detectable generalization in two cases, but showed less-than-perfect generalization between other task combinations and for parameters Forgetting and alpha-.”
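      The comparison between human and simulated-agent correlations described above can be sketched as follows. This is a minimal illustration, not the procedure of suppl. Appendix E: the function names, the independent resampling of humans and agents, and the percentile interval are all assumptions made for the example.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)       # highest ordinal rank in each tie group
    first = last - counts + 1      # lowest ordinal rank in each tie group
    return ((first + last) / 2.0)[inverse]

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

def bootstrap_spearman_gap(human_a, human_b, agent_a, agent_b,
                           n_boot=2000, ci=0.95, seed=0):
    """Percentile CI for (agent correlation - human correlation).

    human_a / human_b: one parameter fitted in two tasks, per participant.
    agent_a / agent_b: the same parameter fitted to simulated agents.
    Participants and agents are resampled independently (an assumption).
    """
    rng = np.random.default_rng(seed)
    human_a, human_b = np.asarray(human_a), np.asarray(human_b)
    agent_a, agent_b = np.asarray(agent_a), np.asarray(agent_b)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        h = rng.integers(0, len(human_a), len(human_a))
        g = rng.integers(0, len(agent_a), len(agent_a))
        gaps[i] = (spearman(agent_a[g], agent_b[g])
                   - spearman(human_a[h], human_b[h]))
    tail = (1.0 - ci) / 2.0
    lower, upper = np.quantile(gaps, [tail, 1.0 - tail])
    return lower, upper
```

      Under these assumptions, an interval that excludes zero would be read as the human correlation falling below the simulated ceiling, and an interval that includes zero as being statistically indistinguishable from it.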

      • The +LR/inverse temp relationship seems to generalize best between tasks A/C, but not B/C, a common theme in the paper. This does not seem surprising given that in A and C there is a key additional task feature over the bandit task in B -- which is the need to retain state-action associations. Whether captured via F (forgetting) or K (WM capacity), the cognitive processes involved in this learning might interact with LR/exploration in a different way than in a task where this may not be necessary.

      We thank the reviewer for this comment, which raises an important issue. We are adding the specific pairwise correlations and scatter plots for the pairs of parameters the reviewer asked about below (“bf_alpha” = LR task A; “bf_forget” = F task A; “rl_forget” = F task C; “rl_log_alpha” = LR task C; “rl_K” = WM capacity task C):

      Within tasks:

      Between tasks:

      To answer the question in more detail, we have expanded our section about limitations stemming from parameter tradeoffs in the following way:

      “One limitation of our results is that regression analyses might be contaminated by parameter cross-correlations (sections 2.1.2, 2.1.3, 2.1.4), which would reflect modeling limitations (non-orthogonal parameters), and not necessarily shared cognitive processes. For example, parameters alpha and beta are mathematically related in the regular RL modeling framework, and we observed significant within-task correlations between these parameters for two of our three tasks (suppl. Fig. H.10, H.11). This indicates that caution is required when interpreting correlation results. However, correlations were also present between tasks (suppl. Fig. H.9, H.11), suggesting that within-model trade-offs were not the only explanation for shared variance, and that shared cognitive processes likely also played a role.

      Another issue might arise if such parameter cross-correlations differ between models, due to the differences in model parameterizations across tasks. For example, memory-related parameters (e.g., F, K in models A and C) might interact with learning- and choice-related parameters (e.g., alpha+, alpha-, noise/exploration), but such an interaction is missing in models that do not contain memory-related parameters (e.g., task B). If this is indeed the case, i.e., parameters trade off with each other in different ways across tasks, then a lack of correlation between tasks might not reflect a lack of generalization, but just the differences in model parameterizations. Suppl. Fig. \ref{figure:S2AlphaBetaCorrelations} indeed shows significant, medium-sized, positive and negative correlations between several pairs of Forgetting, memory-related, learning-related, and exploration parameters (though with relatively small effect sizes; Spearman correlation: 0.17 < |r| < 0.22).

      The existence of these correlations (and differences in correlations between tasks) suggests that memory parameters likely traded off with each other, as well as with other parameters, which potentially affected generalizability across tasks. However, some of the observed correlations might be due to shared causes, such as a common reliance on age, and the regression analyses in the main paper control for these additional sources of variance, and might provide a cleaner picture of how much variance is actually shared between parameters.

      Furthermore, correlations between parameters within models are frequent in the existing literature, and do not prevent researchers from interpreting parameters---in this sense, the existence of similar correlations in our study allows us to address the question of generalizability and interpretability in similar circumstances as in the existing literature.”
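      The statement that alpha and beta are mathematically related can be illustrated with a textbook delta-rule learner and softmax choice rule. This is an assumed generic form, not necessarily the exact models fitted to tasks A, B, and C. Starting from zero values, one rewarded trial gives Q = alpha * reward, so the next choice probability depends only on the product alpha * beta.

```python
import numpy as np

def softmax(q, beta):
    """Choice probabilities for action values q with inverse temperature beta."""
    z = beta * (q - np.max(q))     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def delta_rule_update(q, action, reward, alpha):
    """Return a copy of q after one prediction-error update with learning rate alpha."""
    q = q.copy()
    q[action] += alpha * (reward - q[action])
    return q

def choice_probs_after_rewards(alpha, beta, n_rewards):
    """Choice probabilities after n_rewards rewarded choices of action 0."""
    q = np.zeros(2)
    for _ in range(n_rewards):
        q = delta_rule_update(q, action=0, reward=1.0, alpha=alpha)
    return softmax(q, beta)

# Two parameter settings with the same product alpha * beta = 2
p_slow_sharp = choice_probs_after_rewards(alpha=0.2, beta=10.0, n_rewards=1)
p_fast_flat = choice_probs_after_rewards(alpha=0.4, beta=5.0, n_rewards=1)
```

      After a single rewarded trial the two settings predict identical choice probabilities. They diverge as learning proceeds, which is one way to see why the two parameters can trade off partially, rather than fully, during fitting.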

      • More generally, isn't relative generalizability the best we would expect given systematic variation in task context? I agree with the authors' point that the language used in the literature sometimes implies an assumption of absolute generalizability (e.g. same LR across any task). But parameter fits, interactions, and group differences are usually interpreted in light of a single task+model paradigm, precisely b/c tasks vary widely across critical features that will dictate whether different algorithms are optimal or not and whether cognitive functions such as WM or attention may compensate for ways in which humans are not optimal. Maybe a more constructive approach would be to decompose tasks along theoretically meaningful features of the underlying Markov Decision Process (which gives a generative model), and be precise about (1) which features we expect will engage additional cognitive mechanisms, and (2) how these mechanisms are reflected in model parameters.

      We thank the reviewer for this comment, and will address both points in turn:

      (1) We agree with the reviewer's sentiment about relative generalizability: If we all interpreted our models exclusively with respect to our specific task design, and never expected our results to generalize to other tasks or models, there would not be a problem. However, the current literature shows a different pattern: Literature reviews, meta-analyses, and discussion sections of empirical papers regularly compare specific findings between studies. We compare specific parameter values (e.g., empirical parameter priors), parameter trajectories over age, relationships between different parameters (e.g., balance between LR+ and LR-), associations between parameters and clinical symptoms, and between model variables and neural measures on a regular basis. The goal of this paper was really to see if and to what degree this practice is warranted. And the reviewer rightfully alerted us to the fact that our data imply that these assumptions might be valid in some cases, just not in others.

      (2) With regard to providing task descriptions that relate to the MDP framework, we have included the following sentence in the discussion section:

      “Our results show that discrepancies are expected even with a consistent methodological pipeline, and using up-to-date modeling techniques, because they are an expected consequence of variations in experimental tasks and computational models (together called "context"). Future research needs to investigate these context factors in more detail. For example, which task characteristics determine which parameters will generalize and which will not, and to what extent? Does context impact whether parameters capture overlapping versus distinct variance? A large-scale study could answer these questions by systematically covering the space of possible tasks, and reporting the relationships between parameter generalizability and distance between tasks. To determine the distance between tasks, the MDP framework might be especially useful because it decomposes tasks along theoretically meaningful features of the underlying Markov Decision Process.”

      Another point that merits more attention is that the paper pretty clearly commits to each model as being the best possible model for its respective task. This is a necessary premise, as otherwise, it wouldn't be possible to say with certainty that individual parameters are well estimated. I would find the paper more convincing if the authors include additional information and analysis showing that this is actually the case.

      We agree with the sentiment that all models should fit their respective task equally well. However, there is no good quantitative measure of model fit that is comparable across tasks and models - for example, because of the difference in difficulty between the tasks, the number of choices explained would not be a valid measure to compare how well the models are doing across tasks. To address this issue, we have added the new supplemental section (Appendix C) mentioned above that includes information about the set of models compared, and explains why we have reason to believe that all models fit (equally) well. We also created the new supplemental Figure D.7 shown above, which directly compares human and simulated model behavior in each task, and shows a close correspondence for all tasks. Because the quality of all our models was a major concern for us in this research, we also refer the reviewer and other readers to the three original publications that describe all our modeling efforts in much more detail, and hopefully convince the reviewer that our model fitting was performed according to high standards.

      I am particularly interested to see whether some of the discrepancies in parameter fits can be explained by the fact that the model for Task A did not account for explicit WM processes, even though (1) Task A is similar to Task C (Task A can be seen as a single condition of Task C with 4 states and 2 possible visible actions, and stochastic rather than deterministic feedback) and (2) prior work has suggested a role for explicit memory of single episodes even in stateless bandit tasks such as Task B.

      We appreciate this very thoughtful question, which raises several important issues. (1) As the reviewer said, the models for task A and task C are relatively different even though the underlying tasks are relatively similar (minus the differences the reviewer already mentioned, in terms of visibility of actions, number of actions, and feedback stochasticity). (2) We also agree that the model for task C did not include episodic memory processes even though episodic memory likely played a role in this task, and agree that neither the forgetting parameters in tasks A and C, nor the noise/exploration parameters in tasks A, B, and C are likely specific enough to capture all the memory / exploration processes participants exhibited in these tasks.

      However, this problem is difficult to solve: We cannot fit an episodic-memory model to task B because the task lacks an episodic-memory manipulation (such as, e.g., in Bornstein et al., 2017), and we cannot fit a WM model to task A because it lacks the critical set-size manipulation enabling identification of the WM component (modifying set size allows the model to identify individual participants’ WM capacities, so the issue cannot be avoided in tasks with only one set size). Similarly, we cannot model more specific forgetting or exploration processes in our tasks because they were not designed to dissociate these processes. If we tried fitting more complex models that include these processes to these tasks, they would most likely lose in model comparison because the increased complexity would not lead to additional explained behavioral variance, given that the tasks do not elicit the relevant behavioral patterns. Because the models therefore do not specify all the cognitive processes that participants likely employ, the situation described by the reviewer arises, namely that different parameters sometimes capture the same cognitive processes across tasks and models, while the same parameters sometimes capture different processes.

      And while the reviewer focussed largely on memory-related processes, the issue of course extends much further: Besides WM, episodic memory, and more specific aspects of forgetting and exploration, our models also did not take into account a range of other processes that participants likely engaged in when performing the tasks, including attention (selectivity, lapses), reasoning / inference, mental models (creation and use), prediction / planning, hypothesis testing, etc., etc. In full agreement with the reviewer’s sentiment, we recently argued that this situation is ubiquitous to computational modeling, and should be considered very carefully by all modelers because it can have a large impact on model interpretation (Eckstein et al., 2021).

      If we assume that many more cognitive processes are likely engaged in each task than are modeled, and consider that every computational model includes just a small number of free parameters, parameters then necessarily reflect a multitude of cognitive processes. The situation is additionally exacerbated by the fact that more complex models become increasingly difficult to fit from a methodological perspective, and that current laboratory tasks are designed in a highly controlled and consequently relatively simplistic way that does not lend itself to testing a variety of cognitive processes simultaneously.

      The best way to deal with this situation, we think, is to recognize that in different contexts (e.g., different tasks, different computational models, different subject populations), the same parameters can capture different behaviors, and different parameters can capture the same behaviors, for the reasons the reviewer lays out. Recognizing this helps to avoid misinterpreting modeling results, for example by focusing our interpretation of model parameters to our specific task and model, rather than aiming to generalize across multiple tasks. We think that recognizing this fact also helps us understand the factors that determine whether parameters will capture the same or different processes across contexts and whether they will generalize. This is why we estimated here whether different parameters generalize to different degrees, which other factors affect generalizability, etc. Knowing the practical consequences of using the kinds of models we currently use will therefore hopefully provide a first step in resolving the issues the reviewer laid out.

      It is interesting that one of the parameters that generalizes least is LR-. The authors make a compelling case that this is related to a "lose-stay" behavior that benefits participants in Task B but not in Task C, which makes sense given the probabilistic vs deterministic reward function. I wondered if we can rule out the alternative explanation that in Task C, LR- could reflect a different interpretation of instructions vis-à-vis what rewards indicate - do the authors have an instruction check measure in either task that can be correlated with this "lose-stay" behavior and with LR-? And what does the "lose-stay" distribution look like, for Task C at least? I basically wonder if some of these inconsistencies can be explained by participants having diverging interpretations of the deterministic nature of the reward feedback in Task C. The order of tasks might matter here as well -- was task order the same across participants? It could be that due to the within-subject design, some participants may have persisted in global strategies that are optimal in Task B, but sub-optimal in Task C.

      The PCA analysis adds an interesting angle and a novel, useful lens through which we can understand divergence in what parameters capture across different tasks. One observation is that loadings for PC2 and PC3 are strikingly consistent for Task C, so it looks more like these PCs encode a pairwise contrast (PC2 is C with B and PC3 is C with A), primarily reflecting variability in performance - e.g. participants who did poorly on Task C but well on Task B (PC2) or Task A (PC3). Is it possible to disentangle this interpretation from the one in the paper? It also is striking that in addition to performance, the PCs recover the difference in terms of LR- on Task B, which again supports the possibility that LR- divergence might be due to how participants handle probabilistic vs. deterministic feedback.

      We appreciate this positive evaluation of our PCA and are glad that it could provide a useful lens for understanding parameters. We also agree with the reviewer's observation that PC2 and PC3 reflect task contrasts (PC2: task B vs task C; PC3: task A vs task C), and phrase it in the following way in the paper:

      “PC2 contrasted task B to task C (loadings were positive / negative / near-zero for corresponding features of tasks B / C / A; Fig. 3B). PC3 contrasted task A to both B and C (loadings were positive / negative for corresponding features on task A / tasks B and C; Fig. 3C).”

      Hence, the only difference between our interpretation and the reviewer’s seems to be whether PC3 contrasts task C to task B as well as task A, or just to task A. Our interpretation is supported by the fact that loadings for tasks A and C are quite similar on PC3; however, both interpretations seem appropriate.

      We also appreciate the reviewer's positive evaluation of the fact that the PCA reproduces the differences in LR-, and its relationship to probabilistic/deterministic feedback. The following section reiterates this idea:

      “alpha- loaded positively in task C, but negatively in task B, suggesting that performance increased when participants integrated negative feedback faster in task C, but performance decreased when they did the same in task B. As mentioned before, contradictory patterns of alpha- were likely related to task demands: The fact that negative feedback was diagnostic in task C likely favored fast integration of negative feedback, while the fact that negative feedback was not diagnostic in task B likely favored slower integration (Fig. 1E). This interpretation is supported by behavioral findings: "Lose-stay" behavior (repeating choices that produce negative feedback) showed the same contrasting pattern as alpha- on PC1. It loaded positively in task B, showing Lose-stay behavior benefited performance, but it loaded negatively on task C, showing that it hurt performance (Fig. 3A). This supports the claim that lower alpha- was beneficial in task B, while higher alpha- was beneficial in task C, in accordance with participant behavior and developmental differences.”
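      For readers who want to compute the "Lose-stay" measure referred to above, a minimal sketch follows. It assumes that consecutive trials offer the same set of choices and that non-positive reward counts as negative feedback; the function name and these coding conventions are illustrative assumptions, not the paper's analysis code. In tasks with several stimuli or states, the computation would need to be done separately for each stimulus.

```python
import numpy as np

def lose_stay_rate(choices, rewards):
    """Proportion of negative-feedback trials followed by a repeat of the same choice.

    choices: sequence of chosen actions, one per trial.
    rewards: sequence of outcomes; values <= 0 are treated as negative
             feedback (an assumption about how outcomes are coded).
    Returns NaN when no trial (other than the last) had negative feedback.
    """
    choices = np.asarray(choices)
    rewards = np.asarray(rewards, dtype=float)
    lost = rewards[:-1] <= 0                 # negative feedback on trial t
    stayed = choices[1:] == choices[:-1]     # same choice on trial t + 1
    if not lost.any():
        return float("nan")
    return float(stayed[lost].mean())
```

      A participant-level rate computed this way could then be related to the fitted alpha- values, in the spirit of the reviewer's question about the distribution of "lose-stay" behavior.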

    1. Author Response

      Reviewer #1 (Public Review):

      Bice et al. present new work using an optogenetics-based stimulation to test how this affects stroke recovery in mice. Namely, can they determine if contralateral stimulation of S1 would enhance or hinder recovery after a stroke? The study provides interesting evidence that this stimulation may be harmful, and not helpful. They found that contralesional optogenetic-based excitation suppressed perilesional S1FP remapping, and this caused abnormal patterns of evoked activity in the unaffected limb. They applied a network analysis framework and found that stimulation prevented the restoration of resting-state functional connectivity within the S1FP network, and resulted in limb-use asymmetry in the mice. I think it's an important finding. My suggestions for improvement revolve around quantitative analysis of the behavior, but the experiments are otherwise convincing and important.

      Thank you for the positive feedback regarding our work.

      Other comments - Data and paper presentation:

      1) Figure 1A is misleading; it appears as if optogenetic stimulation is constant (which indeed would be detrimental to the tissue). Also, the atlas map overlaps color-wise with conditions; at a glance it looks like the posterior cortex might be stimulated; consider making greyscale?

      We have updated Figure 1A to address these concerns.

      Reviewer #2 (Public Review):

      These studies test the effect of stimulation of the contralateral somatosensory cortex on recovery, evoked responses, functional interconnectivity and gene expression in a somatosensory cortex stroke. Using transgenic mice with ChR2 in excitatory neurons, these neurons are stimulated in somatosensory cortex from day 1 after stroke to 4 weeks. This stimulation is fairly brief: 3min/day. Mice then received behavioral analysis, electrical forepaw stimulation and optical intrinsic signal mapping, and resting state MRI. The core finding is that this ChR2 stimulation of excitatory neurons in contralateral somatosensory cortex impairs recovery, evoked activity and interconnectivity of contralateral (to the stimulation, ipsilateral to the stroke) cortex in this localized stroke model. This is a surprising result, and resonates with some clinical findings, and a robust clinical discussion, on the role of the contralateral cortex in recovery. This manuscript addresses several important topics. The issue of brain stimulation and alterations in brain activity that the studies explore are also part of human brain stimulation protocols, and pre-clinical studies. The finding that contralateral stimulation inhibits recovery and functional circuit remapping is an important one. The rsMRI analysis is sophisticated.

      Thank you for the supportive comments regarding our manuscript.

      Concerns:

      1) The gene expression data is to be expected. Stimulation of the brain in almost any context alters the expression of genes.

      We agree with the reviewer that stimulation of the brain is expected to broadly alter gene expression. However, in this set of studies, we examined a subset of genes that are of particular interest in neuroplasticity, and compared expression in ipsi-lesional vs. contra-lesional cortex in the presence or absence of contralesional stimulation during the post stroke recovery period. Genes like Arc, for example, have been shown by our group to be necessary for perilesional plasticity and recovery (Kraft, et al., Science Translational Medicine, 2018). The finding that validated plasticity genes are suppressed by contralesional stimulation is consistent with the central finding that contralesional stimulation suppresses the recovery of normal patterns of brain organization and activity. Importantly, there were also genes associated with spontaneous recovery that were unaltered or increased by contra-lesional brain stimulation. While these data do not provide causal associations, they may prove to be useful for developing hypotheses regarding molecular mechanisms involved in spontaneous brain repair for future studies.

      In light of the reviewer’s comment, we have altered text throughout to not focus on specific directionality of transcripts. Instead, we indicate that relevant transcript changes are those that are altered in association with spontaneous recovery, and which are altered in the opposite direction with contralesional brain stimulation.

      Minor points.

      1) Was the behavior and the functional imaging done while the brain was being stimulated?

      We have updated the methods (page 17) to clarify that the only experiments during which the photostimulus occurred during neuroimaging are reported in new Figure 6, and to clarify that photostimulation did not occur during the behavioral tests of asymmetry.

      2) It would be useful to understand what is being stimulated. The stimulation method is not described. Is an entire cortical width of tissue stimulated, and this is what is feeding back onto the contralateral cortex? Or is this stimulation mostly affecting excitatory (CaMKII+) cells in upper or lower layers? This will be important to be able to compare to the Chen et al study that gave rise to the stimulation approach here. This gets to the issue of the circuitry that is important in recovery, or in inhibiting recovery. One might answer this question by doing the stimulation and staining tissue for immediate early gene activation, to see the circuits with evoked activity. Also, the techniques used in this study could be applied with OIS or rsMRI during stimulation, to determine the circuits that are activated.

      We have clarified the stimulation protocol in response to Essential point 2.2. Due to light scattering and appreciable attenuation of 473 nm light in brain tissue, only ~1% of photons penetrate to a depth of 600 microns. Experimentally, this provides superficial-layer specificity to Layer 2/3 Camk2a cells (https://doi.org/10.1016/j.neuron.2011.06.004).
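      As a rough illustration of what the ~1% figure implies, the arithmetic below treats attenuation as a single exponential in depth. That is a simplifying assumption made only for this sketch; light transport in scattering tissue also depends on fiber geometry and beam spread, so the numbers are indicative at best. The function names are hypothetical.

```python
import math

def effective_attenuation_per_mm(fraction_remaining, depth_mm):
    """Attenuation coefficient implied by a remaining fraction at a given depth,
    assuming purely exponential decay: fraction = exp(-mu * depth)."""
    return -math.log(fraction_remaining) / depth_mm

def fraction_at_depth(mu_per_mm, depth_mm):
    """Fraction of light remaining at depth_mm under the same assumption."""
    return math.exp(-mu_per_mm * depth_mm)

# ~1% of photons remaining at 600 microns (0.6 mm), as stated in the response
mu = effective_attenuation_per_mm(0.01, 0.6)
# Fraction remaining halfway down, at 300 microns
halfway = fraction_at_depth(mu, 0.3)
```

      Under this assumption the implied coefficient is about 7.7 per mm, and roughly 10% of the light would remain at 300 microns, which is in line with the claim that stimulation is weighted toward superficial layers.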

      To answer the question of what circuits are affecting recovery, we performed 2 sets of additional experiments – Experiment 1: OISI during photostimulation before and after photothrombosis, and Experiment 2: tissue staining for IEG expression (cFOS). We describe each below:

      Experiment 1: New results are included from 16 Camk2a-ChR2 mice (Results, pages 10-11; Methods, page 18) and reported as new Figure 6. As in the previously reported experiments, all mice were subject to photothrombosis of left S1FP; half received interventional optogenetic photostimulation beginning 1 day after photothrombosis (+Stim), while the other half recovered spontaneously (–Stim). To visualize in real time whether contralesional photostimulation differentially affected global cortical activity in these 2 groups, concurrent awake OISI during acute contralesional photostimulation was performed in +Stim and –Stim groups before photothrombosis and at 1 and 4 weeks after. At baseline, all mice exhibited focal increases in right S1FP activity during photostimulation that spread to contralateral (left) S1FP and other motor regions approximately 8-10 seconds after stimulus onset. While activity increases within the targeted circuit, subtle inhibition of cortical activity can also be observed in surrounding non-targeted cortices. Thus, activity both increases and decreases in different cortical regions during and after optogenetic stimulation of the right S1FP circuit. Of note, the decreases in activity in regions inhibited by S1FP stimulation were more pronounced in +Stim mice at 1 and 4 weeks than at baseline, and were significantly larger in +Stim mice than in –Stim mice. We conclude that focal stimulation of contralesional cortex results in significant, widespread inhibitory influences that extend well beyond the targeted circuit.

      Experiment 2: We hypothesized that IEG expression would increase in photostimulated regions, cortical regions functionally connected to targeted areas, and potentially deeper brain regions. For the IEG experiments, healthy ChR2 naïve animals (C57 mice) or CamK2a-ChR2 mice were acclimated to the head-restraint apparatus described in the manuscript used for photostimulation treatment. Once trained, awake mice were subject to the same photostimulus protocol as described in the manuscript applied to forepaw somatosensory cortex in the right hemisphere. After stimulation, mice were sacrificed, perfused, and brains were harvested for tissue slicing and immunostaining for cFOS. Tissue slices containing right and left primary forepaw somatosensory cortex and primary and secondary motor cortices (+0.5mm A/P) or visual cortex (-2.8mm A/P) were examined for cFOS staining and compared across groups.

      Below is a summary table of our findings, and representative tissue slices. While c-FOS IHC was successful, results are not consistent with expectations from the mouse strains used. Only 1 ChR2+ mouse exhibited staining patterns consistent with local S1FP photostimulation, while expression in ChR2- mice was more variable, and in some instances exhibited higher expression in targeted circuits compared to ChR2+ mice. It is possible that awake behaving mice already exhibit high activity in sensorimotor cortex at rest, which might obscure changes specific to optogenetic photostimulation. Regardless, because the tissue staining experiments were inconclusive in healthy animals, we did not proceed with further experiments in the stroke groups, and do not report these findings in the manuscript.

      3) Also, it is possible that contralateral stimulation is impairing recovery, not through an effect on the contralateral cortex (the site of the stroke), but on descending projections, or theoretically even through evoking activity or subclinical movement of the contralateral limb (ipsilateral to the stroke). By more carefully mapping the distribution of the activity of the stimulated brain region, and what exactly is being stimulated, these issues can be explored.

      The reviewer raises an excellent point. We have added to the “Limitations and Future work” section of the Discussion on pages 15-16.

    1. Author Response

      Reviewer #1 (Public Review):

      It is now widely accepted that the age of the brain can differ from the person's chronological age and neuroimaging methods are ideally suited to analyze the brain age and associated biomarkers. Preclinical studies of rodent models with appropriate neuroimaging do attest that lifestyle-related prevention approaches may help to slow down brain aging and the potential of BrainAGE as a predictor of age-related health outcomes. However, there is a paucity of data on this in humans. It is in this context the present manuscript receives its due attention.

      Comments:

      1) Lifestyle intervention benefits need to be analyzed using robust biomarkers which should be profiled non-invasively in a clinical setting. There is increasing evidence of the role of telomere length in brain aging. Gampawar et al (2020) have proposed a hypothesis on the effect of telomeres on brain structure and function over the life span and named it as the "Telomere Brain Axis". In this context, if the authors could measure telomere length before and after lifestyle intervention, this will give a strong biomarker utility and value addition for the lifestyle modification benefits. 2) Authors should also consider measuring BDNF levels before and after lifestyle intervention.

      Response to comments 1+2: we agree that associating both telomere length and BDNF level with brain age would be interesting and relevant. However, we did not measure these two variables. We would certainly consider adding these in future work. Regarding telomere length, we now include a short discussion of brain age in relation to other bodily ages, such as telomere length (Discussion section):

      “Studying changes in functional brain aging is part of a broader field that examines changes in various biological ages, such as telomere length1, DNA methylation2, and arterial stiffness3. Evaluating changes in these bodily systems over time allows us to capture health and lifestyle-related factors that affect overall aging and may guide the development of targeted interventions to reduce age-related decline. For example, in the CENTRAL cohort, we recently reported that reducing body weight and intrahepatic fat following a lifestyle intervention was related to methylation age attenuation4. In the current work, we used RSFC for brain age estimation, which resulted in a MAE of ~8 years, which was larger than the intervention period. Nevertheless, we found that brain age attenuation was associated with changes in multiple health factors. The precision of an age prediction model based on RSFC is typically lower than a model based on structural brain imaging5. However, a higher model precision may result in a lower sensitivity to detect clinical effects6,7. Better tools for data harmonization among datasets6 and larger training sample size5 may improve the accuracy of such models in the future. We also suggest that examining the dynamics of multiple bodily ages and their interactions would enhance our understanding of the complex aging process8,9.”

      And

      “These findings complement the growing interest in bodily aging indicated, for example, by DNA methylation4 as health biomarkers and interventions that may affect them.”

      Reviewer #2 (Public Review):

      In this study, Levakov et al. investigated brain age based on resting-state functional connectivity (RSFC) in a group of obese participants following an 18-month lifestyle intervention. The study benefits from various sophisticated measurements of overall health, including body MRI and blood biomarkers. Although the data is leveraged from a solid randomized control set-up, the lack of control groups in the current study means that the results cannot be attributed to the lifestyle intervention with certainty. However, the study does show a relationship between general weight loss and RSFC-based brain age estimations over the course of the intervention. While this may represent an important contribution to the literature, the RSFC-based brain age prediction shows low model performance, making it difficult to interpret the validity of the derived estimates and the scale of change. The study would benefit from more rigorous analyses and a more critical discussion of findings. If incorporated, the study contributes to the growing field of literature indicating that weight-reduction in obese subjects may attenuate the detrimental effect of obesity on the brain.

      The following points may be addressed to improve the study:

      Brain age / model performance:

      1) Figure 2: In the test set, the correlation between true and predicted age is 0.244. The fitted slope looks like it would be approximately 0.11 (55-50)/(80-35); change in y divided by change in x. This means that for a chronological age change of 12 months, the brain age changes by 0.11*12 = 1.3 months. I.e., due to the relatively poor model performance, an 80-year-old participant in the plot (fig 2) has a predicted age of ~55. Hence, although the age prediction step can generate a summary score for all the RSFC data, it can be difficult to interpret the meaning of these brain age estimates and the 'expected change' since the scale is in years.
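
      The reviewer's arithmetic can be retraced with a minimal sketch. The endpoints (predicted age rising from about 50 to 55 across chronological ages 35 to 80) are the values read off Figure 2 in the comment above; the function name `fitted_slope` is a hypothetical helper introduced only for illustration.

```python
def fitted_slope(y_low, y_high, x_low, x_high):
    """Slope of a fitted line: change in y divided by change in x."""
    return (y_high - y_low) / (x_high - x_low)


# Endpoints as read off Figure 2 by the reviewer (approximate values).
slope = fitted_slope(y_low=50, y_high=55, x_low=35, x_high=80)

# Predicted brain-age change (in months) for 12 months of chronological aging.
months_per_year = slope * 12

# Same calculation for an 18-month intervention period.
months_per_intervention = slope * 18

print(f"slope = {slope:.2f}")
print(f"predicted change per 12 months = {months_per_year:.1f} months")
print(f"predicted change per 18 months = {months_per_intervention:.1f} months")
```

      Under these approximate endpoints, the expected change in predicted age over the 18-month study period is about two months, which is the scale problem the reviewer describes.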

      2) In Figure 2 it could also help to add the x = y line to get a better overview of the prediction variance. The estimates are likely clustered around the mean/median age of the training dataset, and age is overestimated in younger subs and underestimated in older subs (usually referred to as "age bias"). It is important to inspect the data points here to understand what the estimates represent, i.e., is variation in RSFC potentially lost by wrapping the data in this summary measure, since the age prediction is not particularly accurate, and should age bias in the predictions be accounted for by adjusting the test data for the bias observed in the training data?

      Response to comment 1+2: we agree with the reviewer that due to the relatively moderate correlation between the predicted and observed age, a large change in the observed age corresponds to a small change in the predicted age. We now state this limitation in Results section 2.1:

      “Despite being significant and reproducible, we note that the correlations between the observed and predicted age were relatively moderate.”

      And discuss this point in the Discussion section:

      “In the current work, we used RSFC for brain age estimation, which resulted in a MAE of ~8 years, which was larger than the intervention period. Nevertheless, we found that brain age attenuation was associated with changes in multiple health factors. The precision of an age prediction model based on RSFC is typically lower than a model based on structural brain imaging5. However, a higher model precision may result in a lower sensitivity to detect clinical effects6,7. Better tools for data harmonization among datasets6 and larger training sample size5 may improve the accuracy of such models in the future.”

      Moreover, we now add the x=y line to Fig. 2, so the readers can better assess the prediction variance as suggested by the reviewer:

      We prefer to avoid using different scales (year/month) in the x and y axes to avoid misleading the readers, but the lists of observed and predicted ages are available as SI files with a precision of two decimal places (~3 days).

      We note that despite the moderate prediction accuracy, we replicated these results in three separate cohorts.

      Regarding the effect of “age bias” (also known as “regression attenuation” or “regression dilution” 10), we are aware of this phenomenon and agree that it must be accounted for. In fact, the “age bias” is one of the reasons we chose to use the difference between the expected and observed ages as the primary outcome of the study, as this measure already takes this bias into account. To demonstrate this effect we now compute brain age attenuation in two ways: 1. As described and used in the current study (Methods 4.9); and 2. By regressing out the effect of age on the predicted brain age at both times separately, then subtracting the adjusted predicted age at T18 from the adjusted predicted age at T0. The second method is the standard method to account for age bias as described in a previous work 11. Below is a scatter plot of both measures across all participants:

      The x-axis represents the first method, used in the current study, and the y-axis represents the second method, described in Smith et al., (2019). Across all subjects, we found a nearly perfect 1:1 correspondence between the two methods (r=.998, p<0.001; MAE=0.45), as the two are mathematically identical. The small gap between the two is because the brain age attenuation model also takes into account the difference in the exact time that passed between the two scans for each participant (mean=21.36m, std = 1.68m).
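
      The second method described above (regressing out age at each time point separately, then differencing) can be sketched as follows. This is an illustration on synthetic data, not the study's code: the sample values, noise level, and the helper name `regress_out_age` are assumptions, and the first method (Methods 4.9) is not reproduced here.

```python
import numpy as np


def regress_out_age(age, predicted_age):
    """Residual of predicted age after removing a linear fit on chronological age."""
    slope, intercept = np.polyfit(age, predicted_age, deg=1)
    return predicted_age - (slope * age + intercept)


# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 100
age_t0 = rng.uniform(35, 80, n)   # chronological age at baseline
age_t18 = age_t0 + 1.5            # 18 months later
# Predictions with a shallow slope, mimicking age bias.
pred_t0 = 50 + 0.11 * (age_t0 - 35) + rng.normal(0, 3, n)
pred_t18 = 50 + 0.11 * (age_t18 - 35) + rng.normal(0, 3, n)

# Bias-adjusted predicted age at each time point, computed separately.
adj_t0 = regress_out_age(age_t0, pred_t0)
adj_t18 = regress_out_age(age_t18, pred_t18)

# Subtract the adjusted predicted age at T18 from that at T0.
attenuation = adj_t0 - adj_t18

# After adjustment, the residuals no longer correlate with chronological age.
print(np.corrcoef(age_t0, adj_t0)[0, 1])
```

      The final line prints a value that is zero up to floating-point error, which is the sense in which the adjustment removes the age bias from the per-timepoint estimates.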

      We now note this in Methods section 4.9:

      “We note that the result of computing the difference between the bias-corrected brain age gap at both times was nearly identical to the brain age attenuation measure (r=.99, p<0.001; MAE=0.45). The difference between the two is because the brain age attenuation model takes into account the difference in the exact time that passed between the two scans for each participant (mean=21.36m, std = 1.68m).”

      3) In Figure 3, some of the changes observed between time points are very large. For example, one subject with a chronological age of 62 shows a ten-year increase in brain age over 18 months. This change is twice as large as the full range of age variation in the brain age estimates (average brain age increases from 50 to 55 across the full chronological age span). This makes it difficult to interpret RSFC change in units of brain age. E.g., is it reasonable that a person's brain ages by ten years, either up or down, in 18 months? The colour scale goes from -12 years to 14 years, so some of the observed changes are 14 / 1.5 = 9 times larger than the actual time from baseline to follow-up.

      We agree that our model precision was relatively low, especially compared to the period of the intervention, as also stated by reviewer #1. We now discuss this issue in light of the studies pointed out by the reviewer (Discussion section):

      “In the current work, we used RSFC for brain age estimation, which resulted in a MAE of ~8 years, which was larger than the intervention period. Nevertheless, we found that brain age attenuation was associated with changes in multiple health factors. The precision of an age prediction model based on RSFC is typically lower than a model based on structural brain imaging5. However, a higher model precision may result in a lower sensitivity to detect clinical effects6,7. Better tools for data harmonization among datasets6 and larger training sample size5 may improve the accuracy of such models in the future.”

      Again, we note that despite the moderate prediction accuracy, we replicated these results in three separate cohorts and found that both the correlation and the MAE between the predicted and observed age were significant in all of them.

      RSFC for age prediction:

      1) Several studies show better age prediction accuracy with structural MRI features compared to RSFC. If the focus of the study is to use an accurate estimate of brain ageing rather than specifically looking at changes in RSFC, adding structural MRI data could be helpful.

      We focused on structural brain changes in a previous work, and the focus of the current work was assessing age-related functional connectivity alterations. We have now added a few sentences to the Introduction section that we hope better motivate our choice:

      “We previously found that weight loss, glycemic control, lowering of blood pressure, and increment in polyphenols-rich food were associated with an attenuation in brain atrophy 12. Obesity is also manifested in age-related changes in the brain’s functional organization as assessed with resting-state functional connectivity (RSFC). These changes are dynamic13 and can be observed in short time scales14 and thus of relevance when studying lifestyle intervention.”

      2) If changes in RSFC are the main focus, using brain age adds a complicated layer that is not necessarily helpful. It could be easier to simply assess RSFC change from baseline to follow up, and correlate potential changes with changes in e.g., BMI.

      We are specifically interested in age-related changes as we described a-priori in the registration of the study: https://clinicaltrials.gov/ct2/show/NCT03020186

      Moreover, age-related changes in RSFC are complex, multivariate and dependent upon the choice of theoretical network measures. We think that a data-driven brain age prediction approach might better capture these multifaceted changes and their relation to aging. We now state this in the Introduction section:

      “Studies have linked obesity with decreased connectivity within the default mode network15,16 and increased connectivity with the lateral orbitofrontal cortex17, which are also seen in normal aging18,19. Longitudinal trials have reported changes in these connectivity patterns following weight reduction20,21, indicating that they can be altered. However, findings regarding functional changes are less consistent than those related to anatomical changes due to the multiple measures22 and scales23 used to quantify RSFC. Hence, focusing on a single measure, the functional brain age, may better capture these complex, multivariate changes and their relation to aging.”

      The lack of control groups

      1) If no control group data is available, it is important to clarify this in the manuscript, and evaluate which conclusions can and cannot be drawn based on the data and study design.

      We agree that this point should be made clearer, and we now state this in the limitation section of the Discussion:

      “We also note that the lack of a no-intervention control group limits our ability to directly relate our findings to the intervention. Hence, we can only relate brain age attenuation to the observed changes in health biomarkers.”

      Also, following the comments of reviewers #2 and #3, we refer to the weight loss following 18 months of lifestyle intervention instead of to the intervention itself. This is now made clear in the title, abstract, and the main text.

      Reviewer #3 (Public Review):

      The authors report on an interesting study that addresses the effects of a physical and dietary intervention on accelerated/decelerated brain ageing in obese individuals. More specifically, the authors examined potential associations between reductions in Body-Mass-Index (BMI) and a decrease in relative brain-predicted age after an 18-months period in N = 102 individuals. Brain age models were based on resting-state functional connectivity data. In addition to change in BMI, the authors also tested for associations between change in relative brain age and change in waist circumference, six liver markers, three glycemic markers, four lipid markers, and four MRI fat deposition measures. Moreover, change in self-reported consumption of food, stratified by categories such as 'processed food' and 'sweets and beverages', was tested for an association with change in relative brain age. Their analysis revealed no evidence for a general reduction in relative brain age in the tested sample. However, changes in BMI, as well as changes in several liver, glycemic, lipid, and fat-deposition markers showed significant covariation with changes in relative brain age. Three markers remained significant after additionally controlling for BMI, indicating an incremental contribution of these markers to change in relative brain age. Further associations were found for variables of subjective food consumption. The authors conclude that lifestyle interventions may have beneficial effects on brain aging.

      Overall, the writing is concise and straightforward, and the language and style are appropriate. A strength of the study is the longitudinal design that allows for addressing individual accelerations or decelerations in brain aging. Research on biological aging parameters has often been limited to cross-sectional analyses so inferences about intra-individual variation have frequently been drawn from inter-individual variation. The presented study allows, in fact, investigating within-person differences. Moreover, I very much appreciate that the authors seek to publish their code and materials online, although the respective GitHub project page did not appear to be set to 'public' at the time (error 404). Another strength of the study is that brain age models have been trained and validated in external samples. One further strength of this study is that it is based on a registered trial, which allows for the evaluation of the aims and motivation of the investigators and provides further insights into the primary and secondary outcomes measures (see the clinical trial identification code).

      One weakness of the study is that no comparison between the active control group and the two experimental groups has been carried out, which would have enabled causal inferences on the potential effects of different types of interventions on changes in relative brain age. In this regard, it should also be noted that all groups underwent a lifestyle intervention. Hence, from an experimenter's perspective, it is problematic to conclude that lifestyle interventions may modulate brain age, given the lack of a control group without lifestyle intervention. This issue is fueled by the study title, which suggests a strong focus on the effects of lifestyle intervention. Technically, however, this study rather constitutes an investigation of the effects of successful weight loss/body fat reduction on brain age among participants who have taken part in a lifestyle intervention. In keeping with this, the provided information on the main effect of time on brain age is scarce, essentially limited to a sign test comparing the proportions of participants with an increase vs. decrease in relative brain age. Interestingly, this analysis did not suggest that the proportion of participants who benefit from the intervention (regarding brain age) significantly exceeds the proportion of participants who do not benefit. So strictly speaking, the data rather indicate that it is not the lifestyle intervention per se that contributes to changes in brain age, but successful weight loss/body fat reduction. In sum, I feel that the authors' claims on the effects of the intervention cannot be underscored very well given the lack of a control group without lifestyle intervention.
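
      For readers unfamiliar with the sign test mentioned above, a minimal sketch of an exact two-sided version follows. The total of 102 participants comes from the report; the 55/47 split between decreases and increases is a hypothetical example, not a result from the study, and `sign_test_two_sided` is an illustrative helper name.

```python
from math import comb


def sign_test_two_sided(n_pos, n_neg):
    """Exact two-sided sign test against a 50/50 null (ties already excluded)."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Probability of a count at least as extreme as k in the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capping the p-value at 1.
    return min(1.0, 2 * tail)


# Hypothetical split of 102 participants (not data from the study).
p_value = sign_test_two_sided(n_pos=55, n_neg=47)
print(f"p = {p_value:.3f}")
```

      A split this close to even gives a p-value far above conventional thresholds, which is the kind of outcome the reviewer refers to when noting that the proportion of participants who benefit does not significantly exceed the proportion who do not.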

      We agree that this point, also raised by reviewer #2, should be made clear, and we now state this in the limitation section of the Discussion:

      “We also note that the lack of a no-intervention control group limits our ability to directly relate our findings to the intervention. Hence, we can only relate brain age attenuation to the observed changes in health biomarkers.”

      Also, following reviewers #2 and #3, we refer to the weight loss following 18 months of lifestyle intervention instead of to the intervention itself. This is now explicitly mentioned in the title, abstract, and within the text:

      Title: “The effect of weight loss following 18 months of lifestyle intervention on brain age assessed with resting-state functional connectivity”

      Abstract: “…, we tested the effect of weight loss following 18 months of lifestyle intervention on predicted brain age, based on MRI-assessed resting-state functional connectivity (RSFC).”

      Another major weakness is that no rationale is provided for why the authors use functional connectivity data instead of structural scans for their age estimation models. This gets even more evident in view of the relatively low prediction accuracies achieved in both the validation and test sets. My notion of the literature is that the vast majority of studies in this field implicate brain age models that were trained on structural MRI data, and these models have achieved way higher prediction accuracies. Along with the missing rationale, I feel that the low model performances require some more elaboration in the discussion section. To be clear, low prediction accuracies may be seen as a study result and, as such, they should not be considered as a quality criterion of the study. Nevertheless, the choice of functional MRI data and the relevance of the achieved model performances for subsequent association analysis needs to be addressed more thoroughly.

      We agree that age estimation from structural compared to functional imaging yields a higher prediction accuracy. In a previous publication using the same dataset12, we demonstrated that weight loss was associated with an attenuation in brain atrophy, as we describe in the introduction:

      “We previously found that weight loss, glycemic control and lowering of blood pressure, as well as increment in polyphenols rich food, were associated with an attenuation in brain atrophy 12.”

      Here we were specifically interested in age-related functional alterations that are associated with successful weight reduction. Compared to structural brain changes, the effect of aging on functional connectivity is more complex and multifaceted. Hence, we decided to utilize a data-driven or prediction-driven approach for assessing age-related changes in functional connectivity by predicting participants’ functional brain age. We now describe this rationale in the introduction section:

      “Studies have linked obesity with decreased connectivity within the default mode network15,16 and increased connectivity with the lateral orbitofrontal cortex17, which are also seen in normal aging18,19. Longitudinal trials have reported changes in these connectivity patterns following weight reduction20,21, indicating that they can be altered. However, findings regarding functional changes are less consistent than those related to anatomical changes due to the multiple measures22 and scales23 used to quantify RSFC. Hence, focusing on a single measure, the functional brain age, may better capture these complex changes and their relation to aging.”

      We address the point regarding the low model performance in response to reviewer #2, comment #2.

    1. Author Response:

      Evaluation Summary:

      The authors studied the neural correlates of planning and execution of single finger presses in a 7T fMRI study focusing on primary somatosensory (S1) and motor (M1) cortices. BOLD patterns of activation/deactivation and finger-specific pattern discriminability indicate that M1 and S1 are involved not only during execution, but also during planning of single finger presses. These results contribute to a developing story that the role of primary somatosensory cortex goes beyond pure processing of tactile information and will be of interest for researchers in the field of motor control and of systems neuroscience.

      We thank all reviewers and the editor for their assessment of our paper. We acknowledge that our description of the methods and some interpretation of the results can be clarified and expanded. We address every comment and suggestion below.

      Reviewer #1 (Public Review):

      This is a very important study for the field, as the involvement of S1 in motor planning has never been described. The paradigm is very elegant, the methods are rigorous and the manuscript is clearly written. However, there are some concerns about the interpretation of the data that could be addressed.

      We thank Reviewer #1 for the positive evaluation of our study. We clarify our methodological choices and interpretation of the data in the following response.

      • The authors claim that planning and execution patterns are scaled versions of each other, and that overt movement during planning is prevented by global deactivation. This is an interesting perspective, however the presented data are not fully convincing to support this claim:

      (1) the PCM analysis shows that correlation models ranging from 0.4 to 1 perform similarly to the best correlation model. This correlation range is wide and suggests that the correspondence between execution/planning patterns is only partial.

      The reviewer is correct that the current data leaves us with a specific amount of uncertainty. However, it should be noted that the maximum-likelihood estimates of correlations between noisy patterns are biased, as they are constrained to be smaller or equal to 1. Thus, we cannot test the hypothesis that the correlation is 1 by just comparing correlation estimates to 1 (for details on this, see our recent blog on this topic: http://www.diedrichsenlab.org/BrainDataScience/noisy_correlation/). To test this idea, we therefore use a generative approach (the PCM analysis). We find that no correlation model has a higher log-likelihood than the 1-correlation model, therefore we cannot rule out that the underlying true correlation is actually 1. In other words, we have as much evidence that the correspondence is only partial as we do that the correspondence is perfect. The ambiguity given by the wide correlation range is due to the role of measurement noise in the data and should not be interpreted as if the true correlation was lower than 1. What we can confidently conclude is that activity patterns have a substantial positive correlation between planning and execution. We take this opportunity to clarify this point in the results section.
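
      The effect of measurement noise on correlation estimates can be illustrated with a small simulation. This sketch shows only the attenuation of a plain Pearson correlation when two noisy patterns share a true correlation of 1; it is not the PCM analysis, and the voxel count, scaling factor, and noise level are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 500

# One underlying pattern; "planning" is a scaled-down copy of "execution",
# so the true correlation between the two noiseless patterns is exactly 1.
true_pattern = rng.normal(0, 1, n_voxels)
planning_true = 0.3 * true_pattern
execution_true = 1.0 * true_pattern

# Independent measurement noise added to each condition (assumed noise level).
planning_obs = planning_true + rng.normal(0, 1, n_voxels)
execution_obs = execution_true + rng.normal(0, 1, n_voxels)

r_true = np.corrcoef(planning_true, execution_true)[0, 1]
r_observed = np.corrcoef(planning_obs, execution_obs)[0, 1]

print(f"true correlation:     {r_true:.3f}")
print(f"observed correlation: {r_observed:.3f}")
```

      The observed correlation falls well below 1 even though the underlying patterns are perfectly correlated, which is why comparing a raw estimate against 1 is uninformative and a generative, likelihood-based comparison is used instead.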

      (2) in Fig.4 A-B, the distance between execution/planning patterns is much larger than the distance between fingers. How can such a big difference be explained if planning/execution correspond to scaled versions of the same finger-specific patterns? If the scaling is causing this difference, then different normalization steps of the patterns should have very specific effects on the observed results: 1) removing the mean value for each voxel (separately for execution and planning conditions) should nullify the scaling and the planning/execution patterns should perfectly align in a finger-specific way; 2) removing the mean pattern (separately for each finger conditions) should effectively disturb the finger-specific alignment shown in Fig.4C. These analyses would corroborate the authors' conclusion.

      The large distance between planning and execution patterns (compared to the distance between fingers) is caused by the fact that the average activity pattern associated with planning differs substantially from the average activity pattern during execution. Such a large difference is of course expected, given the substantially higher activity during execution. However, here we are testing the hypothesis that the pattern vectors that are related to a specific finger within either planning or execution are scaled versions of each other. Visually, this can be seen in Figure 4B (bottom), where the MDS plot is rotated such that the line of sight is in the direction of the mean pattern difference between planning and execution, so that this difference disappears in the projection. Relative to the baseline mean of the data (cross), you can see that the arrangement of the fingers in planning (orange) is a scaled version of the arrangement during execution (blue). The PCM model provides a likelihood-based test for this idea. The model accounts for the overall difference between planning and execution by including (and estimating) model terms related to the mean pattern of planning and execution, respectively, therefore effectively removing the mean activation of planning and execution. We have now explained this better in the results and methods sections, also referring to a Jupyter notebook example of the correlation model used (https://pcm-toolbox-python.readthedocs.io/en/latest/demos/demo_correlation.html).

      Regarding your analysis suggestions, removing the mean pattern for planning and execution across fingers as a fixed effect (suggestion 1) leads to the distance structure shown in Fig 4B (bottom)—showing that the finger-specific patterns during planning are scaled versions of those during execution (also see Fig. R1 below). On the other hand, subtracting the mean finger pattern across planning and execution (suggestion 2) will not fully remove the finger specific activation as the finger-specific patterns are differently scaled in planning and execution. Furthermore, neither of these subtraction analyses allows for a formal test of the hypotheses that the data can be explained by a pure scaling of the finger-specific patterns.

      Figure R1. RDM of left S1 activity patterns evoked by the three fingers (1, 3, 5) during no-go planning (orange) and execution (blue) after removing the mean pattern across fingers (separately for planning and execution). The bottom shows the corresponding multidimensional scaling (MDS) projection of the first two principal components. Black cross denotes mean pattern across conditions.
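
      The logic of removing the mean pattern separately for planning and execution can be sketched on noiseless synthetic patterns. All values here (three fingers, 200 voxels, a scaling factor of 0.3, the two condition means) are assumptions chosen for illustration; real data contain noise, which is why a formal model-based test is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_fingers, n_voxels = 3, 200
scale = 0.3  # assumed scaling of finger-specific patterns during planning

# Finger-specific patterns plus a different mean pattern for each condition.
finger_patterns = rng.normal(0, 1, (n_fingers, n_voxels))
mean_execution = rng.normal(2, 1, n_voxels)
mean_planning = rng.normal(-1, 1, n_voxels)

execution = mean_execution + finger_patterns
planning = mean_planning + scale * finger_patterns

# Remove the mean pattern across fingers, separately for each condition.
execution_centered = execution - execution.mean(axis=0)
planning_centered = planning - planning.mean(axis=0)

# Before centering, the condition means dominate the difference between patterns;
# after centering, planning is an exact scaled copy of execution.
print(np.allclose(planning_centered, scale * execution_centered))
```

      In this noiseless setting the centered planning patterns equal the centered execution patterns multiplied by the scaling factor, so the large planning-execution distance is carried entirely by the condition means.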

      • A conceptual concern is related to the task used by the authors. During the planning phase, as a baseline task, participants are asked to maintain a low and constant force for all the fingers. This condition is not trivial and can even be considered a motor task itself. Therefore, the planning/execution of the baseline task might interfere with the planning/execution of the finger press task. Even more controversial, the design of the motor task might be capturing transitions between different motor tasks (force on all finger towards single-finger press) rather than pure planning/execution of a single task. The authors claim that the baseline task was used to control for involuntary movements, however, EMG recordings could have similarly controlled for this aspect, without any confounds.

      Participants received training the day before scanning, which made the “additional” motor task very easy, almost trivial. In fact, the system was calibrated so that the natural weight of the hand on the keys was enough to bring the finger forces within the correct range to be maintained. Thus, very little planning/online control was required by the participants before pressing the keys. As for the concern of capturing transitions between different motor tasks, this is indeed always the case in natural behavior. Arguably there is no such thing as “pure rest” in the motor system; active effort has to be made even to maintain posture. Furthermore, if the motor system considered the hold phase as a simultaneous movement phase, this should have prevented M1 and S1 from participating in the planning of upcoming movements, as they would be busy with maintaining and controlling the pre-activation. Having found clear planning-related signals in M1 and S1 in this situation makes our argument, if anything, stronger.

      Finally, we specifically chose not to do EMG recordings because finger forces are a more sensitive measure of micro movements than EMG. Extensive pilot experiments for our papers studying ipsilateral representations and mirroring (e.g., Diedrichsen et al., 2012; Ejaz et al., 2018) have shown that we can pick up very subtle activations of hand muscles by measuring forces of a pre-activated hand, signals that clearly escape detection when recording EMG in the relaxed state. Based on these results, we actually consider the recording of EMG during the relaxed state as an insufficient control for the absence of cortical-spinal drive onto hand muscles. This is especially a concern when recording EMG during scanning, due to the decreased signal-to-noise ratio.

      • In Fig. 2F, the authors show no planning-related information in high-order areas (PMd, aSPL), while such information is found in M1 and S1. This null result from premotor and parietal areas is rather surprising, considering current literature, largely cited by the authors, pointing to high-order motor or parietal areas involved in action planning.

      We agree with the reviewer that, to some extent, the lack of involvement of high-order areas in planning is surprising. However, we believe that task difficulty (i.e., planning demands) plays a role in the amount of observed planning activation. In other words, because participants were only asked to plan repeated movements of one finger, there was little to plan. The fact that this may have contributed to the null result in premotor and parietal areas was further confirmed by the second half of the dataset, which is not reported in the current paper. Here, we investigated the planning of multi-finger sequences, where planning demands are certainly higher. We found that high-order areas such as PMd and SPL were indeed active and involved in the planning of those, as expected. We decided to split the dataset across two publications as the multi-finger sequences have their own complexities, which would have distracted from the main finding of planning related activity in M1 and S1.

      Reviewer #3 (Public Review):

      I found the manuscript to be well written and the study very interesting. There are, however, some analytical concerns that in part arise because of a lack of clarity in describing the analyses.

      1) Some details regarding the methods used and results in the figures were missing or difficult to understand based on the brief description in the Methods section or figure legend.

      We thank Reviewer #3 for pointing out some lack of clarity in our description of the methods. We have now expanded both the methods section and the figure captions (Figs. 2–4).

      2) I think the manuscript would benefit from a more balanced description of the role of S1. As the authors state, S1 is traditionally thought to process afferent tactile and proprioceptive input. However, in the past years, S1 has been shown to be somatotopically activated during touch observation, attempted movements in the absence of afferent tactile inputs, and through attentional shifts (Kikkert et al., 2021; Kuehn et al., 2014; Puckett et al., 2017; Wesselink et al., 2019). Furthermore, S1 is heavily interconnected with M1, so perhaps if such activity patterns are present in M1, they could also be expected in S1?

      To better characterize the role of S1 during movement planning, we now include recent research showing that S1 can be somatotopically recruited even in the absence of tactile inputs.

      3) Related to the previous comment: If attentional shifts on fingers can activate S1 somatotopically, could this potentially explain the results? Perhaps the participants were attending to the fingers that were cued to be moved and this would have led to the observed activity patterns. I don't think the data of the current study allows the authors to tease apart these potential contributions. It is likely that both processes contribute simultaneously.

      We agree that our results could also be explained by attentional shifts on the fingers. It is very likely that, during planning, participants were specifically focusing on the cued finger. However, as the reviewer points out, our current dataset cannot distinguish between planning and attention as voluntary planning requires attention. We expanded the discussion section to include this possibility.

      4) The authors repeatedly interpret the absences of significant differences as indicating that the tested entities are the same. This cannot be concluded based on results of frequentist statistical testing. If the authors would like to make such claims, then I think they should include Bayesian analysis to investigate the level of support for the null hypothesis.

      We have now clarified the parts in the manuscript that sounded as if we were interpreting the absence of significant difference (null results) as significant absence of differences (equivalence).

    1. Author Response

      Reviewer #1 (Public Review):

      This study investigates how pathogens might shape animal societies by driving the evolution of different social movement rules. The authors find that higher disease costs induce shifts away from positive social movement (preference to move towards others) to negative social movement (avoidance from others). This then has repercussions on social structure and pathogen spread.

      Overall, the study comprises a good mixture of intuitive and less intuitive results. One major weakness of the work, however, is that the model is constructed around one pathogen that repeatedly enters a population across hundreds of generations. While the authors provide some justification for this, it does not capture any biological realism in terms of the evolution of the pathogen itself, which would be expected. The lack of co-evolution in the model substantially limits the generality of the results. For example, a number of recent studies have reported that animals might be expected to become very social when pathogens are very infectious, because if the pathogen is unavoidable they may as well gain the benefits of being social. The authors make some arguments about being focused on introduction events, but this does not really align well with their study design that carries through many generations after the introduction. Given the rapid evolutionary dynamics, perhaps the study could have a more focused period immediately after the initial introduction of the pathogen to look at rapid evolutionary responses (albeit this may need some sensitivity analyses around the parameters such as the mutation rates).

      We appreciate the reviewer’s evaluation of our work, and acknowledge that we have not currently included evolutionary dynamics for the pathogen.

      One conceptual impediment to such inclusion is knowing how pathogen traits could be modelled in a mechanistic way. For example, it is widely held that there is a trade-off between infection cost and transmissibility, with a quadratic relationship between them, but this is a pattern and not a process per se. We are unsure which mechanisms could be modelled that impinge upon both infection cost and transmissibility.

      On the practical side, we feel that a mechanistic, individual-based model that includes both pathogen and host evolution would become very challenging to interpret. It might be more tractable to begin with a mechanistic, spatial model that examines pathogen trait evolution with an unchanging host (such as an adaptation of Lion and Boots, 2010). We would be happy to take this on in future work, with a view to combining models thereafter.

      We have taken the suggestion to focus on the period immediately after the introduction, and we now focus on the following 500 generations. While 500 generations is still a long time, we would note that our model dynamics typically stabilise within 200 generations. We show the following generations primarily to check that some stability in the dynamics has indeed been reached (but see our new scenario 2).

      We also appreciate the point regarding mutation rates. Our mutation rates are relatively high to account for the small size of our population. We have found that with smaller mutation rates (0.001 rather than 0.01), evolutionary shifts in our population do not occur within the first 500 generations. This is primarily because prior to pathogen introduction, the ‘agent avoiding’ strategy that becomes common later is actually quite rare. Whether a rapid transition takes place thus depends on whether there are any agent avoiding individuals in the population at the moment of pathogen introduction, or on whether such individuals emerge rapidly thereafter through mutations on the social weights. We expect that with larger population sizes, we would be able to recover our results with smaller mutation rates as well.

      A final, and much more minor comment is whether this is really a paper about movement. The model does not really look at evolutionary changes in how animals move, but rather at where they move. How important is the actual movement process under this model? For example, would the results change if the model was constructed without explicit consideration of space and resources, but instead simply modelled individuals' decisions to form and break ties? (Similar to the recent paper by Ashby & Farine https://onlinelibrary.wiley.com/doi/full/10.1111/evo.14491 ). It might help to provide more information about how putting social decisions into a spatially explicit framework is expected to extend studies that have not done so (e.g., because they are analytical).

      This paper is indeed about movement, as where to move is a key part of the movement ecology paradigm (Nathan et al. 2008). That said, we appreciate the advice to emphasise the importance of social decisions in a spatial context, and we have added this emphasis to the Introduction (L. 79 – 81) and Discussion (L. 559 – 562). In brief, we do expect different dynamics to result from the explicit spatial context, as compared to a model in which social associations are probabilistic and could occur with any individual in the population.

      In our models, individual social tendency (whether they prefer moving towards others) is separated from individual sociality (whether they actually associate with other individuals). This can be seen from our (new) Fig. 3D, in which individuals of each of the social strategies can sometimes have similar numbers of associations (although modulated by movement). This separation of the pattern from the underlying process is possible, we believe, due to the heterogeneity in the social landscape created by the explicit spatial context.

      Reviewer #2 (Public Review):

      This theoretical study looks at individuals' strategies to acquire information before and after the introduction of pathogens into the system. The manuscript is well-written and gives a good summary of the previous literature. I enjoyed reading it and the authors present several interesting findings about the development of social movement strategies. The authors successfully present a model to look at the costs and benefits of sociality.

      I have a couple of major comments about the work in its current form that I think are very important for the authors to address. That said, I think this is a promising start and that with some revisions, this could be a valuable contribution to the literature on behavioral ecology.

      We appreciate the reviewer’s kind words.

      Before starting, I would like to be precise that, given the scope of the models and the number of parameter choices that were necessary, I am going to avoid criticisms of the decisions made when designing the models. However, there are a few assumptions I rather find problematic and would like to give proper attention to.

      The first regards social vs. personal information. Most of the model argumentation is based on the reliance on social information (considering four, but to me overlapping, social strategies that are somehow static and heritable) but in fact, individuals may oscillate between relying on their personal information and/or on social information -- which may depend on the availability of resources, population density, stochastic factors, among others (Dall et al. 2005 Trends Ecol. Evol., Duboscq et al. 2016 Frontiers in Psychology). In my opinion, ignoring the influence of personal and social information decreases the significance of this work. I am aware that the authors consider the detection of food present in the model, but this is considered to a much smaller extent (as seen in their weight on individual decisions) than the social information cues.

      We appreciate the point that individuals can switch between relying on social and personal information. However, we would point out that in our model, the social strategies are not static. The social strategy is a convenient way of representing individuals’ position in behavioural trait-space (the ‘behavioural hypervolume’ of Bastille-Rousseau and Wittemyer 2019). This essentially means that the importance assigned to each of the three cues available in our model varies among individuals. There are indeed individuals that are primarily guided by the density of food items, and this is the commonest ‘overall’ movement strategy before the pathogen is introduced. We represent this by showing how the importance of social information is low before pathogen introduction (Fig. 2B).

      While we primarily focus on the importance of social information, this is because the population quite understandably evolves a persistent preference for moving towards food items (i.e., using personal information if available). We have made this clearer in the text on lines 367 – 371.

      Critically, it is also unclear how, if at all, the information and pathogen traits are related to each other. If a handler gets sick, how does this affect its foraging activity (does it stop foraging, slow its activities, or does it show signs of sickness)? Perhaps this model is attempting to explore the emergence of social movement strategies only, but how they disentangle an individual's sickness status and behavioral response is unclear.

      We appreciate that infection may lead to physiological effects (e.g. altered metabolic rates, reduction in cognitive capacity) that may then influence behaviour. Our model aims to be a relatively simple and general one, and does not consider the explicit mechanisms by which infection imposes a cost on fitness. Thus we do not include any behavioural modifications due to infection, as we feel that these would be much too complex to include in such a model. We would be happy to explore, in future work, phenomena such as the evolution of self-isolation and infection detection, which are common among animals such as social insects (Stroeymeyt et al. 2018, Pusceddu et al. 2021).

      However, we have considered an alternative implementation of our model’s scenario 1 which could be interpreted as the infection reducing foraging efficiency by a certain percentage (other interpretations of the redirection of energy away from reproduction are also possible). We show how this implementation leads to very similar outcomes as those seen in our

      Very little is presented about the virulence of the pathogens and how they could affect the emergence of social strategies. The authors keep their main argumentation based on the introduction of novel pathogens (without distinctions on their pathogenicity), but a behavioral response is rather influenced by how fast individuals are infected and which are their chances of recovering. Besides, they consider that only one or two social interactions would be enough for pathogen transmission to occur.

      We have indeed considered a fixed transmission probability of 0.05, a relatively modest attack rate. Setting transmission probability to two other values (0.025, 0.1), we find that our general results are recovered - there is an evolutionary transition away from sociality, with the proportion of agent avoidance evolved increasing with the transmission probability. While we do not show these results in the main text, we have included figures showing the proportions of each social movement strategy here for the reviewers’ reference.

      Figures showing the proportion of social movement strategies in two simulation runs of our default implementation of scenario 1 (dE = 0.25, R = 2, pathogen introduction begins from G = 500). Top: Probability of transmission = 0.025 (half of the default). Bottom: Probability of transmission = 0.10 (double the default). Overall, the proportion of agent avoidance evolved (purple) increases with the probability of transmission. Each figure shows a single replicate of each parameter combination, for only 1,000 generations.
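
To put the per-contact transmission probabilities above in perspective, the following minimal sketch computes the chance of becoming infected after a given number of contacts. It assumes, purely for illustration, that every contact is with an infectious individual and that contacts are independent; the function name `infection_probability` is hypothetical and is not part of our simulation code.

```python
def infection_probability(p_transmit: float, n_contacts: int) -> float:
    """Chance of at least one successful transmission in n independent contacts.

    Illustrative assumption: each contact is with an infectious individual and
    transmits independently with probability p_transmit.
    """
    return 1.0 - (1.0 - p_transmit) ** n_contacts


if __name__ == "__main__":
    # The three per-contact probabilities discussed in the response.
    for p in (0.025, 0.05, 0.1):
        risks = [round(infection_probability(p, n), 4) for n in (1, 2, 10, 50)]
        print(f"p = {p}: risk after 1, 2, 10, 50 contacts = {risks}")
```

Under these assumptions, one or two contacts at the default probability of 0.05 leave the cumulative risk below 10%, so infection typically requires repeated exposure rather than a single interaction.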

      Another important component is that individuals do not die, and it seems that they always have a chance (even if it is small) to reproduce. So, how the authors consider unsuccessful strategies in the model outputs or how these social strategies would be potentially "dismissed" by natural selection are not considered.

      We appreciate the point that our simulation does not include mortality effects, and that all individuals have some small chance of reproducing. There are a few practical and conceptual challenges when incorporating this level of realism in a general model. Including mortality effects could allow for the emergence of more complex density-dependent dynamics, as dead individuals would not be able to transmit the pathogen to other foragers (although for some pathogens, this could be a valid choice), nor would they be sources of social information. This would make the model much more challenging to interpret, and we have tried to keep this model as simple as possible.

      We have also sought to keep the model’s focus on the evolutionary dynamics, and to not focus on mortality. In order to balance this aim with the reviewer's suggestion, we have included a new implementation of the model’s scenario 1 which has a threshold on reproduction. That means that only individuals with a positive energy balance (intake > infection costs) are allowed to reproduce. We show a potentially counter-intuitive result, that the more social ‘handler tracking’ strategy persists at a higher frequency than in our default implementation, despite having a higher infection rate than the ‘agent avoiding’ strategy. We suggest that this is because the ‘agent avoiding’ individuals have very low or no intake. This is sufficient in our default implementation to have relatively higher fitness than the more frequently infected handler tracking individuals.

      Reviewer #3 (Public Review):

      Gupte and colleagues develop an individual-based model to examine how the introduction of a novel pathogen influences the evolution of social cue use in a population of agents for which social cues can both facilitate more efficient foraging, but also expose individuals to infection. In their simulations, individuals move across a landscape in search of food, and their movements are guided by a combination of cues related to food patches, individuals that are currently handling food items, and individuals that are not actively handling food. The latter two cues can provide indirect information about the likely presence of food due to the patchiness of food across the landscape.

      The authors find that prior to introducing the novel pathogen, selection favors strategies that home in on agents, regardless of whether those agents are currently handling food items. The overall contribution of these social cues to movement decisions, however, tends to be relatively small. After pathogen introduction, agents evolve to rely more heavily on social information and to either be more selective in their use of it (attending to other agents that are currently handling food and avoiding non-handlers) or avoiding other agents altogether. Gupte and colleagues further examine the ecological consequences of these shifts in social decision-making in terms of individuals' overall movement, food consumption, and infection risk. Relative to pre-introduction conditions, individuals move more, consume less food, and are less likely to be infected due to reduced contact with others. Epidemiological models on emergent social networks confirm that evolved behavioral changes generate networks that impede the spread of disease.

      The introduction of novel pathogens into wild populations is expected to be increasingly common due to climate change and increasing global connectedness. The approach taken here by the authors is a potentially worthwhile avenue to explore the potential eco-evolutionary consequences of such introductions. A major strength of this study is how it couples ecological and evolutionary timescales. Dominant behavioral strategies evolve over time in response to changing environmental conditions and impact social, foraging, and epidemiological dynamics within generations. I imagine there are many further questions that could be fruitfully explored using the authors' framework. There are, however, important caveats that impact the interpretation of the authors' findings.

      First, reproduction bears no cost in this model. Individuals produce offspring in proportion to their lifetime net energy intake, which is increased by consuming food and decreased by a set amount per turn once infected. However, prior to reproduction, net energy intake is normalized (0-1) according to the lowest individual value within the generation. This means that individuals need not maintain a positive energy balance nor even consume food at all to successfully reproduce, so long as they perform reasonably well relative to other members of the population. Since consuming food is not necessary to reproduce, declining per capita intake due to evolved social avoidance (Fig. 1d) likely decreases the importance of food to an individual's reproductive success relative to simply avoiding infection. This dynamic could explain the delayed emergence of the 'agent avoiding' strategy (Fig. 1a), as this strategy potentially is only viable once per capita intake reaches a sufficiently low level across the population (Fig. 1d). I am curious to know what the results would be if reproduction required some minimal positive net energy, such that individuals must risk food patches in order to reproduce. It would also be useful for the authors to provide information on how net energy intake changes across generations, as well as whether (and if so, how) attraction to the food itself may change over time.

      We thank the reviewer for their assessment of our work, and appreciate the point raised here (and in an earlier review) about individuals potentially reproducing without any intake. We have addressed this by running our default model [repeated introductions, R = 2, dE = 0.25], with a threshold on reproduction such that only individuals with a positive energy balance can reproduce. We mention these results in the text (L. 495 – 500), and show related figures in the SI Appendix. In brief, as the reviewer suggests, agent avoiding is less common for our default parameter combination, but becomes as common as the default combination when the infection cost is doubled (to dE = 0.5).
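
The difference between the default reproduction rule and the thresholded variant can be sketched as follows. This is a simplified illustration rather than our simulation code: the function names, the min-max form of the 0-1 normalisation, and the use of raw positive energy as the weight in the thresholded case are all assumptions made for the example.

```python
import random


def reproduction_weights(net_energy, require_positive=False):
    """Return one reproduction weight per individual.

    Default (assumed form): net energy is rescaled to 0-1 relative to the
    lowest value in the generation, so an individual with negative or zero
    energy can still reproduce if others did worse.
    Threshold variant: only a positive energy balance gives a non-zero weight.
    """
    if require_positive:
        return [e if e > 0 else 0.0 for e in net_energy]
    lowest, highest = min(net_energy), max(net_energy)
    if highest == lowest:
        return [1.0] * len(net_energy)  # everyone equal: uniform weights
    return [(e - lowest) / (highest - lowest) for e in net_energy]


def sample_parents(net_energy, n_offspring, require_positive=False, rng=None):
    """Draw parent indices with probability proportional to their weights."""
    rng = rng or random.Random()
    weights = reproduction_weights(net_energy, require_positive)
    if sum(weights) <= 0:
        return []  # nobody is eligible to reproduce
    return rng.choices(range(len(net_energy)), weights=weights, k=n_offspring)


if __name__ == "__main__":
    energy = [-2.0, -1.0, 0.5, 3.0]  # made-up lifetime net energy values
    print(reproduction_weights(energy))
    print(reproduction_weights(energy, require_positive=True))
```

With the made-up values above, the individual with a net energy of -1.0 keeps a non-zero weight under the default rule but is excluded under the threshold, which is the distinction the reviewer raises.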

      We appreciate the reviewer’s suggestion about decreasing per-capita intake being a precondition for the proliferation of the agent avoiding strategy. With our new results, we now show that there is no overall decrease in intake, but the agent avoiding strategy still becomes a common strategy after pathogen introduction. As the reviewer suggests, this is because these individuals have an equivalent net energy as handler tracking individuals, as they are less frequently infected.

      We suggest that the delayed emergence of the agent avoiding strategy is primarily due to mutation limitations – such individuals are uncommon or non-existent in the simulation before pathogen introduction, and random mutations are required for them to emerge. As we have noted in response to an earlier comment, this becomes clear when the mutation rate is reduced from 0.01 to 0.001 – agent avoidance usually does not evolve at all.

      A second important caveat is that the evolutionary responses observed in the model only appear when novel pathogen introductions are extremely frequent. The model assumes no pathogen co-evolution, but rather that the same (or a functionally identical) pathogen is re-introduced every generation (spillover rate = 1.0). When the authors considered whether evolutionary responses were robust to less frequent introductions, however, they found that even with a per-generation spillover rate of 0.5, there was no impact on social movement strategies. The authors do discuss this caveat, but it is worth highlighting here as it bears on how general the study's conclusions may be.

      We appreciate the reviewer’s point entirely. We would point out that current knowledge about pathogen introductions across species and populations in the wild is very poor. However, the ongoing highly pathogenic avian influenza outbreak (Wille and Barr 2022), the spread of multiple strains of SARS-CoV-2 to wild deer in several different human-to-wildlife transmission events, and recent work on the potential for coronavirus spillovers from bats to humans, all suggest that at least some generalist pathogens must circulate quite widely among wildlife, often crossing into novel host species or populations. We have added these considerations to the text on lines 218 – 231.

      We have also added, in order to confront this point more squarely, a new scenario of our model in which the pathogen is introduced just once, and then transmits vertically and horizontally among individuals (lines 519 – 557). This scenario more clearly suggests when evolutionary responses to pathogen introductions are likely to occur, and what their consequences might be for a pathogen becoming endemic in a population. This scenario also serves as a potential starting point for models of host-pathogen trait co-evolution, and we have added this consideration to the text on lines 613 – 623.

      References

      ● Albery, G. F. et al. 2021. Multiple spatial behaviours govern social network positions in a wild ungulate. - Ecology Letters 24: 676–686.

      ● Bastille-Rousseau, G. and Wittemyer, G. 2019. Leveraging multidimensional heterogeneity in resource selection to define movement tactics of animals. - Ecology Letters 22: 1417–1427.

      ● Gupte, P. R. et al. 2021. The joint evolution of animal movement and competition strategies. - bioRxiv in press.

      ● Lion, S. and Boots, M. 2010. Are parasites “prudent” in space? - Ecology Letters 13: 1245–1255.

      ● Lloyd-Smith, J. O. et al. 2005. Superspreading and the effect of individual variation on disease emergence. - Nature 438: 355–359.

      ● Nathan, R. et al. 2008. A movement ecology paradigm for unifying organismal movement research. - PNAS 105: 19052–19059.

      ● Pusceddu, M. et al. 2021. Honey bees increase social distancing when facing the ectoparasite Varroa destructor. - Science Advances 7: eabj1398.

      ● Sánchez, C. A. et al. 2022. A strategy to assess spillover risk of bat SARS-related coronaviruses in Southeast Asia. - Nat Commun 13: 4380.

      ● Stroeymeyt, N. et al. 2018. Social network plasticity decreases disease transmission in a eusocial insect. - Science 362: 941–945.

      ● Wilber, M. Q. et al. 2022. A model for leveraging animal movement to understand spatio-temporal disease dynamics. - Ecology Letters in press.

      ● Wille, M. and Barr, I. G. 2022. Resurgence of avian influenza virus. - Science 376: 459–460.

    1. Author Response:

      Reviewer #1:

      In this paper, Alhussein and Smith set out to determine whether motor planning under uncertainty (when the exact goal is unknown before the start of the movement) results in motor averaging (average between the two possible motor plans) or in performance optimization (one movement that maximizes the probability of successfully reaching to one of the two targets). Extending previous work by Haith et al. with two new, cleanly designed experiments, they show that performance optimization provides a better explanation of motor behaviour under uncertainty than the motor averaging hypothesis.

      We thank the reviewer for the kind words.

      1) The main caveat of experiment 1 is that it rules out one particular extreme version of the movement averaging idea- namely that the motor programs are averaged at the level of muscle commands or dynamics. It is still consistent with the idea that the participant first average the kinematic motor plans - and then retrieve the associated force field for this motor plan. This idea is ruled out in Experiment 2, but nonetheless I think this is worth adding to the discussion.

      This is a good point, and we have now included it in the paper as suggested – both in motivating the need for Expt 2 in the Results section and when interpreting the results of Expt 1 in the Discussion section.

      2) The logic of the correction for variability between the one-target and two-target trials in Formula 2 is not clear to me. It is likely that some of the variability in the two-target trials arises from the uncertainty in the decision - i.e. based on recent history one target may internally be assigned a higher probability than the other. This is variability the optimal controller should know about and therefore discard in the planning of the safety margin. How big was this correction factor? What is the impact when the correction is dropped ?

      Short Answer:

      (1) If decision uncertainty contributed to motor variability on 2-target trials as suggested, 2-target trials should display greater motor variability than 1-target trials. However, 1-target and 2-target trials display levels of motor variability that are essentially equal – with a difference of less than 1% overall, as illustrated in Fig R2, indicating that decision uncertainty, if present, has no clear effect on motor variability in our data.

      (2) The sigma2/sigma1 correction factor is, therefore, very close to 1, with an average value of 1.00 or 1.04 depending on how it’s computed. Thus, dropping it has little impact on the main result as shown in Fig R1.

      Longer, more detailed, answer:

      We agree that it could be reasonable to think that if it were true that motor variability on 2-target trials were consistently higher than that on 1-target trials, then the additional variability seen on 2-target trials might result from uncertainty in the decision, which should not affect safety margins if the optimal controller knew about this variability. However, detailed analysis of our data suggests that this is not the case. We present several analyses below that flesh this out.

      We apologize in advance that the response we provide to this seemingly straightforward comment is so lengthy (4+ pages!), especially since capitulating to the reviewer’s assertion that the “correction” for the motor variability differences between 1-target and 2-target trials should be removed from our analysis would make essentially no difference in the main result, as shown in Fig R1 above. Note that the error bars on the data show 95% confidence intervals. However, taking the difference in motor variability (or more specifically, its ratio) between 1-target and 2-target trials into account is crucial for understanding inter-individual differences in motor responses in uncertain conditions. As this reviewer (and reviewer 2) points out below, we did a poor job of presenting the inter-individual differences analysis in the original version of this paper, but we have improved both the approach and the presentation in the current revision, and we think that this analysis is important, despite being secondary to the main result about the group-averaged findings.

      Therefore, we present analyses here showing that it is unlikely that decision uncertainty accounts for the individual-participant variability differences we observe between 1-target and 2-target trials in our experiments (Fig R2). Instead, we show that the variability differences we observe in different conditions for individual participants are due to (largely idiosyncratic) spatial differences in movement direction (Fig R3), which when taken into account, afford a clearly improved ability to predict the size of the safety margins around the obstacles, both in 1-target trials where there is no ‘decision’ to be made (Figs R4-R6) and in 2-target trials (Figs R5-R6).

      Variability is, on average, nearly identical on 1-target & 2-target trials, indicating no measurable decision-related increase in variability on 2-target trials

      At odds with the idea that decision uncertainty is responsible for a meaningful fraction of the 2-target trial variability that we measure, we find that motor variability on 2-target trials is essentially unchanged from that on 1-target trials overall, as shown in Fig R2 (error bars show 95% confidence intervals). This is the case for both the data from Expt 2a (6.59±0.42° vs 6.70±0.96°, p > 0.8), and for the critical data from Expt 2b that was designed to dissociate the MA hypothesis from the PO hypothesis (4.23±0.17° vs 4.23±0.27°, p > 0.8 for the data from Expt 2b), as well as when the data from Expts 2a-b are pooled (4.78±0.24° vs 4.81±0.35°, p > 0.8). Note that the nominal difference in motor variability between 1-target and 2-target trials was just 1.7% in the Expt 2a data, 0.1% in the Expt 2b data, and 0.6% in the pooled data. This suggests little to no overall contribution of decision uncertainty to the motor variability levels we measured in Expt 2.

      Correspondingly, the sigma2/sigma1 ‘correction factor’ (which serves to scale the safety margin observed on 1-target trials up or down based on increased or decreased motor variability on 2-target trials) is close to 1. Specifically, this factor is 1.01±0.13 (mean±SEM) for Expt 2a and 1.04±0.09 for Expt 2b, if measured as mean(sigma2i/sigma1i), where sigma1i and sigma2i are the SDs of the initial movement directions on 1-target and 2-target trials. This factor is 1.02 for Expt 2a and 1.00 for Expt 2b, if instead measured as mean(sigma2i)/mean(sigma1i), and thus in either case, dropping it has little effect on the main population-averaged results for Expt 2 presented in Fig 4b in the main paper. Fig R1 shows versions of the PO model predictions in Fig 4b computed with or without dropping the sigma2/sigma1 ‘correction factor’ that the reviewer asks about. These with vs without versions are quite similar for the results from both Expt 2a and Expt 2b. In particular, the comparison between our experimental data and the population-average-based model predictions for the MA vs the PO hypotheses shows highly significant differences between the abilities of the MA and PO models to explain the experimental data in Expt 2b (Fig R1, right panel), whether or not the sigma2/sigma1 correction is included for the comparison between MA and PO predictions (p<10⁻¹³ whether or not the sigma2/sigma1 term is included, p=4.31×10⁻¹⁴ with it vs p=4.29×10⁻¹⁴ without it). Analogously, for Expt 2a (where we did not expect to show meaningful differences between the MA and PO model predictions), we also find highly consistent results when the sigma2/sigma1 term is included vs not (Fig R1, left panel) (p=0.37 for the comparison between PO and MA predictions with the sigma2/sigma1 term included vs 0.38 without it).
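
      As a minimal illustration of the two ways of computing the sigma2/sigma1 factor described above, the following sketch contrasts the mean of per-participant ratios with the ratio of the group means. It is not the analysis code used for the manuscript; the function name and the toy movement directions are hypothetical and chosen only to make the difference between the two definitions visible.

```python
from statistics import mean, stdev

def correction_factors(dirs_1target, dirs_2target):
    """Each argument holds one list of initial movement directions (deg) per participant."""
    sigma1 = [stdev(d) for d in dirs_1target]  # per-participant SD, 1-target trials
    sigma2 = [stdev(d) for d in dirs_2target]  # per-participant SD, 2-target trials
    mean_of_ratios = mean(s2 / s1 for s1, s2 in zip(sigma1, sigma2))
    ratio_of_means = mean(sigma2) / mean(sigma1)
    return mean_of_ratios, ratio_of_means

# Hypothetical data for two participants with opposite variability patterns.
dirs_1 = [[0.0, 2.0, 4.0], [0.0, 4.0, 8.0]]
dirs_2 = [[0.0, 4.0, 8.0], [0.0, 2.0, 4.0]]
mean_of_ratios, ratio_of_means = correction_factors(dirs_1, dirs_2)
print(mean_of_ratios, ratio_of_means)
```

      In this toy case the two definitions disagree because the participants differ in opposite directions, which is why it is worth reporting both, as done above.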

      Analysis of left-side vs right-side 1-target trial data indicates the existence of participant-specific spatial patterns of variability.

      With the participant-averaged data showing almost identical levels of motor variability on 1-target and 2-target trials, it is not surprising that about half of participants showed nominally greater variability on 1-target trials and about half showed nominally greater variability on 2-target trials. What was somewhat surprising, however, was that 16 of the 26 individual participants in Expt 2b displayed significantly higher variability in one condition or the other at α=0.05 (and 12/26 at α=0.01). Why might this be the case? We found an analogous result when breaking down the 1-target trial data into +30° (right-target) and -30° (left-target) trials that could offer an explanation. Note that the 2-target trial data come from intermediate movements toward the middle of the workspace, whereas the 1-target trial data come from right-side or left-side movements that are directed even more laterally than the +30° or -30° targets themselves (the average movement directions to these obstacle-obstructed lateral targets were +52.8° and -49.0°, respectively, in the Expt 2b data, see Fig 4a in the main paper for an illustration). Given the large separation between 1 & 2-target trials (~50°) and between left and right 1-target trials (~100°), differences in motor variability would not be surprising. The analyses illustrated in Figs R3-R6 show that these spatial differences indeed have large intra-individual effects on movement variability (Fig R3) and, critically, a large subsequent effect on the ability to predict the safety margin observed in one movement direction from motor variability observed at another (Figs R4-R6).

      Fig R3 shows evidence for intra-individual direction-dependent differences in motor variability, obtained by looking at the similarity between within-participant spatially-matched (e.g. left vs left or right vs right, Fig R3a) compared to spatially-mismatched (left vs right, Fig R3b) motor variability across individuals. To perform this analysis fairly, we separated the 60 left-side obstacle 1-target trial movements for each participant into those from odd-numbered vs even-numbered trials (30 each) to be compared. And we did the same thing for the 60 right-side obstacle 1-target trial movements. Fig R3a shows that there is a large (r=+0.70) and highly significant (p<10⁻⁶) across-participant correlation between the variability measured in the spatially-matched case, i.e. for the even vs odd trials from same-side movements, indicating that the measurement noise for measuring movement variability using n=30 movements (movement variability was measured by standard deviation) did not overwhelm inter-individual differences in movement variability.

      The strength of this correlation would increase/decrease if we had more/less data from each individual because that would decrease/increase the noise in measuring each individual’s variability. Therefore, to be fair, we maintained the same number of data points for each variability measurement (n=30) for the spatially-mismatched cases shown in Fig R3b and R3c. The strong positive relationship between odd-trial and even-trial variability across individuals that we observed in the spatially-matched case is completely obscured when the target direction is not controlled for (i.e. not maintained) within participants, even though left-target and right-target movements are randomly interspersed. In particular, Fig R3b shows that there remains only a small (r=+0.09) and non-significant (p>0.5) across-participant correlation between the variability measured for the even vs odd trials from opposite-side movements that have movement directions separated by ~100°. This indicates that idiosyncratic intra-individual spatial differences in motor variability are large and can even outweigh inter-individual differences in motor variability seen in Fig R3a. Fig R3c shows that an analogous effect holds between the laterally-directed 1-target trials and the more center-directed 2-target trials that have movement directions separated by ~50°. In this case, the correlation that remains when the target direction is not maintained within participants is also near zero (r=-0.13) and non-significant (p>0.3). It is possible that some other difference between 1-target & 2-target trials might also be at play here, but there is unlikely to be a meaningful effect from decision variability given the essentially equal group-average variability levels (Fig R2).
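
      The logic of the odd/even split-half comparison can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the toy data are constructed (not measured), and pooling left and right sides into one correlation is an assumption rather than a description of how Fig R3 was computed.

```python
import math
from statistics import stdev

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_sd(directions):
    """SD of odd-numbered trials and SD of even-numbered trials (1-indexed)."""
    return stdev(directions[0::2]), stdev(directions[1::2])

# Hypothetical data: each participant's movements are a fixed pattern scaled
# by a side-specific variability level (left and right levels are unrelated).
pattern = [-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
left_sigma, right_sigma = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 1.0, 3.0]
left = [split_half_sd([s * p for p in pattern]) for s in left_sigma]
right = [split_half_sd([s * p for p in pattern]) for s in right_sigma]

odd = [o for o, _ in left] + [o for o, _ in right]
even_same_side = [e for _, e in left] + [e for _, e in right]
even_other_side = [e for _, e in right] + [e for _, e in left]

r_matched = pearson_r(odd, even_same_side)      # same side: odd vs even trials
r_mismatched = pearson_r(odd, even_other_side)  # opposite side: odd vs even trials
print(r_matched, r_mismatched)
```

      Because both halves always contain the same number of trials, the measurement noise in each SD estimate is held constant across the matched and mismatched comparisons, which is the fairness constraint described above.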

      Analysis of left-side vs right-side 1-target trial data indicates that participant-specific spatial patterns of variability correspond to participant-specific spatial differences in safety margins.

      Critically, dissection of the 1-target trial data also shows that the direction-dependent differences in motor variability discussed above for right-side vs left-side movements predict direction-dependent differences in the safety margins. In particular, comparison of panels a & b in Fig R4 shows that motor variability, if measured on the same side (e.g. the right-side motor variability for the right-side safety margin), strongly predicts interindividual differences in safety margin (r=0.60, p<0.00001, see Fig R4b). However, motor variability, if measured on the other side (e.g. the right-side motor variability for the left-side safety margin), fails to predict interindividual differences in safety margin (r=0.15, p=0.29, see Fig R4a). These data show that taking the direction-specific motor variability into account, allows considerably more accurate individual predictions of the safety margins used for these movements. In line with that idea, we also find that interindividual differences in the % difference between the motor variability measured on the left-side vs the right-side predicts inter-individual differences in the % difference between the safety margin measured on the left-side vs the right-side as shown in Fig R4c (r=0.52, p=0.006).

      Analyses of both 1-target trial and 2-target trial data indicate that participant-specific spatial patterns of variability correspond to participant-specific spatial differences in safety margins.

      Not surprisingly, the spatial/directional specificity of the ability to predict safety margins from measurements of motor variability observed in the 1-target trial data in Fig R4 is present in the 2-target data as well. Comparison of panels a-d in Fig R5 shows that motor variability from 1-target and 2-target trial data in Expt 2b strongly predicts interindividual differences in 1-target and 2-target trial safety margins (r=0.72, p=3×10⁻⁵ for the 2-target trial data (see Fig R5d), r=0.59, p=1×10⁻³ for the 1-target trial data (see Fig R5a)).

      This is the case even though the 1-target and 2-target trial data display essentially equal population-averaged levels of motor variability. However, in Expt 2b, motor variability, if measured on 1-target trials, fails to predict inter-individual differences in the safety margin on 2-target trials (r=0.18, p=0.39, see Fig R5c), and motor variability, if measured on 2-target trials, fails to predict inter-individual differences in the safety margin on 1-target trials (r=-0.12, p=0.55, see Fig R5b). As an aside, note that Fig R5a is similar to Fig R4b in content, in that 1-target trial safety margins are plotted against motor variability levels in both cases. But in Fig R5a, the left and right-target data are averaged, whereas in Fig R4b the left and right-target data are both plotted, resulting in 2N data points. Also note that the correlations are similar, r=+0.59 vs r=+0.60, indicating that in both cases the amount of motor variability predicts the size of the safety margin.

      A final analysis indicating that the spatial specificity of motor variability rather than the presence of decision variability accounts for the ability to predict safety margins is shown in Fig R6. This analysis makes use of the contrast between Expt 2b (where there is a wide spatial separation (51° on average) between 1-target trials and 2-target trials because participants steer laterally around the Expt 2b 1-target trial obstacles, i.e. away from the center), and Expt 2a (where there is only a narrow spatial separation (10.4° on average) between the movement directions of 1-target trials and 2-target trials because participants steer medially around the Expt 2a 1-target trial obstacles, i.e. toward the center). If the spatial specificity of motor variability drove the ability to predict safety margins (and thus movement direction) on 2-target trials, then such predictions should be noticeably improved in Expt 2a compared to Expt 2b, because the spatial match between 1-target trials and 2-target trials is five-fold better in Expt 2a than in Expt 2b. Fig R6 shows that this is indeed the case. Specifically, comparison of the 3rd and 4th clusters of bars (i.e. the data on the right side of the plot) shows that the ability to predict 2-target trial safety margins from 1-target trial variability and conversely the ability to predict 1-target trial safety margins from 2-target trial variability are both substantially improved in Expt 2a compared to Expt 2b (compare the grey bars in the 4th vs the 3rd clusters of bars).

      Moreover, comparison of the 1st and 2nd clusters of bars (i.e. the data on the left side of the plot), shows that the ability to predict left 1-target trial safety margins from right 1-target trial variability and conversely the ability to predict right 1-target trial safety margins from left 1-target trial variability are also both substantially improved in Expt 2a compared to Expt 2b (compare the grey bars in the 1st vs the 2nd clusters of bars). This corresponds to a spatial separation between the movement directions on left vs right 1-target trials of 20.7° on average in Expt 2a in contrast to a much greater 102° in Expt 2b.

      The analyses illustrated in Figs R4-R6 make it clear that accurate prediction of interindividual differences in safety margins critically depend on spatially-specific information about motor variability, and we have, therefore, included this information for the analyses in the main paper, as it is especially important for the analysis of inter-individual differences in motor planning presented in Fig 5 of the manuscript.

      3) Equation 3 then becomes even more involved and I believe it constitutes somewhat of a distraction from the main story - namely that individual variations in the safety margin in the 1-target obstacle-obstructed movements should lead to opposite correlations under the PO and MA hypotheses with the safety margin observed in the uncertain 2-target movements (see Fig 5e). Given that the logic of the variance-correction factor (pt 2) remains shaky to me, these analyses seem to be quite removed from the main question and of minor interest to the main paper.

      The reviewer makes a good point. We agree that the original presentation made Equation 3 seem overly complex and possibly like a distraction as well. Based on the comment above and a number of comments and suggestions from Reviewer 2, we have now overhauled this content – streamlining it and making it clearer, in both motivation and presentation. Please see section 2.2 in the point-by-point response to reviewer 2 for details.

      Reviewer #2:

      The authors should be commended on the sharing of their data, the extensive experimental work, the experimental design that allows them to get opposite predictions for both hypotheses, and the detailed analyses of their results. Yet, the interpretation of the results should be more cautious as some aspects of the experimental design offer some limitations. A thorough sensitivity analysis is missing from experiment 2 as the safety margin seems to be critical to distinguish between both hypotheses. Finally, the readability of the paper could also be improved by limiting the use of abbreviations and motivating some of the analyses further.

      We thank the reviewer for the kind words and for their help with this manuscript.

      1) The text is difficult to read. This is partially due to the fact that the authors used many abbreviations (MA, PO, IMD). I would get rid of those as much as possible. Sometimes, having informative labels could also help FFcentral and FFlateral would be better than FFA and FFB.

      We have reduced the number of abbreviations used in the paper from 11 to 4 (Expt, FF, MA, PO), and we thank the reviewer for the nice suggestion about changing FFA and FFB to FFLATERAL and FFCENTER. We agree that the suggested terms are more informative and have incorporated them.

      2) The most difficult section to follow is the one at the end of the result sections where Fig.5 is discussed. This section consists of a series of complicated analyses that are weakly motivated and explained. This section (starting on line 506) appears important to me but is extremely difficult to follow. I believe that it is important as it shows that, at the individual level, PO is also superior to MA to predict the behavior but it is poorly written and even the corresponding panels are difficult to understand as points are superimposed on each other (5b and e). In this section, the authors mention correcting for Mu1b and correcting for Sig2i/Sig1Ai but I don't know what such correction means. Furthermore, the authors used some further analyses (Eq. 3 and 4) without providing any graphical support to follow their arguments. The link between these two equations is also unclear. Why did the authors used these equations on the pooled datasets from 2a and 2b ? Is this really valid ? It is also unclear why Mu1Ai can be written as the product of R1Ai and Sig1Ai. Where does this come from ?

      We agree with the reviewer that this analysis is important, and the previous explanation was not nearly as clear as it could have been. To address this, we have now overhauled the specifics of the context in Figure 5 and the corresponding text – streamlining the text and making it clearer, in both motivation and presentation (see lines 473-545 in the revised manuscript). In addition to the improved text, we have clarified and improved the equations presented for analysis of the ability of the performance optimization (PO) model to explain inter-individual differences in motor planning in uncertain conditions (i.e. on 2-target trials) and have provided more direct graphical support for them. Eq 4 from the original manuscript has been removed, and instead we have expanded our analyses on what was previously Eq 3 (now Eq 5 in the revised manuscript). We have more clearly introduced this equation as a hybrid between using group-averaged predictions and participant-individualized predictions, where the degree of individualization for all parameters is specified with the individuation index 𝑘. For example, a value of 1 for 𝑘 would indicate complete weighting of the individuated model predictors. The equation that follows in the revised manuscript, Eq 6, is a straightforward extension of Eq 5 where each model parameter was instead multiplied by a different individuation index. With this, we now present the partial-R² statistic associated with each model predictor (see revised Figs 5a and 5e) to elucidate the effect of each. We have, additionally, now plotted the relationships between each of the 3 model predictors and the inter-individual differences that remain when the other two predictors are controlled (see revised Figs 5b-d and Figs 5f-h). These analyses are all shown separately for each experiment, as per the reviewer’s suggestion, in the revised version of Fig 5.

      Overall, this section is now motivated and discussed in a more straightforward manner, and now provides better graphical support for the analyses reported in the manuscript. We feel that the revised analysis and presentation (1) more clearly shows the extent to which inter-individual differences in motor planning can be explained by the PO model, and (2) does a better job of breaking down how the individual factors in the model contribute to this. We sincerely thank the reviewer for helping us to make the paper easier to follow and better illustrated here.
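
      For readers unfamiliar with the partial-R² statistic mentioned above, the following sketch shows its standard definition: the fraction of the residual variance left by a reduced model (without the predictor of interest) that is explained once that predictor is added. The function names and numbers are hypothetical, and the sketch takes model predictions as given rather than reproducing the fitting procedure used in the manuscript.

```python
def sse(y, pred):
    """Sum of squared errors between observations and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred))

def partial_r2(y, pred_full, pred_reduced):
    """(SSE_reduced - SSE_full) / SSE_reduced for nested models."""
    sse_full = sse(y, pred_full)
    sse_reduced = sse(y, pred_reduced)
    return (sse_reduced - sse_full) / sse_reduced

# Hypothetical observations and predictions from two nested models.
y = [1.0, 2.0, 3.0, 4.0]
pred_reduced = [2.5, 2.5, 2.5, 2.5]  # model without the predictor of interest
pred_full = [1.5, 2.0, 3.0, 3.5]     # model with the predictor of interest
value = partial_r2(y, pred_full, pred_reduced)
print(value)
```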

      3) In experiment 1, does the presence of a central target not cue the participants to plan a first movement towards the center while such a central target was never present in other motor averaging experiment.

      Unfortunately, the reviewer is mistaken here, as central target locations were present in several other experiments that advocated for motor averaging which we cite in the paper. The central target was not present on any 2-target trials in our experiments, in line with previous work. It was only present on 1-target center-target trials.

      In the adaptation domain, people complain that asking where people are aiming would induce a larger explicit component. Similarly, one could wonder whether training the participants to a middle target would not induce a bias towards that target under uncertainty.

      Any “bias” of motor output towards the center target would predict an intermediate motor output which would favor neither model because our experiment designs result in predictions for motor output on different sides of center for 2-target trials in both Expt 1 and Expt 2b. Thus we think any such effect, if it were to occur, would simply reduce the amplitude of the result. However, we found an approximately full-sized effect, suggesting that this is not a key issue.

      4) The predictions linked to experiment 2 are highly dependent on the amount of safety margin that is considered. While the authors mention these limitations in their paper, I think that it is not presented with enough details. For instance, I would like to see a figure similar to Fig.4B when the safety margin is varied.

      We apologize for any confusion here. The reviewer seems to be under the impression that we can specifically manipulate safety margins around the obstacle in making model predictions for experiment 2. This is, however, not the case for either of the two safety margins in the performance-optimization (PO) modelling. Let us clarify. First, the safety margin on 1-target trials, which serves as input to the PO model, is experimentally measured on obstacle-present 1-target trials, and thus cannot be manipulated. Second, the predicted safety margin on 2-target trials is the output of the PO model and thus cannot be manipulated. There is only one parameter in the main PO model (the one for making the PO prediction for the group-average data presented in Fig 4b, see Eq 4), and that is the motor cost weighting coefficient (𝛽). 𝛽 is implicitly present in Eq 2 as well, fixed at 1/2 in this baseline version of the PO model. It is of course true that changing the motor cost weighting will affect the model output (the predicted 2-target trial safety margin), but we do not think that the reviewer is referring to that here, since he or she asks about that directly in section 2.4.4 and in section 2.4.6 below, where we provide the additional analysis requested.

      For exp1, it would be good to demonstrate that, even when varying the weight of the two one-target profiles for motor averaging, one never gets a prediction that is close to what is observed.

      Here the reviewer is referring to an apparent inconsistency between our analysis of Expts 1 and 2, because in Expt 2 (but not in Expt 1) we examine the effect of varying the relative weight of the two 1-target trials for motor averaging. However, we only withheld this analysis in Expt 1 because it would have little effect. Unlike Expt 2, the measured motor output on left and right 1-target trials in Expt 1 is remarkably similar (see the left panel in Fig R7a below (which is based on Fig 2b from the manuscript)). This is because left and right 1-target trials in Expt 1 were adapted to the same FF perturbation (FFLATERAL in both cases), whereas left and right 1-target trials in Expt 2 received very different perturbation levels, because one of these targets was obstacle-obstructed and the other was not. Therefore, varying the relative weightings in Expt 1 would have little effect on the MA prediction as shown in Fig R7b at right. We now realize that this point was not explained to readers, and we have now modified the text in the results section where the analysis of Expt 1 is discussed in order to include a summary of the explanation offered above. We thank the reviewer for surfacing this.
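
      The argument above, that reweighting the two 1-target profiles matters little when those profiles are nearly identical, can be illustrated with a small sketch. It assumes the MA prediction is a weighted average of the left and right 1-target profiles; the function names, weights, and profile values are hypothetical and are not taken from the data in Fig R7.

```python
def ma_prediction(left_profile, right_profile, w_left=0.5):
    """Weighted average of the left and right 1-target profiles."""
    return [w_left * l + (1.0 - w_left) * r
            for l, r in zip(left_profile, right_profile)]

def max_spread(left_profile, right_profile, weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Largest change in the MA prediction at any time point across the weights."""
    preds = [ma_prediction(left_profile, right_profile, w) for w in weights]
    n = len(left_profile)
    return max(max(p[i] for p in preds) - min(p[i] for p in preds) for i in range(n))

# Hypothetical profiles: nearly identical (as for the same FF on both sides)
# versus opposite in sign (as for very different perturbation levels).
left = [0.0, 1.0, 2.0, 1.0, 0.0]
spread_similar = max_spread(left, [0.0, 1.1, 2.1, 1.1, 0.0])
spread_different = max_spread(left, [0.0, -1.0, -2.0, -1.0, 0.0])
print(spread_similar, spread_different)
```

      With similar profiles, the prediction can never move by more than the small gap between the two profiles, whatever the weighting; with dissimilar profiles, the weighting dominates the prediction.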

      It is unclear in the text that the performance optimization prediction simply consists of the force-profile for the center target. The authors should motivate this choice.

      We’re a bit unclear about this comment. This specific point is addressed in the first paragraph under the Results section, the second paragraph under the subsection titled “Adaptation to novel physical dynamics can elucidate the mechanisms for motor planning under uncertainty”, the Figure 2 captions, and in the second paragraph under the subsection titled “Adaptation to a multi-FF environment reveals that motor planning during uncertainty occurs via performance-optimization rather than motor averaging”. Direct quotes from the original manuscript are below:

      Line 143: “However, PO predicts that these intermediate movements should be planned so that they travel towards the midpoint of the potential targets in order to maximize the probability of final target acquisition. This would, in contrast to MA, predict that intermediate movements incorporate the learned adaptive response to FFB, appropriate for center-directed movements, allowing us to decisively dissociate PO from MA.”

      Line 200: “In contrast, PO would predict that participants produce the force pattern (FFB) appropriate for optimizing the planned intermediate movement since this movement maximizes the probability of successful target acquisition5,34 (Fig 1d, right).”

      Line 274: “The 2-target trial MA prediction corresponds to the average of the force profiles (adaptive responses) associated with the left and right 1-target EC trials plotted in Fig 2b, whereas the 2-target trial PO prediction corresponds to the force profile associated with the center target plotted in Fig 2b, as this is appropriate for optimizing a planned intermediate movement.”

      For the second experiment 2, the authors do not present a systematic sensitivity analysis. Fig. 5a and d is a good first step but they should also fit the data on exp2b and see how this could explain the behavior in exp 2a. Second, the authors should present the results of the sensitivity analysis like they did for the main predictions in Fig.4b.

      We thank the reviewer for these suggestions. We have now included a more-complete analysis in Fig R8 below, and presented it in the format of Fig 4b as suggested. Please note that we have included the analysis requested above in a revised version of Fig 4b in the manuscript, and a related analysis requested in section 2.4.6 in the supplementary materials.

      Specifically, the partial version of the analysis that had been presented (where the cost weighting for PO as well as the target weighting for MA were fit on Expt 2a and cross-validated using the Expt 2b data, but not conversely fit on Expt 2b and tested on Expt 2a) was expanded to include cross-validation of the Expt 2b fit using the Expt 2a data. As expected, the results from the converse analysis (Expt 2b → Expt 2a) mirror the results from the original analysis (Expt 2a → Expt 2b) for the cost weighting in the PO model, where the cross-validated predictions increased the self-fit mean squared prediction errors only modestly, by 11% for the Expt 2a data, and by 29% for the Expt 2b data. In contrast, for the target weighting in the MA model, the cross-validated predictions did not explain the data well, increasing the self-fit mean squared prediction errors by 115% for the Expt 2a data, and by 750% for the Expt 2b data. Please see lines 411-470 in the main paper for a full analysis.
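
      The cross-validation comparison described above can be sketched generically as follows: a single parameter is fit on one dataset, applied to the other, and the resulting mean squared error is expressed as a percent increase over the self-fit error. This is a toy sketch under stated assumptions; the one-parameter linear model, the grid search, and the data are hypothetical stand-ins and not the PO or MA models from the manuscript.

```python
def mse(param, xs, ys, predict):
    """Mean squared prediction error for a given parameter value."""
    return sum((predict(param, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, predict, grid):
    """Grid search for the parameter value that minimizes the MSE."""
    return min(grid, key=lambda p: mse(p, xs, ys, predict))

def cross_validation_penalty(train, test, predict, grid):
    """Percent increase in test-set MSE when the parameter is fit on `train`
    instead of on `test` itself."""
    p_cross = fit(*train, predict, grid)
    p_self = fit(*test, predict, grid)
    mse_self = mse(p_self, *test, predict)
    mse_cross = mse(p_cross, *test, predict)
    return 100.0 * (mse_cross - mse_self) / mse_self

# Hypothetical stand-in model and datasets.
predict = lambda b, x: b * x
grid = [i / 100 for i in range(0, 201)]
data_a = ([1.0, 2.0, 3.0], [1.1, 1.9, 3.0])
data_b = ([1.0, 2.0, 3.0], [1.2, 2.0, 3.4])
penalty = cross_validation_penalty(data_a, data_b, predict, grid)
print(penalty)
```

      A small penalty indicates that the parameter generalizes across datasets, whereas a large penalty indicates that the fitted value is specific to the dataset it was fit on.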

      While I understand where the computation of the safety margin in eq.2 comes from, reducing the safety margin would make the predictions linked to the performance optimization look more and more towards the motor averaging predictions. How bad becomes the fit of the data then ?

      We think that this is essentially the same question as that asked above in section 2.4.1. Please see our response in that section above. If that response doesn’t adequately answer this question, please let us know!

      How does the predictions look like if the motor costs are unbalanced (66 vs. 33%, 50 vs. 50% (current prediction), 33 vs. 66% ). What if, in Eq.2 the slope of the relationship was twice larger, twice smaller, etc.

      Fig R8 above shows how PO prediction would change using the 2:1 (66:33) and 1:2 (33:66) weightings suggested by the reviewer here, in comparison to the 1:1 weighting present in the original manuscript, the Expt 2a best fit weighting present in the original manuscript, and the Expt 2b best fit weighting that the reviewer suggested we include in section 2.4.2. Please note that this figure is now included as a supplementary figure to accompany the revised manuscript.

      The safety margin is the crucial element here. If it gets smaller and smaller, the PO prediction would look more and more like the MA predictions. This needs to be discussed in details. I also have the impression that the safety margin measured in exp 2a (single target trials) could be used for the PO predictions as they are both on the right side of the obstacle.

      We again apologize for the confusion. We are already using safety margin measurements to make PO predictions. Specifically, within Expt 2a, we use safety margin measurements from 1-target trials (in conjunction with variability measurements on 1 & 2 target trials) to estimate safety margins on 2-target trials. And analogously within Expt 2b, we use safety margin measurements from 1-target trials (in conjunction with variability measurements on 1 & 2 target trials) to estimate safety margins on 2-target trials. Fig 4b in the main paper shows the results of this prediction (and it now also includes the cross-validated predictions of the refined models as requested in Section 2.4.4 above). Relatedly, Fig R1 in this letter shows that, at the group-average level, these predictions for 2-target trial behavior in both Expt 2a and Expt 2b are essentially identical whether they are based solely on the safety margins observed on 1-target trials or on these safety margins corrected for the relative motor variabilities on 1-target and 2-target trials.

      5) On several occasions (e.g. line 131), the authors mention that their result prove that humans form a single motor plan. They don't have any evidence for this specific aspect as they can only see the plan that is expressed. They can prove that the latter is linked to performance optimization and not to the motor averaging one. But the absence of motor averaging does not preclude the existence of other motor plans…. Line 325 is the right interpretation.

      Thanks for catching this. We agree and have now revised the text accordingly (see for example, lines 53, 134, and 693-695 in the revised manuscript).

      6) Line 228: the authors mention that there is no difference in adaptation between training and test periods but this does not seem to be true for the central target. How does that affect the interpretation of the 2-target trials data ? Would that explain the remaining small discrepancy between the refined PO prediction and the data (Fig.2f) ?

      There must be some confusion here. The adaptation levels in the training period and the test period data from the central target are indeed quite similar, with only a <10% nominal difference in adaptation between them that is not close to statistically significant (p=0.14). We also found similar adaptation levels between the training and test epochs for the lateral targets (p=0.65 for the left target and p=0.20 for the right target). We further note that the PO predictions are based on test period data. And so, even if there were a clear decrease in adaptation between training and test periods, it would not affect the fidelity of the predictions or present a problem, except in the extreme hypothetical case where the reduction was so great that the test period adaptation was not clearly different from zero (as that would infringe on the ability of the paradigm to make clearly opposite predictions for the MA and PO models) – but that is certainly not the case in our data.

      Reviewer #3:

      In this study, Alhussein and Smith provide two strong tests of competing hypotheses about motor planning under uncertainty: Averaging of multiple alternative plans (MA) versus optimization of motor performance (PO). In this first study, they used a force field adaptation paradigm to test this question, asking if observed intermediate movements between competing reach goals reflected the average of adapted plans to each goal, or a deliberate plan toward the middle direction. In the second experiment, they tested an obstacle avoidance task, asking if obstacle avoidance behaviors were averaged with respect to movements to non-obstructed targets, or modulated to afford optimal intermediate movements based on a computed "safety margin." In both experiments the authors observed data consistent with the PO hypothesis, and contradictory of the MA hypothesis. The authors thus conclude that MA is not a feasible hypothesis concerning motor planning under uncertainty; rather, people appear to generate a single plan that is optimized for the task at hand.

      I am of two minds about this (very nice) study. On the one hand, I think it is probably the most elegant examination of the MA idea to date, and presents perhaps the strongest behavioral evidence (within a single study) against it. The methods are sound, the analysis is rigorous, and it is clearly written/presented. Moreover, it seems to stress-test the PO idea more than previous work. On the other hand, it is hard for me to see a high degree of novelty here, given recent studies on the same topic (e.g. Haith et al., 2015; Wong & Haith, 2017; Dekleva et al., 2018). That is, I think these would be more novel findings if the motor-averaging concept had not been very recently "wounded" multiple times.

      We thank the reviewer for the kind words and for their help with this manuscript.

      The authors dutifully cite these papers, and offer the following reasons that one of those particular studies fell short (I acknowledge that there may be other reasons that are not as explicitly stated): On line 628, it is argued that Wong & Haith (2017) allowed for across-condition (i.e., timing/spacing constraints) strategic adjustments, such as guessing the cued target location at the start of the trial. It is then stated that, "While this would indeed improve performance and could therefore be considered a type of performance-optimization, such strategic decision making does not provide information about the implicit neural processing involved in programming the motor output for the intermediate movements that are normally planned under uncertain conditions." I'm not quite sure the current paper does this either? For example, in Exp 1, if people deliberately strategize to simply plan towards the middle on 2-target trials and feedback-correct after the cue is revealed (there is no clear evidence against them doing this), what do the results necessarily say about "implicit neural processing?" If I deliberately plan to the intermediate direction, is it surprising that my responses would inherit the implicit FF adaptation responses from the associated intermediate learning trials, especially in light of evidence for movement- and/or plan-based representations in motor adaptation (Castro et al., 2011; Hirashima & Nozaki, 2012; Day et al., 2016; Sheahan et al., 2016)?

      The reviewer has a completely fair point here, and we agree that the experiments in the current study are amenable to explicit strategization. Thus, without further work, we cannot claim that the current results are exclusively driven by implicit neural processing.

      As the reviewer alludes to below, the possibility that the current results are driven by explicit processes in addition to or instead of implicit ones does not directly impact any of the analyses we present – or the general finding that performance-optimization, not motor averaging, underlies motor planning during uncertainty. Nonetheless, we have added a section in the discussion section to acknowledge this limitation. Furthermore, we highlight previous work demonstrating that restriction of movement preparation time suppresses explicit strategization (as the reviewer hints at below), and we suggest leveraging this finding in future work to investigate how motor output during goal uncertainty might be influenced under such constraints. This portion of the discussion section is quoted below:

      “An important consideration for the present results is that sensorimotor control engages both implicit and explicit adaptive processes to generate motor output47. Because motor output reflects combined contributions of these processes, determining their individual contributions can be difficult. In particular, the experiments in the present study used environmental perturbations to induce adaptive changes in motor output, but these changes may have been partially driven by explicit strategies, and thus the extent to which the motor output measured on 2-target trials reflects implicit vs explicit feedforward motor planning requires further investigation. One method for examining implicit motor planning during goal uncertainty might take inspiration from recent work showing that in visuomotor rotation tasks, restricting the amount of time available to prepare a movement appears to limit explicit strategization from contributing to the motor response48–51. Future work could dissociate the effects of MA and PO on intermediate movements in uncertain conditions at movement preparation times short enough to isolate implicit motor planning.”

      In that same vein, the Gallivan et al 2017 study is cited as evidence that intermediate movements are by nature implicit. First, it seems that this consideration would be necessarily task/design-dependent. Second, that original assumption rests on the idea that a 30˚ gradual visuomotor rotation would never reach explicit awareness or alter deliberate planning, an assumption which I'm not convinced is solid.

      We generally agree with the reviewer here. We might add that in addition to introducing the perturbation gradually, Gallivan and colleagues enforced a short movement preparation time (325ms). However, we agree that the extent to which explicit strategies contribute to motor output should clearly vary from one motor task to another, and on this basis alone, the Gallivan et al 2017 study should not be cited as evidence that intermediate movements must universally reflect implicit motor planning. We have explained this limitation in the discussion section (see quote below) and have revised the manuscript accordingly.

      “We note that Gallivan et al. 2017 attempted to control for the effects of explicit strategies by (1) applying the perturbation gradually, so that it might escape conscious awareness, and (2) enforcing a 325ms preparation time. Intermediate movements persisted under these conditions, suggesting that intermediate movements during goal uncertainty may indeed be driven by implicit processes. However, it is difficult to be certain whether explicit strategy use was, in fact, effectively suppressed, as the study did not assess whether participants were indeed unaware of the perturbation, and the preparation times used were considerably larger than the 222ms threshold shown to effectively eliminate explicit contributions to motor output."

      The Haith et al., 2015 study does not receive the same attention as the 2017 study, though I imagine the critique would be similar. However, that study uses unpredictable target jumps and short preparation times which, in theory, should limit explicit planning while also getting at uncertainty. I think the authors could describe further reasons that that paper does not convince them about a PO mechanism.

      We had omitted a detailed discussion of the Haith et al 2015 study as we think that the key findings, while interesting, have little to do with motor planning under uncertainty. But we now realize that we owe readers an explanation of our thoughts about it, which we have now included in the Discussion. This paragraph is quoted below, and we believe it provides a compelling reason why the Haith et al. 2015 study could be more convincing about PO for motor planning during uncertainty.

      “Haith and colleagues (2015) examined motor planning under uncertainty using a timed-response reaching task where the target suddenly shifted on a fraction (30%) of trials 150-550ms before movement initiation. The authors observed intermediate movements when the target shift was modest (±45°), but direct movements towards either the original or shifted target position when the shift was large (±135°). The authors argued that, because intermediate movements were not observed under conditions in which they would impair task performance, motor planning under uncertainty generally reflects performance-optimization. This interpretation is somewhat problematic, however. In this task, like in the current study, the goal location was uncertain when initially presented; however, the final target was presented far enough before movement onset that this uncertainty was no longer present during the movement itself, as evidenced by the direct-to-target motion observed when the target location was shifted by ±135°. Therefore the intermediate movements observed when the target location shifted by ±45° are unlikely to reflect motor planning under uncertain conditions. Instead, these intermediate movements likely arose from a motor decision to supplement the plan elicited by the initial target presentation with a corrective augmentation when the plan for this augmentation was certain. The results thus provide beautiful evidence for the ability of the motor system to flexibly modulate the correction of existing motor plans, ranging from complete inhibition to conservative augmentation, when new information becomes available, but provide little information about the mechanisms for motor planning under uncertain conditions.”

      If the participants in Exp 2 were asked both "did you switch which side of the obstacle you went around" and "why did you do that [if yes to question 1]", what do the authors suppose they would say? It's possible that they would typically be aware of their decision to alter their plan (i.e., swoop around the other way) to optimize success. This is of course an empirical question. If true, it wouldn't hurt the authors' analysis in any way. However, I think it might de-tooth the complaint that e.g. the Wong & Haith study is too "explicit."

      The participants in Expts 1, 2a, and 2b were all distinct, so there was no side-switching between experiments per se. However, the reviewer’s point is well taken. Although we didn’t survey participants, it’s hard to imagine that any were unaware of which side they traveled around the obstacle in Expt 2. Certainly, there was some level of awareness in our experiments, and while we would like to believe that the main findings arose from low-level, implicit motor planning, we frankly do not know the extent to which our findings may have depended on explicit planning. We have now clarified this key point and discussed its implications in the discussion section of the revised paper. That said, we do still think that the direct-to-target movements in the Wong and Haith study were likely the result of a strategic approach to salvaging some reward in their task. Please see the new section in the discussion titled “Implicit and explicit contributions to motor planning under uncertainty”, which for convenience is copied below:

      Implicit and explicit contributions to motor planning under uncertainty

      An important consideration for the present results is that sensorimotor control engages both implicit and explicit adaptive processes to generate motor output. Because motor output reflects combined contributions of these processes, determining their individual contributions can be difficult. In particular, the experiments in the present study used environmental perturbations to induce adaptive changes in motor output, but these changes may have been partially driven by explicit strategies, and thus the extent to which the motor output measured on 2-target trials reflects implicit vs explicit feedforward motor planning requires further investigation. One method for examining implicit motor planning during goal uncertainty might take inspiration from recent work showing that in visuomotor rotation tasks, restricting the amount of time available to prepare a movement appears to limit explicit strategization from contributing to the motor response. Future work could dissociate the effects of MA and PO on intermediate movements in uncertain conditions at movement preparation times short enough to isolate implicit motor planning.

      We note that Gallivan et al. 2017 attempted to control for the effects of explicit strategies by (1) applying the perturbation gradually, so that it might escape conscious awareness, and (2) enforcing a 325ms preparation time. Intermediate movements persisted under these conditions, suggesting that intermediate movements during goal uncertainty may indeed be driven by implicit processes. However, it is difficult to be certain whether explicit strategy use was, in fact, effectively suppressed, as the study did not assess whether participants were indeed unaware of the perturbation, and the preparation times used were considerably larger than the 222ms threshold shown to effectively eliminate explicit contributions to motor output.

    1. Author Response

      Reviewer #2 (Public Review):

      This study evaluates the causal relationship between childhood obesity on the one hand, and childhood emotional and behavioral problems on the other. It applies Mendelian Randomization (MR), a family of methods in statistical genetics that uses genetic markers to break the symmetry between correlated traits, allowing inference of causation rather than mere correlation. The authors argue convincingly that previous studies of these traits, both those using non-genetic observational epidemiology methods and those using standard MR methods, may be confounded by demographic effects and familial effects. One possible example of this kind of confounding is the idea that obesity in parents may contribute to emotional and behavioral problems in children; another is the idea that adults with emotional and behavioral issues may be more likely to have children with partners who are obese, and vice-versa. They then make use of a recently proposed "within-family" MR method, which should effectively control for these confounders, at the cost of higher uncertainty in the estimated effect size, and therefore lower power to detect small effects. They report that none of the previously reported associations of childhood BMI with anxiety, depression, or ADHD are replicated using the within-family MR method, and that in the case of depression the primary association appears to be with maternal BMI rather than the child's own BMI.

      This argument that these confounders may affect these phenotypes is fairly sound, and within-family MR should indeed do a good job of controlling for them. I do not see any major issues with the cohort itself or the choice of genetic instruments. I also do not see any major issues with the definitions or ascertainment of the phenotypes studied, though I am not an expert on any of these phenotypes in particular. I am especially satisfied with the series of analyses demonstrating that the results are robust to many variations of MR methodology. Overall, I think the positive result this study reports is very credible: that the known association between childhood BMI and depression is likely primarily due to an effect of maternal BMI rather than the child's own BMI (though given that paternal BMI has a similar effect size with only a slightly wider confidence interval, I would instead say that the effect is from parental BMI generally, not specifically maternal.)

      In the updated results based on the larger genetic data release, the estimates for the association of maternal BMI and paternal BMI with the child’s depressive symptoms are more clearly different than they were in the smaller dataset (for maternal BMI, beta= 0.11, CI:0.02,0.19, p=0.01; for paternal BMI, beta=0.02, CI:-0.09,0.12, p=0.71). Therefore, in this version, it makes sense to note an association with maternal BMI specifically.

      The main weakness of the study comes from its negative results, which the authors emphasize as their primary conclusion: that previously reported associations of childhood BMI with anxiety, depression, and ADHD are not replicated using within-family MR methods. These claims do not seem justified by the evidence presented in this study. In fact, in every panel of figures 2 and 3, the error bars for the within-family MR analysis encompass the estimates for both the regression analysis and the traditional MR analysis, suggesting that the within-family analysis provides no evidence one way or another about which of these analyses is more accurate. More generally, in order to convincingly claim that there is no causal relationship between two traits, an MR study must argue that the study would be powered to detect a relationship if one existed. Within-family MR methods are known to have less power to detect associations and less precision to estimate effect sizes than traditional MR methods or traditional observational epidemiology methods, so it is not sufficient to show that these other methods have power to detect the association. To make this kind of claim, it is necessary to include some kind of power analysis, such as a simulation study or analytic power calculations, and likely also a positive control to show that this method does have power to detect known effects in this cohort.

      We agree that it is imperative that negative (i.e. “non-significant”) results are correctly interpreted - it is just as important to discover what is unlikely to affect emotional and behavioural outcomes as what does affect them. Negative results (non-significant estimates) are neither a weakness nor strength of the study, but simply reflect the estimation error in our analysis of the data. The key question is whether our within-family MR estimates are sufficiently powered to detect effect sizes of interest or rule out clinically meaningful effect sizes – or are they simply too imprecise to draw any conclusions? As the reviewer suggests, one way to address this is via a post-hoc power calculation. We consider post-hoc power calculations redundant, since all the information about the power of our analysis is reflected in the standard errors and reported confidence intervals. Moreover, any post-hoc power calculation will be necessarily approximate compared to using the standard errors and confidence intervals which we report.

      Despite these methodological reservations, we have conducted simulations to estimate the power of our within-family models (the R code is included at the end of this document). These simulations indicate that we do have sufficient power to detect the size of effects seen for depressive symptoms and ADHD in models using the adult BMI PGS. They also indicate that we cannot rule out smaller effects for non-significant associations (e.g., for the impact of the child’s BMI on anxiety). Naturally, this is entirely consistent with the width of the confidence intervals reported in results tables and in Figures 1 and 2. However, although power calculations are important when planning a study, they make little contribution to interpretation once a study has been conducted and confidence intervals are available (e.g., https://psyarxiv.com/tcqrn/). For this reason, we comment on these simulations in this response to reviewers but do not include them in the manuscript or supplementary materials. At the same time, we have changed the language used in the manuscript to be clearer that the results were imprecise and that values contained within the confidence limits cannot be ruled out.

      For example, the discussion now includes the following:

      ‘However, within-family MR estimates using the childhood body size PGS are still consistent with small effects of the child’s BMI on all outcomes, with upper confidence limits around a 0.2 standard-deviation increase in the outcome per 5 kg/m² increase in BMI.’

      And the conclusion of the paper now reads:

      ‘Our results suggest that genetic variation associated with BMI in adulthood affects a child’s depressive and ADHD symptoms, but genetic variation associated with recalled childhood body size does not substantially affect these outcomes. There was little evidence that BMI affects anxiety. However, our estimates were imprecise, and these differences may be due to estimation error. There was little evidence that parental BMI affects a child’s ADHD or anxiety symptoms, but factors associated with maternal BMI may independently influence a child’s depressive symptoms. Genetic studies using unrelated individuals, or polygenic scores for adult BMI, may have overestimated the causal effects of a child’s own BMI.’
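
      For readers who want to see what a simulation-based power estimate of this kind involves, the following is a minimal Python sketch. It is not the R code referred to above. It assumes standardised polygenic scores, random mating, a child score equal to the mid-parent value plus independent segregation noise, and a reduced-form regression of the outcome on the child's score with both parental scores as covariates. The function name `simulate_power`, the effect sizes, and the sample sizes are all hypothetical.

```python
import numpy as np

def simulate_power(n_trios, effect, n_sims=500, z_crit=1.96, seed=0):
    """Share of simulated trio datasets in which the child's score is
    significant after adjusting for both parents' scores (illustrative only)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        mother = rng.standard_normal(n_trios)
        father = rng.standard_normal(n_trios)
        # Child's score: mid-parent value plus segregation noise, so that
        # its variance is 0.25 + 0.25 + 0.5 = 1 under random mating.
        child = 0.5 * (mother + father) + rng.normal(0.0, np.sqrt(0.5), n_trios)
        # Assumed data-generating model: only the child's score matters.
        outcome = effect * child + rng.standard_normal(n_trios)

        # Ordinary least squares with intercept, child, mother, father.
        X = np.column_stack([np.ones(n_trios), child, mother, father])
        coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
        resid = outcome - X @ coef
        sigma2 = resid @ resid / (n_trios - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)
        z = coef[1] / np.sqrt(cov[1, 1])
        hits += abs(z) > z_crit
    return hits / n_sims

if __name__ == "__main__":
    for effect in (0.0, 0.05, 0.1):
        print(effect, simulate_power(n_trios=2000, effect=effect))
```

      Because adjusting for parental scores leaves only the segregation variance in the child's score, the sketch also illustrates why within-family estimates are less precise than estimates from unrelated individuals at the same sample size.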

      Regarding a positive control: for analyses of BMI in adults, suitable positive controls would include directly measured biomarkers such as fat mass or blood pressure or reported medical outcomes like type 2 diabetes. In adolescents and younger adults, age at menarche or other measures of puberty can be used, as these are reliably influenced by BMI. However, the age of the participants for whom within-family effects are being estimated (8 years), together with the lack of any biomarkers such as fat mass (due to the questionnaire-based survey design) mean no suitable measures are available.

      Reviewer #3 (Public Review):

      Higher BMI in childhood is correlated with behavioral problems (e.g. depression and ADHD) and some studies have shown that this relationship may be causal using Mendelian Randomization (MR). However, traditional MR is susceptible to bias due to population stratification, assortative mating, and indirect effects (dynastic effects). To address this issue, Hughes et al. use within-family MR, which should be immune to the above-listed problems. They were unable to find a causal relationship between children's BMI and depression, anxiety, or ADHD. They do, however, report a causal effect of mother's BMI on depression in their children. They conclude that the causal effect of children's BMI on behavioral phenotypes such as depression and anxiety, if present, is very small, and may have been overestimated in previous studies. The analyses have been carried out carefully in a large sample and the paper is presented clearly. Overall, their assertions are justified but given that the conclusions mostly rest on an absence of an effect, I would like to see more discussion on statistical power.

      1) The authors show that the estimates of within-family MR are imprecise. It would be helpful to know how much power they have for estimating effect sizes reported previously given their sample size.

      As discussed in response to a comment from reviewer 2, the power of our results is already indicated by our standard errors and confidence intervals. Nevertheless, we conducted simulations to estimate the size of effects which we had 80% power to detect. Results, presented below, are consistent with our main results. As discussed in response to a comment from reviewer 2, we consider post-hoc power calculations redundant when standard errors and confidence intervals are reported; for this reason, we include this information in the response to reviewers but not the manuscript itself.
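
      The link between a reported confidence interval and the effect size detectable with 80% power can be sketched with a short calculation. The Python below is an illustration under a normal approximation with a symmetric interval, not the simulation described above. The helper names `se_from_ci` and `min_detectable_effect` are hypothetical. The interval used is the one quoted earlier in this response for maternal BMI and the child's depressive symptoms (0.02 to 0.19).

```python
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Back out a standard error from a symmetric normal-theory interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)

def min_detectable_effect(se, power=0.80, alpha=0.05):
    """Smallest true effect detected with the given power in a two-sided
    z-test, using the usual (z_alpha + z_power) * SE approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * se

# Interval quoted above for maternal BMI and the child's depressive symptoms.
se = se_from_ci(0.02, 0.19)
mde = min_detectable_effect(se)
print(f"SE ~ {se:.3f}, effect detectable with 80% power ~ {mde:.2f}")
```

      Under these assumptions the detectable effect is about 2.8 standard errors, which is why the standard errors and confidence intervals already carry the information that a post-hoc power calculation would report.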

      2) They used the correlation between PGS and BMI to support the assertion that the former is a strong instrument. Were the reported correlations calculated across all individuals? Since we know that stratification, assortative mating, and indirect effects can inflate these correlations, perhaps a more unbiased estimate would be the proportion of children's BMI variance explained by their PGS conditioned on the parents' PGS. This should also be the estimate used in power calculations.

      The manuscript has been updated to quote Sanderson-Windmeijer conditional R2 values: the proportion of BMI variance explained by the BMI PGS for each member of a trio, conditional on the PGS of the other members of the trio, and all genetic covariates included in within-family models. Similarly, we now show Sanderson-Windmeijer conditional F-statistics for a model including the child, mother, and father’s BMI instrumented by the child, mother, and father’s PGS.
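
      The quantity the reviewer describes, the share of the child's BMI variance explained by the child's PGS once the parents' PGS are accounted for, can be illustrated with a plain partial R². The Python sketch below is a simplification and does not implement the Sanderson-Windmeijer statistics. The simulated trio data, the coefficient 0.3, and the function names are all assumptions made for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (an intercept is added here)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid.var() / y.var()

def partial_r_squared(target, covariates, y):
    """Share of the variance left after the covariates that the target explains."""
    r2_reduced = r_squared(covariates, y)
    r2_full = r_squared(np.column_stack([covariates, target]), y)
    return (r2_full - r2_reduced) / (1.0 - r2_reduced)

# Hypothetical trio data: standardised scores, random mating, no confounding.
rng = np.random.default_rng(0)
n = 20000
mother = rng.standard_normal(n)
father = rng.standard_normal(n)
child = 0.5 * (mother + father) + rng.normal(0.0, np.sqrt(0.5), n)
bmi = 0.3 * child + rng.standard_normal(n)

marginal = r_squared(child, bmi)
conditional = partial_r_squared(child, np.column_stack([mother, father]), bmi)
print(f"marginal R2 = {marginal:.3f}, conditional on parents = {conditional:.3f}")
```

      In this toy setting the conditional value is smaller than the marginal one simply because half of the variance in the child's score is shared with the parents, which is the reason a conditional measure is the relevant one for within-family instrument strength.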

      3) In testing the association of mothers' and fathers' BMI with children's symptoms, the authors used a multivariable linear regression conditioning on the child's own BMI. Was the other parent's BMI (either by itself or using the polygenic score) included as a covariate in the multivariable and MR models? This was not entirely clear from the text or from Fig. 2. I suspect that if there were assortative mating on BMI in the parent's generation, the effect of any one parent's BMI on the child's symptoms might be inflated unless the other parent's BMI was included as a covariate (assuming both mother's and father's BMI affect the child's symptoms).

      Non-genetic models include both the mother and father’s phenotypic BMI as well as the child’s, allowing estimation of conditional effects of all three. This controls for assortative mating as noted by the reviewer. This was not previously clear - all relevant text and figure captions have been updated to clarify this.

      4) They report no evidence of cross-trait assortative mating in the parents' generation. The power to detect cross-trait assortative mating in the parents' generation using PGS would depend on the actual strength of assortative mating and the respective proportions of trait variance explained by PGS. Could the authors provide an estimate of the power for this test in their sample?

      We have updated the discussion of assortative mating (in both the results and the discussion section) to note possible limitations of power and clarify that this approach to examining assortment may not capture its full extent.

      The relevant part of the results section now reads:

      “In the parents’ generation, phenotypes were associated within parental pairs, consistent with assortative mating on these traits (Appendix 1 – Table 5). Adjusted for ancestry and other genetic covariates, maternal and paternal BMI were positively associated (beta: 0.23, 95%CI: 0.22,0.25, p<0.001), as were maternal and paternal depressive symptoms (beta: 0.18, 95%CI: 0.16,0.20, p<0.001), and maternal and paternal ADHD symptoms (beta: 0.11, 95%CI: 0.09,0.13, p<0.001). Consistent with cross-trait assortative mating, there was an association of mother’s BMI with father’s ADHD symptoms (beta: 0.03, 95%CI: 0.02,0.05, p<0.001) and mother’s ADHD symptoms with father’s depressive symptoms (beta: 0.05, 95%CI: 0.05,0.06, p<0.001). Phenotypic associations can reflect the influence of one partner on another as well as selection into partnerships, but regression models of paternal polygenic scores on maternal polygenic scores also pointed to a degree of assortative mating. Adjusted for ancestry and genotyping covariates, there were small associations between parents’ BMI polygenic scores (beta: 0.01, 95%CI: 0.00,0.02, p=0.02 for the adult BMI PGS, and beta: 0.01, 95%CI: 0.00,0.02, p=0.008 for the childhood body size PGS), and of the mother’s childhood body size PGS with the father’s ADHD PGS (beta: 0.01, 95%CI: 0.00,0.02, p=0.03). We did not detect associations with pairs of other polygenic scores, which may be due to insufficient statistical power.”

      And the relevant part of the discussion section now reads:

      “We found some genomic evidence of assortative mating for BMI, and cross-trait assortative mating between BMI and ADHD, but not between other traits. However, associations between polygenic scores, which only capture some of the genetic variation associated with these phenotypes, may not capture the full extent of genetic assortment on these traits.”

      5) Are the actual phenotypes (BMI, depression or ADHD) correlated between the parents? If so, would this not suffice as evidence of cross-trait assortative mating? It is known that the genetic correlation between parents as a result of assortative mating is a function of the correlation in their phenotypes and the heritabilities underlying the two traits (e.g., see Yengo and Visscher 2018). An alternative way to estimate the genetic correlation between parents without using PGS (which is noisy and therefore underpowered) would be to use the phenotypic correlation and heritability estimated using GREML or LDSC. Perhaps this is outside the scope of the paper but I would like to hear the authors' thoughts on this.

      Associations between maternal and paternal phenotypes are consistent with a degree of assortative mating (shown below). These results have been added to Appendix 1 - Table 5, which also shows associations between maternal and paternal polygenic scores, and the methods and results have been updated accordingly (see quoted text in response to the comment above). For comparability, both sets of results are based on regression models adjusting for the mother’s and father’s ancestry PCs and genotyping covariates. We agree that analysis of assortative mating using GREML or LDSC is out of scope for this paper. As noted above, we have updated the discussion to acknowledge the limitations of the approach taken:

      ‘We found some genomic evidence of assortative mating for BMI, and cross-trait assortative mating between BMI and ADHD, but not between other traits. However, associations between polygenic scores, which only capture some of the genetic variation associated with these phenotypes, may not capture the full extent of genetic assortment on these traits.’

      6) It would be helpful to include power calculations for the MR-Egger intercept estimates.

      As with our response to the comments above, post-hoc power calculations are redundant, as all the information about the power of our analysis, including the MR-Egger estimates, is indicated by the standard errors and confidence intervals. MR-Egger is less precise than other estimators, as is made clear from the wide confidence intervals reported in the relevant tables (Appendix 1 - Tables 8 and 9). However, we have now updated the discussion to give more weight to this as a limitation. The discussion of pleiotropy in the final paragraph of the discussion now reads:

      ‘While robustness checks found little evidence of pleiotropy, these methods rely on assumptions. Moreover, MR-Egger is known to give imprecise estimates (Burgess and Thompson 2017), and confidence intervals from MR-Egger models were wide. Thus, pleiotropy cannot be ruled out.’

      Similarly, we have updated the relevant line of the results section, which now reads:

      ‘MR-Egger models found little evidence of horizontal pleiotropy, although MR-Egger estimates were imprecise (Appendix 1 - Tables 8 and 9).’

      7) Finally, what is the correlation between PGS and genetic PCs/geography in their sample? A correlation might provide evidence to support the point that classic MR effects are inflated due to stratification.

      Figures presenting the association of the child’s BMI polygenic scores and their PCs have been added to the supplementary information as Appendix 1 - Figure 2 and Appendix 1 - Figure 3. Consistent with an influence of residual stratification, a regression of the child’s BMI polygenic scores against their ancestry PCs (adjusting for genotyping centre and chip) found that 7 of the 20 PCs were associated at p<0.05 with the adult BMI PGS, and 8 of 20 with the childhood body size PGS (under the null hypothesis, we would expect one association in each case). When parental polygenic scores were added to the models, these associations attenuated towards the null.
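
      The expectation of one association per score under the null follows from 20 tests at a 0.05 threshold. As a rough illustration of how unusual 7 or 8 associations would be, the Python sketch below computes the binomial tail probability. It assumes the 20 tests are independent, which is an assumption made here for illustration and is not a claim from the analysis itself. The function name `binom_tail` is hypothetical.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_pcs, threshold = 20, 0.05
expected = n_pcs * threshold             # expected false positives under the null
p_seven = binom_tail(n_pcs, threshold, 7)  # adult BMI PGS: 7 of 20
p_eight = binom_tail(n_pcs, threshold, 8)  # childhood body size PGS: 8 of 20

print(f"expected under the null: {expected:.0f}")
print(f"P(>=7 of 20) = {p_seven:.1e}, P(>=8 of 20) = {p_eight:.1e}")
```

      Under the independence assumption both tail probabilities are well below 1 in 10,000, in line with the interpretation that the observed associations reflect residual stratification and not chance.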

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript seeks to identify the mechanism underlying priority effects in a plant-microbe-pollinator model system and to explore its evolutionary and functional consequences. The manuscript first documents alternative community states in the wild: flowers tend to be strongly dominated by either bacteria or yeast but not both. Then lab experiments are used to show that bacteria lower the nectar pH, which inhibits yeast - thereby identifying a mechanism for the observed priority effect. The authors then perform an experimental evolution experiment which shows that yeast can evolve tolerance to a lower pH. Finally, the authors show that low-pH nectar reduces pollinator consumption, suggesting a functional impact on the plant-pollinator system. Together, these multiple lines of evidence build a strong case that pH has far-reaching effects on the microbial community and beyond.

      The paper is notable for the diverse approaches taken, including field observations, lab microbial competition and evolution experiments, genome resequencing of evolved strains, and field experiments with artificial flowers and nectar. This breadth can sometimes seem a bit overwhelming. The model system has been well developed by this group and is simple enough to dissect but also relevant and realistic. Whether the mechanism and interactions observed in this system can be extrapolated to other systems remains to be seen. The experimental design is generally sound. In terms of methods, the abundance of bacteria and yeast is measured using colony counts, and given that most microbes are uncultivable, it is important to show that these colony counts reflect true cell abundance in the nectar.

      We have revised the text to address the relationship between cell counts and colony counts with nectar microbes. Specifically, we point out that our previous work (Peay et al. 2012) established a close correlation between CFUs and cell densities (r² = 0.76) for six species of nectar yeasts isolated from D. aurantiacus nectar at Jasper Ridge, including M. reukaufii.

      As for A. nectaris, we used a flow cytometric sorting technique to examine the relationship between cell density and CFU (figure supplement 1). This result should be viewed as preliminary given the low level of replication, but this relationship also appears to be linear, as shown below, indicating that colony counts likely reflect true cell abundance of this species in nectar.

      It remains uncertain how closely CFU reflects total cell abundance of the entire bacterial and fungal community in nectar. However, a close association is possible and may even be likely given the data above, which show a close correlation between CFU and total cell count for several yeast species and A. nectaris, the species our data indicate to be dominant in nectar.

      We have added the above points in the manuscript (lines 263-264, 938-932).

      The genome resequencing to identify pH-driven mutations is, in my mind, the least connected and developed part of the manuscript, and could be removed to sharpen and shorten the manuscript.

      We appreciate this perspective. However, given the disagreement between this perspective and reviewer 2’s, which asks for a more expanded section, we have decided to add a few additional lines (lines 628-637), briefly expanding on the genomic differences between strains evolved in bacteria-conditioned nectar and those evolved in low-pH nectar.

      Overall, I think the authors achieve their aims of identifying a mechanism (pH) for the priority effect of early-colonizing bacteria on later-arriving yeast. The evolution and pollinator experiments show that pH has the potential for broader effects too. It is surprising that the authors do not discuss the inverse priority effect of early-arriving yeast on later-arriving bacteria, beyond a supplemental figure. Understandably this part of the story may warrant a separate manuscript.

      We would like to point out that, in our original manuscript, we did discuss the inverse priority effects, referring to relevant findings that we previously reported (Tucker and Fukami 2014, Dhami et al. 2016 and 2018, Vannette and Fukami 2018). Specifically, we wrote that: “when yeast arrive first to nectar, they deplete nutrients such as amino acids and limit subsequent bacterial growth, thereby avoiding pH-driven suppression that would happen if bacteria were initially more abundant (Tucker and Fukami 2014; Vannette and Fukami 2018)” (lines 385-388). However, we now realize that this brief mention of the inverse priority effects was not sufficiently linked to our motivation for focusing mainly on the priority effects of bacteria on yeast in the present paper. Accordingly, we added the following sentences: “Since our previous papers sought to elucidate priority effects of early-arriving yeast, here we focus primarily on the other side of the priority effects, where initial dominance of bacteria inhibits yeast growth.” (lines 398-401).

      I anticipate this paper will have a significant impact because it is a nice model for how one might identify and validate a mechanism for community-level interactions. I suspect it will be cited as a rare example of the mechanistic basis of priority effects, even across many systems (not just pollinator-microbe systems). It illustrates nicely a more general ecological phenomenon and is presented in a way that is accessible to a broader audience.

      Thank you for this positive assessment.

      Reviewer #2 (Public Review):

      The manuscript "pH as an eco-evolutionary driver of priority effects" by Chappell et al illustrates how a single driver-microbial-induced pH change can affect multiple levels of species interactions including microbial community structure, microbial evolutionary change, and hummingbird nectar consumption (potentially influencing both microbial dispersal and plant reproduction). It is an elegant study with different interacting parts: from laboratory to field experiments addressing mechanism, condition, evolution, and functional consequences. It will likely be of interest to a wide audience and has implications for microbial, plant, and animal ecology and evolution.

      This is a well-written manuscript, with generally clear and informative figures. It represents a large body and variety of work that is novel and relevant (all major strengths).

      We appreciate this positive assessment.

      Overall, the authors' claims and conclusions are justified by the data. There are a few things that could be addressed in more detail in the manuscript. The most important weakness in terms of lack of information/discussion is that it looks like there are just as many or more genomic differences between the bacterial-conditioned evolved strains and the low-pH evolved strains than there are between these and the normal nectar media evolved strains. I don't think this negates the main conclusion that pH is the primary driver of priority effects in this system, but it does open the question of what you are missing when you focus only on pH. I would like to see a discussion of the differences between bacteria-conditioned vs. low-pH evolved strains.

      We agree with the reviewer and have included an expanded discussion in the revised manuscript [lines 628-637]. Specifically, to show overall genomic variation between treatments, we calculated genome-wide Fst comparing the various nectar conditions. We found that Fst was 0.0013, 0.0014, and 0.0015 for the low-pH vs. normal, low pH vs. bacteria-conditioned, and bacteria-conditioned vs. normal comparisons, respectively. The similarity between all treatments suggests that the differences between bacteria-conditioned and low pH are comparable to each treatment compared to normal. This result highlights that, although our phenotypic data suggest alterations to pH as the most important factor for this priority effect, it still may be one of many affecting the coevolutionary dynamics of wild yeast in the microbial communities they are part of. In the full community context in which these microbes grow in the field, multi-species interactions, environmental microclimates, etc. likely also play a role in rapid adaptation of these microbes which was not investigated in the current study.

      Based on this overall picture, we have included additional discussion focusing on the effect of pH on evolution of stronger resistance to priority effects. We compared genomic differences between bacteria-conditioned and low-pH evolved strains, drawing the reader’s attention to specific differences in source data 14-15. Loci that varied between the low pH and bacteria-conditioned treatments occurred in genes associated with protein folding, amino acid biosynthesis, and metabolism.

      Reviewer #3 (Public Review):

      This work seeks to identify a common factor governing priority effects, including mechanism, condition, evolution, and functional consequences. It is suggested that environmental pH is the main factor that explains various aspects of priority effects across levels of biological organization. Building upon this well-studied nectar microbiome system, it is suggested that pH-mediated priority effects give rise to bacterial and yeast dominance as alternative community states. Furthermore, pH determines both the strengths and limits of priority effects through rapid evolution, with functional consequences for the host plant's reproduction. These data contribute to ongoing discussions of deterministic and stochastic drivers of community assembly processes.

      Strengths:

      Provides multiple lines of field and laboratory evidence to show that pH is the main factor shaping priority effects in the nectar microbiome. Field surveys characterize the distribution of microbial communities with flowers frequently dominated by either bacteria or yeast, suggesting that inhibitory priority effects explain these patterns. Microcosm experiments showed that A. nectaris (bacteria) showed negative inhibitory priority effects against M. reukaufii (yeast). Furthermore, high densities of bacteria were correlated with lower pH potentially due to bacteria-induced reduction in nectar pH. Experimental evolution showed that yeast evolved in low-pH and bacteria-conditioned treatments were less affected by priority effects as compared to ancestral yeast populations. This potentially explains the variation of bacteria-dominated flowers observed in the field, as yeast rapidly evolves resistance to bacterial priority effects. Genome sequencing further reveals that phenotypic changes in low-pH and bacteria-conditioned nectar treatments corresponded to genomic variation. Lastly, a field experiment showed that low nectar pH reduced flower visitation by hummingbirds. pH not only affected microbial priority effects but also has functional consequences for host plants.

      We appreciate this positive assessment.

      Weaknesses:

      The conclusions of this paper are generally well-supported by the data, but some aspects of the experiments and analysis need to be clarified and expanded.

      The authors imply that in their field surveys flowers were frequently dominated by bacteria or yeast, but rarely together. The authors argue that the distributional patterns of bacteria and yeast are therefore indicative of alternative states. In each of the 12 sites, 96 flowers were sampled for nectar microbes. However, it's unclear to what degree the spatial proximity of flowers within each of the sampled sites biased the observed distribution patterns. Furthermore, seasonal patterns may also influence microbial distribution patterns, especially in the case of co-dominated flowers. Temperature and moisture might influence the dominance patterns of bacteria and yeast.

      We agree that these factors could potentially explain the presented results. Accordingly, we conducted spatial and seasonal analyses of the data, which we detail below and include in two new paragraphs in the manuscript [lines 290-309].

      First, to determine whether spatial proximity influenced yeast and bacterial CFUs, we regressed the geographic distance between all possible pairs of plants to the difference in bacterial or fungal abundance between the paired plants. If plant location affected microbial abundance, one should see a positive relationship between distance and the difference in microbial abundance between a given pair of plants: a pair of plants that were more distantly located from each other should be, on average, more different in microbial abundance. Contrary to this expectation, we found no significant relationship between distance and the difference in bacterial colonization (A, p=0.07, R²=0.0003) and a small negative association between distance and the difference in fungal colonization (B, p<0.05, R²=0.004). Thus, there was no obvious overall spatial pattern in whether flowers were dominated by yeast or bacteria.
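
      The pairwise logic described above can be sketched as follows. This is an illustration only: the planar Euclidean distance, the function name `distance_vs_difference`, and the toy data are assumptions, and the response does not state which distance measure or software the authors used. Because pairs share plants, the points are not independent, so a permutation approach such as a Mantel test is commonly used to obtain p-values; the sketch reports only the slope and R².

```python
from itertools import combinations
from math import hypot


def distance_vs_difference(coords, abundance):
    """Regress |abundance_i - abundance_j| on the distance between plants i and j.

    Returns (slope, r_squared) from ordinary least squares over all pairs.
    """
    dists, diffs = [], []
    for i, j in combinations(range(len(coords)), 2):
        dists.append(hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1]))
        diffs.append(abs(abundance[i] - abundance[j]))

    n = len(dists)
    mean_d, mean_y = sum(dists) / n, sum(diffs) / n
    sxx = sum((d - mean_d) ** 2 for d in dists)
    syy = sum((y - mean_y) ** 2 for y in diffs)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(dists, diffs))

    slope = sxy / sxx
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


# Hypothetical toy data: four plants on a line, abundance rising with position.
slope, r2 = distance_vs_difference(
    [(0, 0), (1, 0), (2, 0), (3, 0)], [0.0, 1.0, 2.0, 3.0]
)
print(slope, r2)
```

      In the toy data, abundance differences grow exactly with distance, which is the positive relationship the authors say they would expect if location mattered and did not observe.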

      Next, to determine whether climatic factors or seasonality affected the colonization of bacteria and yeast per plant, we used a linear mixed model predicting the average bacteria and yeast density per plant from average annual temperature, temperature seasonality, and annual precipitation at each site, the date the site was sampled, and the site location and plant as nested random effects. We found that none of these variables were significantly associated with the density of bacteria and yeast in each plant.

      To look at seasonality, we also re-ordered Fig 2C, which shows the abundance of bacteria- and yeast-dominated flowers at each site, so that the sites are now listed in order of sampling dates. In this re-ordered figure, there is no obvious trend in the number of flowers dominated by yeast throughout the period sampled (6/23 to 7/9), giving additional indication that seasonality was unlikely to affect the results.

      Additionally, sampling date does not seem to strongly predict bacterial or fungal density within each flower when plotted.

      These additional analyses, now included (figure supplements 2-4) and described (lines 290-309) in the manuscript, indicate that the observed microbial distribution patterns are unlikely to have been strongly influenced by spatial proximity, temperature, moisture, or seasonality, reinforcing the possibility that the distribution patterns instead indicate bacterial and yeast dominance as alternative stable states.

      The authors exposed yeast to nectar treatments varying in pH levels. Using experimental evolution approaches, the authors determined that yeast grown in low pH nectar treatments were more resistant to priority effects by bacteria. The metric used to determine the bacteria's priority effect strength on yeast does not seem to take into account factors that limit growth, such as the environmental carrying capacity. In addition, yeast evolves in normal (pH =6) and low pH (3) nectar treatments, but it's unclear how resistance differs across a range of pH levels (ranging from low to high pH) and affects the cost of yeast resistance to bacteria priority effects. The cost of resistance may influence yeast life-history traits.

      The strength of bacterial priority effects on yeast was calculated using the metric we previously published in Vannette and Fukami (2014): PE = log(BY/(-Y)) - log(YB/(Y-)), where BY and YB represent the final yeast density when early arrival (day 0 of the experiment) was by bacteria or yeast, followed by late arrival by yeast or bacteria (day 2), respectively, and -Y and Y- represent the final density of yeast in monoculture when they were introduced late or early, respectively. This metric does not incorporate carrying capacity. However, it does compare how each microbial species grows alone, relative to growth before or after a competitor. In this way, our metric compares environmental differences between treatments while also taking into account growth differences between strains.
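
      As an illustration, the metric above can be written as a small function. The argument names and the example densities are hypothetical, and the base-10 logarithm is an assumption, since the response writes only "log":

```python
from math import log10


def priority_effect(by, late_mono, yb, early_mono):
    """PE = log(BY / -Y) - log(YB / Y-), with all arguments as final yeast densities.

    by         -- yeast density when bacteria arrived first (BY)
    late_mono  -- yeast monoculture density when yeast were introduced late (-Y)
    yb         -- yeast density when yeast arrived first (YB)
    early_mono -- yeast monoculture density when yeast were introduced early (Y-)
    """
    return log10(by / late_mono) - log10(yb / early_mono)


# Hypothetical densities: early-arriving bacteria cut yeast density 100-fold
# relative to late monoculture, while early-arriving yeast are unaffected.
pe = priority_effect(by=1e2, late_mono=1e4, yb=1e4, early_mono=1e4)
print(pe)
```

      Under this form, a more negative value means yeast were suppressed more strongly when bacteria arrived first than when yeast arrived first, each relative to the matching monoculture.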

      Here we also present additional growth data to address the reviewer’s point about carrying capacity. Our experiments that compared ancestral and evolved yeast were conducted over the course of two days of growth. In preliminary monoculture growth experiments of each evolved strain, we found that yeast populations did reach carrying capacity over the course of the two-day experiment and population size declined or stayed constant after three and four days of growth.

      However, we found no significant difference in monoculture growth between the ancestral strains and any of the evolved strains, as shown in Figure supplement 12B. This lack of significant difference in monoculture suggests that differences in intrinsic growth rate do not fully explain the priority effects results we present. Instead, differences in growth were specific to yeast’s response to early arrival by bacteria.

      We also appreciate the reviewer’s comment about how yeast evolves resistance across a range of pH levels, as well as the effect of pH on yeast life-history traits. In fact, reviewer #2 pointed out an interesting trade-off in life history traits between growth and resistance to priority effects that we now include in the discussion (lines 535-551) as well as a figure in the manuscript (Figure 8).

    1. Author Response

      Reviewer #2 (Public Review):

      Schrecker, Castaneda and colleagues present cryo-EM structures of RFC-PCNA bound to a 3' ss/dsDNA junction or nicked DNA, stabilized by the slowly hydrolyzable ATP analogue ATPγS. They discover that PCNA can adopt an open form that is planar, different from previous models for the loading of a sliding clamp. The authors also report a structure with closed PCNA, supporting the notion that closure of the sliding clamp does not require ATP hydrolysis. The structures explain how DNA can be threaded laterally through a gap in the PCNA trimer, as this process is supported by partial melting of the DNA prior to insertion. The authors also visualise and assign a function to the N-terminal domain in the Rfc1 subunit of the clamp loader, which they find modulates PCNA loading at the replication forks, in turn required for processive synthesis and ligation of Okazaki fragments.

      This work is extremely well done, with several structures with resolutions better than 3 Å, which is a significant achievement given the dynamic nature of the PCNA ring loading process. To investigate the role of the N-terminal domain of Rfc1 in PCNA loading, the authors use in vitro reconstitution of the entire DNA replication reaction, which is a powerful method to identify specific defects in Okazaki fragment synthesis and ligation.

      Important issues

      1. Figure 3B,D,F. I would find them much more informative if the authors showed the overlay between atomic model and cryo-EM density in the main figure. If the figure becomes too busy, the authors could decide to just add additional panels with the overlay as well as the atomic models alone. I do not think that showing segmented density for the DNA alone, as done in Figure 6C, is sufficient. Also including the density for e.g. residues Trp638 and Phe582 seems important.

      We thank the reviewer for the suggestion. However, we have been unable to establish a way to show the density for both the protein and DNA in a meaningful manner due to the large number of atoms in the fields of view. For an example, please see Figure 1, which corresponds to Figure 3H. To aid the reader, we have revised several of the Figures and Figure Supplements to include density for the DNA.

      Consistent with our structures, recent work from the Kelch group has identified Trp638 and Phe582 as facilitating DNA base flipping (Gaubitz et al., 2022a). Despite the role in base flipping, no growth defects were observed in cells in which either of these residues were mutated and thus their functional role and the role of DNA base-flipping remains unclear.

      2. Cryo-EM sample preparation included substoichiometric RPA, which has been shown to promote DNA loading of PCNA by RFC. Would the authors expect a subset of PCNA-RFC-DNA particles to contain RPA as well? The glycerol gradient gel indicates that, at least in fraction 5, a complex might exist. If the authors think that the particles analyzed cannot contain RPA, it would be useful to mention this.

      We have no evidence to suggest that RPA cannot be present in the imaged particles. We have revised the text (lines 150 - 152) to clarify that while RPA was present in the sample, we did not observe any density that could not be assigned to either DNA, RFC or PCNA. We therefore suggest that RPA does not interact with the complex in a stable manner.

      3. Published kinetic data indicate that ATP hydrolysis occurs before clamp closure. To incorporate this notion in their model, the authors suggest that ATP hydrolysis might promote PCNA closure by disrupting the planar RFC:PCNA interaction surface and hence the dynamic interaction of PCNA with Rfc2 and -5 in the open state. In addition, ATP hydrolysis promotes RFC disengagement from PCNA-DNA by reverting from a planar to an out-of-plane state. This model appears reasonable and nicely combines published data with the new findings reported by the authors. However, the model is oversimplified in Figure 6, where the only depicted effect of ATP hydrolysis is RFC release. Perhaps the authors could use the figure caption to acknowledge that ATP hydrolysis likely still has a role in facilitating PCNA closure.

      We have revised Figure 6 to show that ATP hydrolysis may occur either before or after ring closure.

      4. Can the authors explain what steps should be taken to describe PCNA loading by RFC in conditions where ATP hydrolysis is permitted? How would such experiments further inform the molecular mechanism for the loading of the PCNA clamp?

      As highlighted in point 3 above and by the other reviewers, ATP and ATPγS may alter the behavior and energetic landscape of RFC. In our studies, ATPγS was added to trap the complex in a pre-hydrolysis state in which all components are assembled. We have added a section to the discussion noting the potential differences and highlighting the need for future studies to better elucidate the role of nucleotide hydrolysis. To achieve a hydrolysis-competent complex, one could apply time-resolved cryo-EM approaches where the complex is formed on the grids and quickly vitrified. Such an approach, particularly if coupled with stopped-flow kinetic analyses, may provide additional insights into the kinetics of loading of PCNA onto DNA by RFC.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      The brain-machine interface used in this study differs from typical BMIs in that it's not intended to give subjects voluntary control over their environment. However, it is possible that rats may become aware of their ability to manipulate trial start times using their neural activity. Is there any evidence that the time required to initiate trials on high-coherence or low-coherence trials decreases with experience?

      This is a great question. First, we designed the experiment to avoid this possibility. Rats were experienced on the sequence of the automatic maze both pre and post implantation (totaling weeks of pre-training and habituation). As such, the majority of the trials ever experienced by the rat were not controlled by their neural activity. During BMI experimentation, only 10% of trials were triggered during high coherence states and 10% during low coherence states, leaving ~80% of trials not controlled by their neural activity. We also implemented a pseudo-randomized trial sequence. Taken together, these design choices were intended to avoid the possibility that rats would actively use their neural activity to control the maze.

      Second, we had a similar question when collecting data for this manuscript and so we conducted a pilot experiment. We took 3 rats from experiment #1 (after its completion) and we required them to perform “forced-runs” over the course of 3-4 days, a task where rats navigate to a reward zone and are rewarded with a chocolate pellet. The trajectory on “forced-runs” is predetermined and rats were always rewarded for navigating along the predetermined route. Every trial was initiated by strong mPFC-hippocampal theta coherence. We were curious as to whether time-to-trial-onset would decrease if we repeatedly paired trial onset to strong mPFC-hippocampal theta coherence. 1 out of 3 rats (rat 21-35) showed a significant correlation between time-to-trial onset and trial number, indicating that our threshold for strong mPFC-hippocampal theta coherence was being met more quickly with experience (Figure R1A). When looking over sessions and rats, there was considerable variability in the magnitude of this correlation and sometimes even the direction (Figure R1B). As such, the degree to which rat 21-35 was aware of controlling the environment by reaching strong mPFC-hippocampal theta coherence is unclear, but this question requires future experimentation.

      Author response image 1.

      Strong mPFC-hippocampal theta coherence was used to control trial onset for the entirety of forced-navigation sessions. Time-to-trial onset is a measurement of how long it took for strong coherence to be met. A) Time-to-trial onset was averaged across sessions for each rat, then plotted as a function of trial number (within-session experience on the forced-runs task). Rat 21-35 showed a significant negative correlation between time-to-trial onset and trial number, indicating that time-to-coherence reduced with experience. The rest of the rats did not display this effect. B) Correlation between trial-onset and trial number (y-axis; see A) across sessions (x-axis). A majority of sessions showed a negative correlation between time-to-trial onset and trial number, like what was seen in (A), but the magnitude and sometimes direction of this effect varied considerably even within an animal.
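
      The per-rat relationship in panel A can be illustrated with a plain correlation between trial number and time-to-trial onset. This is a minimal sketch: the response does not say which correlation statistic was used, so Pearson's r is an assumption, and the function name and data values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Hypothetical session: time-to-trial onset (s) shrinking steadily across trials.
trial_number = [1, 2, 3, 4]
time_to_onset = [8.0, 6.0, 4.0, 2.0]

r = pearson_r(trial_number, time_to_onset)
print(r)
```

      A negative r corresponds to the pattern described for rat 21-35, where the coherence threshold was reached more quickly as trials accumulated.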

      Is there any evidence that rats display better performance on trials with random delays in which HPC-PFC coherence was naturally elevated?

      This question is now addressed in Extended Figure 5 and discussed in the section titled “strong prefrontal-hippocampal theta coherence leads to correct choices on a spatial working memory task”.

      The introduction frames this study as a test of the "communication through coherence" hypothesis. In its strongest form, this hypothesis states that oscillatory synchronization is a pre-requisite for inter-areal communication, i.e. if two areas are not synchronized, they cannot transfer information. Recent experimental evidence shows this relationship is more likely inverted: coherence is a consequence of inter-areal interactions, rather than a cause. See Schneider et al. (DOI: 10.1016/j.neuron.2021.09.037) and Vinck et al. (10.1016/j.neuron.2023.03.015) for a more in-depth explanation of this distinction. The authors should expand their treatment of this hypothesis in light of these findings.

      Our introduction and discussions have sections dedicated to these studies now.

      Figure 6 - It would be much more intuitive to use the labels "Rat 1", "Rat 2", and "Rat 3"; the "21-4X" identifiers are confusing.

      This was corrected in the paper.

      Figure 6C - The sub-plots within this figure are rather small and difficult to interpret. The figure would be easier to parse if the data were presented as a heatmap of the ratio of theta power during blue vs. red stim, with each pixel corresponding to one channel.

      This suggestion was implemented in the paper. See Fig 6C. Extended Fig. 8 now shows the power spectra as a function of recording shank and channel.

      Ext. Figure 2B - What happens during an acquisition failure? Instead of "Amount of LFP data," consider using "Buffer size".

      Corrected.

      Ext. Figure 2D-E - Instead of "Amount of data," consider using "Window size"

      Referred to as buffer size.

      Ext. Figure 2E - y-axis should extend down to 4 Hz. Are all of the last four values exactly at 8 Hz?

      Yes. Values plateau at 8Hz. These data represent an average over ~50 samples.

      Ext. Figure 2F - consider moving this before D/E, since those panels are summaries of panel F

      Corrected.

      Ext. Figure 4A - ANOVA tells you that accuracy is impacted by delay duration, but not what that impact is. A post-hoc test is required to show that long delays lead to lower accuracy than short ones. Alternatively, one could compute the correlation between delay duration and proportion correct for each mouse, and look for significant negative values.

      We included supplemental analyses in Extended Fig. 4

      Reviewer #2 (Recommendations For The Authors):

      The authors should replace terms that suggest a causal relationship between PFC-HPC synchrony and behavior, such as 'leads to', 'biases', and 'enhances' with more neutral terms.

      Causal implications were toned down and wherever “leads” or “led” remains, we specifically mean in the context of coherence being detected prior to a choice being made.

      The rationale for the analysis described in the paragraph starting on line 324, and how it fits with the preceding results, was not clear to me. The authors also write at the start of this paragraph "Given that mPFC-hippocampal theta coherence fluctuated in a periodical manner (Extended Fig. 5B)", but this figure only shows example data from 2 trials.

      The reviewer is correct. While we point towards 3 examples in the manuscript now, we focused this section on the autocorrelation analysis, which did not support our observation as we noticed a rather linear decay in correlation over time. As such, the periodicity observed was almost certainly a consequence of overlapping data in the epochs used to calculate coherence rather than intrinsic periodicity.

      Shortly after the start of the results section (line 112), the authors go into a very detailed description of how they validated their BMI without first describing what the BMI actually does. This made this and the subsequent paragraphs difficult to follow. I suggest the authors start with a general description of the BMI (and the general experiment) before going into the details.

      Corrected. See first paragraph of “Development of a closed-loop…”.

      In Figure 2C, as expected, around the onset of 'high' coherence trials, there is an increase in theta coherence but this appears to be very transient. However, it is unclear what the heatmap represents: is it a single trial, single session, an average across animals, or something else? In Figure 3F, however, the increase appears to be much more sustained.

      The sample size was rats for every panel in this figure. This was clarified at the end of Fig. 3.

      In Figure 2D, it was not clear to me what units of measurement are used when the averages and error bars are calculated. What is the 'n' here? Animals or sessions? This should be made clear in this figure as well as in other figures.

      The sample size is rats. This is now clarified at the end of Fig 2.

      Describing the study of Jones and Wilson (2005), the authors write: "While foundational, this study treated the dependent variable (choice accuracy) as independent to test the effect of choice outcome on task performance." (line 83) It was not clear to me what is meant by "dependent" and "independent" here. Explaining this more clearly might clarify how the authors' study goes beyond this and other previous studies.

      The reviewer is correct. A discussion on independent/dependent variables in the context of rationale for our experiment was removed.

      Reviewer #3 (Recommendations For The Authors):

      As explained in the public review, my comments mainly concern the interpretation of the experimental paradigm and its link with previous findings. I think modifying these in order to target the specific advance allowed by the paradigm would really improve the match between the experimental and analytical data, which are very solid, and the authors' conclusions.

      Concerning the paradigm, I recommend that the authors focus more on their novel ability to clearly dissociate the functional role of theta coherence prior to the choice as opposed to induced by the choice. Currently, they explain by contrasting previous studies based on dependent variables whereas their approach uses an independent variable. I was a bit confused by this, particularly because the task variable is not really independent given that it's based on a brain-driven loop. Since theta coherence remains correlated with many other neurophysiological variables, the results cannot go beyond showing that leading up to the decision it correlates with good choice accuracy, without providing evidence that it is theta coherence itself that enhances this accuracy as they suggest in lines 93-94.

      The reviewer is correct. A discussion on independent/dependent variables in the context of rationale for our experiment was removed.

      Regarding previous results with muscimol inactivation, I recommend that the authors expand their discussion on this point. I think that their correlative data is not sufficient to conclude as they do that despite "these structures being deemed unnecessary" (based on causal muscimol experiments), they "can still contribute rather significantly" since their findings do not show a contribution, merely a correlation. This extra discussion could include possible explanations of the apparent, and thought-provoking discrepancies that they uncover such as: theta coherence may be a correlate of good accuracy without an underlying causal relation, theta coherence may always correlate with good accuracy but only be causally important in some tasks related to spatial working memory or, since muscimol experiments leave the brain time to adapt to the inactivation, redundancy between brain areas may mask their implication in the physiological context in certain tasks (see Goshen et al 2011).

      The second paragraph of the discussion is now dedicated to this.

      Possible further analysis :

      • In Extended 4A the authors show that performance drops with delay duration. It would be very interesting to see this graph with the high coherence / low coherence / yoked trials to see if the theta coherence is most important for longer trials for example.

      This is a great suggestion. Due to 10% of trials being triggered by high coherence states, our sample size precludes a robust analysis as suggested. Given that we found an enhancement effect on a task with minimal spatial working memory requirements (Fig. 4), it seems that coherence may be a general benefit or consequence of choice processes. Nonetheless, this remains an important question to address in a future study.

      • Figure 6: The authors explain in the text that although the effect of stimulation of VMT is variable, overall VMT activation increased PFC-HPC coherence. I think in the figure the results are only shown for one rat and session per panel. It would be interesting to add a figure including their whole data set to show the overall effect as well as the variability.

      The reviewer is correct, and this comment prompted a significant addition of detail to the manuscript. We have added an extended figure (Ext. Fig. 9) showing our VMT stimulation recording sessions. We originally did not include these because we were performing a parameter search to understand whether VMT stimulation could increase mPFC-hippocampal theta coherence. The results section was expanded accordingly.

      Changes to writing / figures :

      • The paper by Eliav et al, 2018 is cited to illustrate the universality of coupling between hippocampal rhythms and spikes whereas the main finding of this paper is that spikes lock to non-rhythmic LFP in the bat hippocampus. It seems inappropriate to cite this paper in the sentence on line 65.

      We agree with the reviewer and this citation was removed.

      • Line 180 when explaining the protocol, it would help comprehension if the authors clearly stated that "trial initiation" means opening the door to allow the rat to make its choice. I was initially unfamiliar with the paradigm and didn't figure this out immediately.

      We added a description to the second paragraph of our first results section.

      • Lines 324 and following: the analysis shows that there is a slow decay over around 2s of the theta coherence but not that it is periodical (as in regularly occurring in time), this would require the auto-correlation to show another bump at the timescale corresponding to the period of the signal. I recommend the authors use a different terminology.

      This comment is now addressed above in our response to Reviewer #2.

      • Lines 344: I am not sure why the stable theta coherence levels during the fixed delay phase show that the link with task performance is "through mechanisms specific to choice". Could the authors elaborate on this?

      We elaborated on this point further at the end of “Trials initiated by strong prefrontal-hippocampal theta coherence are characterized by prominent prefrontal theta rhythms and heightened pre-choice prefrontal-hippocampal synchrony”.

      • Line 85: "independent to test the effect of choice outcome on task performance." I think there is a typo here and "choice outcome" should be "theta coherence".

      The sentence was removed in the updated draft.

    1. Author Response

      Reviewer 1 (Public Review):

      To me, the strengths of the paper are predominantly in the experimental work, there's a huge amount of data generated through mutagenesis, screening, and DMS. This is likely to constitute a valuable dataset for future work.

      We are grateful to the reviewer for their generous comment.

      Scientifically, I think what is perhaps missing, and I don't want this to be misconstrued as a request for additional work, is a deeper analysis of the structural and dynamic molecular basis for the observations. In some ways, the ML is used to replace this and I think it doesn't do as good a job. It is clear for example that there are common mechanisms underpinning the allostery between these proteins, but they are left hanging to some degree. It should be possible to work out what these are with further biophysical analysis…. Actually testing that hypothesis experimentally/computationally would be nice (rather than relying on inference from ML).

      We agree with the reviewer that this study should motivate a deeper biophysical analysis of molecular mechanisms. However, in our view, the ML portion of our work was not intended as a replacement for mechanistic analysis, nor could it serve as one. We treated ML as a hypothesis-generating tool. We hypothesized that distant homologs are likely to have similar allosteric mechanisms which may not be evident from visual analysis of DMS maps. We used ML to (a) extract underlying similarities between homologs and (b) make cross predictions across homologs. In fact, the chief conclusion of our work is that while common patterns exist across homologs, the molecular details differ. ML provides tantalizing evidence to this effect. The conclusive evidence will require, as the reviewer rightly suggests, detailed experimental or molecular dynamics characterization. Along this line, we note that we have recently reported our atomistic MD analysis of allostery hotspots in TetR (JACS, 2022, 144, 10870). See ref. 41.

      Changes to manuscript:<br /> “Detailed biophysical or molecular dynamics characterization will be required to further validate our conclusions(38).”

      Reviewer 3 (Public Review):

      However - at least in the manuscript's present form - the paper suffers from key conceptual difficulties and a lack of rigor in data analysis that substantially limits one's confidence in the authors' interpretations.

      We hope the responses below address and allay the reviewer’s concerns.

      A key conceptual challenge shaping the interpretation of this work lies in the definition of allostery, and allosteric hotspot. The authors define allosteric mutations as those that abrogate the response of a given aTF to a small molecule effector (inducer). Thus, the results focus on mutations that are "allosterically dead". However, this assay would seem to miss other types of allosteric mutations: for example, mutations that enhance the allosteric response to ligand would not be captured, and neither would mutations that more subtly tune the dynamic range between uninduced ("off") and induced ("on") states (without wholesale breaking the observed allostery). Prior work has even indicated the presence of TetR mutations that reverse the activity of the effector, causing it to act as a co-repressor rather than an inducer (Scholz et al (2004) PMID: 15255892). Because the work focuses only on allosterically dead mutations, it is unclear how the outcome of the experiments would change if a broader (and in our view more complete) definition of allostery were considered.

      We agree with the reviewer that mutations that impact allostery manifest in many different ways. Furthermore, the effect size of these mutations runs the full gamut from subtle changes in dynamic range to drastic reversal of function. To unpack allostery further, allostery of aTF can be described, not just by the dynamic range, but by the actual basal and induced expression levels of the reporter, EC50 and Hill coefficient. Given the systemic nature of allostery, a substantial fraction of aTF mutations may have some subtle impact on one or more of these metrics. To take the reviewer’s argument one step further, one would have to accurately quantify the effect size of every single amino acid mutation on all the above properties to have a comprehensive sequence-function landscape of allostery. Needless to say, this is extremely hard! Resolution of small effect sizes is very difficult, even at high sequencing depth. To the best of our knowledge, a heroic effort approaching such comprehensive analysis has been accomplished so far only once (PMID: 3491352).

      Our focus, therefore, was to screen for the strongest phenotypic impact on allostery i.e., loss of function. Mutations leading to loss of function can be relatively easily identified by cell-sorting. Because our goal was to compare hotspots across homologs, we surmised that loss of function mutations, given their strong phenotypic impact, are likely to provide the clearest evidence of whether allosteric hotspots are conserved across remote homologs.

      The reviewer raised the point of activity-reversing mutations. Yes, there are activity-reversing mutations in TetR. However, they represent an insignificant fraction. In the paper cited by the reviewer, there are 15 activity-reversing mutations among 4000 screened. Furthermore, the paper shows that activity reversal in TetR requires two to four mutations, while our library consists exclusively of single amino acid substitutions. For these reasons, we did not screen for activity-reversing mutations. Nonetheless, we agree with the reviewer that screening for activity-reversing mutations across homologs would be very interesting.

      The separation in fluorescence between the uninduced and induced states (the assay dynamic range, or fold induction) varies substantially amongst the four aTF homologs. Most concerningly, the fluorescence distributions for the uninduced and induced populations of the RolR single mutant library overlap almost completely (Figure 1, supplement 1), making it unclear if the authors can truly detect meaningful variation in regulation for this homolog.

      Yes, the reviewer is correct that the fold induction ratio varies among the four aTF homologs. However, we note that such differences are common among natural aTFs. Depending on the native downstream gene regulated by the aTF, some aTFs show higher ligand-induced activation, and others are lower. While this is not a hard and fast rule, aTFs that regulate efflux pumps tend to have higher fold induction than those that regulate metabolic enzymes. In summary, the variation in fold induction among the four aTFs is not a flaw in experimental design nor indicates experimental inconsistency but is instead just an inherent property of protein-DNA interaction strength and the allosteric response of each aTF.

      Among the four aTFs, wildtype RolR has the weakest fold induction (15-fold), which makes sorting the RolR library particularly challenging. To minimize false positives as much as possible, we require that a dead mutant be present in (a) non-fluorescent cells after ligand induction, (b) non-fluorescent cells before ligand induction, and (c) at least two out of the three replicates for both sorts. Additionally, for RolR specifically, we adjusted the non-fluorescent gate to be far more stringent than for the other three aTFs (Fig. 1 – figure supplement 1). Furthermore, we assign residues as allosteric hotspots, not individual dead mutations. This buffers against false strong signals from stray individual dead mutations. Finally, the top interquartile range winnows them to residues showing a strong, consistent dead phenotype. As a result of these “safeguards” we have built in, the number of allosteric hotspots of RolR (57) is comparable to the other three aTFs (51, 53 and 48). This suggests that we are not overestimating the number of hotspots despite the weaker fold induction of RolR. We highlight in a new supplementary figure (Figure 1 – figure supplement 4) that changing the read count threshold from 5X to 10X produces near-identical patterns of mutations, suggesting that our results are also robust to changes in read depth stringency.

      Changes to manuscript: In response to the reviewer's comment, we have added the following sentence.

      “We note that the lower fold induction (dynamic range) of RolR makes it particularly challenging to separate the dead variants from the rest.”

      The methods state that "variants with at least 5 reads in both the presence and absence of ligand in at least two replicates were identified as dead". However, the use of a single threshold (5 reads) to define allosterically dead mutations across all mutations in all four homologs overlooks several important factors:

      Depending on the starting number of reads for a given mutation in the population (which may differ in orders of magnitude), the observation of 5 reads in the gated nonfluorescent region might be highly significant, or not significant at all. Often this is handled by considering a relative enrichment (say in the induced vs uninduced population) rather than a flat threshold across all variants.

      We regret the lack of clarity in our presentation. We wish to better explain the rationale behind our approach. First, we understand the reviewer’s point on considering relative enrichment to define a threshold. This approach works well in DMS experiments involving genetic selections, which is commonly the case, because activity scales well with selection stringency. One can then pick enrichment/depletion relative to the middle of the read count distribution as a measure of gain or loss of function.

      Second, this strategy does not, in practice, work well for cell-sorting screens. While it may be tempting to think of cell sorting as comparably activity-scaled as genetic selections, in reality, the fidelity of fluorescent-activated cell sorters is much lower. Making quantitative claims of activity based on cell sorting enrichment can be risky. It is wiser to treat cell sorting results as yes/no binary i.e., does the mutation disrupt allostery or not. More importantly, the yes/no binary classification suffices for our need to identify if a certain mutation adversely impacts allosteric activity or not.

      Third, the above argument does not imply that all mutations have the same effect size on allostery. They don’t. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.

      Fourth, for a variant to be classified as allosterically dead, it must be present in both the uninduced and induced DNA-bound populations in at least two out of three replicates (four conditions total). This is a stringent criterion for selecting dead variants, resulting in highly consistent regions of importance in the protein even upon varying read count thresholds. To the extent possible, we have minimized the possibility of false positive bleed-through.

      Finally, two separate normalizations were performed on the total sequence reads to be able to draw a common read count threshold 1) between experimental conditions and replicates and 2) across proteins. First, total sequencing reads were normalized to 200k total across all sample conditions (presorted, -inducer, and +inducer) and replicates for each homolog, allowing comparisons within a single protein. Next, reads were normalized again to account for differences in the theoretical size of each protein’s single-mutant library, allowing for comparisons across proteins by drawing a common read count cutoff. For example, total sequencing reads of RolR (4,332 possible mutants) increased by 1.18x relative to MphR (3,667 possible mutants) for a total of 236k reads.
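
      The arithmetic of the two normalization steps can be sketched as follows. This is only an illustration: the function names are hypothetical, and the only values taken from the response are the 200k read total and the two library sizes (4,332 and 3,667 possible mutants).

```python
# Illustrative sketch of the two-step read normalization described above.
# Function and variable names are hypothetical; only the 200k total and the
# library sizes (4,332 for RolR, 3,667 for MphR) come from the text.

BASE_READS = 200_000  # per-sample read total after the first normalization


def scale_counts(counts, target_total):
    """Step 1: rescale raw per-variant read counts so they sum to target_total."""
    raw_total = sum(counts.values())
    if raw_total == 0:
        raise ValueError("sample has no reads")
    factor = target_total / raw_total
    return {variant: n * factor for variant, n in counts.items()}


def library_adjusted_total(library_size, reference_size, base_reads=BASE_READS):
    """Step 2: scale the read total by library size relative to a reference library."""
    return base_reads * library_size / reference_size


# RolR relative to MphR: a factor of about 1.18, i.e. about 236k reads.
rolr_factor = 4332 / 3667
rolr_total = library_adjusted_total(4332, 3667)
print(round(rolr_factor, 2), round(rolr_total / 1000))
```

      With both steps applied, a flat read count cutoff refers to a comparable sequencing depth per possible variant in every homolog.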

      Changes to manuscript: We have provided substantial additional details in the Fluorescence-activated cell sorting and NGS preparation and analysis sections.

      We also added the following in the main text.

      “In other words, we use cell sorting as a binary classifier i.e., does the mutation disrupt allostery or not. We capture the effect size on individual residues, not individual mutations, by counting the number of dead mutations at a residue position. This is an important consideration because it safeguards us from minor inconsistencies that inevitably arise from cell sorting.”
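
      A minimal sketch of this binary classification is given below. It assumes one particular reading of the criterion, namely that a replicate counts only if the variant reaches the read threshold in both the uninduced and the induced non-fluorescent sort; the function names and data layout are hypothetical.

```python
# Sketch of the yes/no "dead" call and the per-residue tally described above.
# ASSUMPTION: a replicate counts only when the variant passes the threshold in
# both sorts of that same replicate. Names and data layout are hypothetical.

READ_THRESHOLD = 5   # minimum normalized reads to call a variant "present"
MIN_REPLICATES = 2   # replicates (out of three) that must agree


def is_dead(reads_minus, reads_plus, threshold=READ_THRESHOLD,
            min_replicates=MIN_REPLICATES):
    """Binary call for one variant.

    reads_minus / reads_plus: per-replicate read counts in the
    non-fluorescent gate without and with inducer.
    """
    if len(reads_minus) != len(reads_plus):
        raise ValueError("replicate lists must have equal length")
    passing = sum(
        1 for minus, plus in zip(reads_minus, reads_plus)
        if minus >= threshold and plus >= threshold
    )
    return passing >= min_replicates


def dead_counts_per_position(calls):
    """Count dead mutations at each residue position.

    calls maps (position, amino_acid) -> bool, e.g. the output of is_dead().
    """
    counts = {}
    for (position, _amino_acid), dead in calls.items():
        counts[position] = counts.get(position, 0) + int(dead)
    return counts
```

      Raising `threshold` from 5 to 10 in such a sketch corresponds to the robustness check at two sequencing thresholds.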

      Depending on the noise in the data (as captured in the nucleotide-specific q-scores) and the number of nucleotides changed relative to the WT (anywhere between 1-3 for a given amino acid mutation) one might have more or less chance of observing five reads for a given mutation simply due to sequencing noise.

      All the reads considered in our analyses pass the Illumina quality threshold of Q-score ≥ 30 which as per Illumina represent “perfect reads with no errors or ambiguities”. This translates into a probability of 1 in 1000 incorrect base call or 99.9% base call accuracy.
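
      For reference, the conversion between a Phred quality score and the base-call error probability follows the standard definition Q = -10 log10(P). The short sketch below (function names are illustrative) reproduces the 1-in-1000 figure for Q = 30.

```python
import math


def phred_to_error_probability(q):
    """Standard Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse conversion from error probability to Phred score."""
    return -10 * math.log10(p)


p_error = phred_to_error_probability(30)  # error probability of 1 in 1000
accuracy = 1 - p_error                    # 99.9% base-call accuracy
```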

      We use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes a point mutation. This means the nucleotide count and protein count are the same. The scenario referred to by the reviewer, i.e., “anywhere between 1-3 for a given amino acid mutation”, only applies to codon-randomized or error-prone PCR library generation. We regret if the chip-based library assembly part was unclear.

      Depending on the shape and separation of the induced (fluorescent) and uninduced (non-fluorescent) population distributions, one might have more or less chance of observing five reads by chance in the gated non-fluorescent region. The current single threshold does not account for variation in the dynamic range of the assay across homologs.

      We have addressed the concern raised by the reviewer on fluorescent population distributions in answers to questions 10 and 11.

      The reviewer makes an important point about the choice of sequencing threshold. We use the sequencing threshold simply to make a binary choice as to whether a certain variant exists in the sorted population or not. We do not use the sequencing reads to scale the activity of the variant. To address the reviewer's comment, we have included a new supplementary figure (Fig 1 – figure supplement 4) where we compare the data by adjusting the threshold to two levels: 5 and 10 reads. As is evident in the new figure, the fundamental pattern of allosteric hotspots and the overall data interpretation do not change.

      TetR: 5x – 53 hotspots, 10x – 51 hotspots

      TtgR: 5x – 51 hotspots, 10x – 51 hotspots

      MphR: 5x – 48 hotspots, 10x – 48 hotspots

      RolR: 5x – 57 hotspots, 10x – 60 hotspots

      In other words, changing the threshold to be more or less strict may have a modest impact on the overall number of hotspots in the dataset. Still, the regions of functional importance are consistent across different thresholds. We have expanded the discussion in the manuscript to address this point.

      Changes to manuscript: We have now included a new supplementary figure comparing hotspot data at two thresholds: Figure 1 – figure supplement 4.

      We also added the following in the main text.

      “To assess the robustness of our classification of hotspots, we determined the number of hotspots at two different sequencing thresholds – 5x and 10x. At 5x and 10x, the numbers of hotspots are – TetR: 53, 51; TtgR: 51, 51; MphR: 48, 48 and RolR: 57, 60, respectively. Changing the threshold has a modest impact on the overall number of hotspots and the regions of functional importance are consistent at both thresholds.”

      The authors provide a brief written description of the "weighted score" used to define allosteric hotspots (see y-axis for figure 1B), but without an equation, it is not clear what was calculated. Nonetheless, understanding this weighted score seems central to their definition of allosteric hotspots.

      We regret the lack of clarity in our presentation. The weighted score was used to quantify the “deadness” of every residue position in the protein. At each position in the protein, the number of mutations that inhibited activity was summed up and the “deadness” of each mutation was weighted based on the number of replicates in which it appeared to inactivate the protein. Weighted score at each residue position is given by

      Where at position x in the protein, D1 is the number of mutations dead in one replicate only, D2 is the number of mutations dead in 2 replicates, D3 is the number of mutations dead in 3 replicates, and Total is the total number of variants present in the data set (based on sequencing data). Any dead mutation that is seen in only one replicate is discarded and does not contribute to the “deadness” of the residue. Mutations seen in two and three replicates contribute to the score. We have included a new supplementary figure (Fig. 1 – figure supplement 2) to give the reader a detailed heatmap of all mutations and their impact for each protein.
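
      The general shape of such a score can be illustrated with a short sketch. This is not the equation used in the manuscript: the weights `w2` and `w3` and the normalization by Total are assumptions made here for illustration only, and the only properties taken from the description above are that single-replicate calls (D1) do not contribute while two- and three-replicate calls (D2, D3) do.

```python
# Illustrative only: NOT the manuscript's equation.
# ASSUMPTIONS: the weights w2 and w3 and the division by `total` are
# placeholders chosen here; the source does not give their values.


def weighted_score(d1, d2, d3, total, w2, w3):
    """Per-position "deadness" score under the stated assumptions.

    d1 is accepted for completeness but deliberately ignored, because
    mutations dead in only one replicate are discarded in the description.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (w2 * d2 + w3 * d3) / total


# Hypothetical position with 19 variants observed; the weights are arbitrary.
example = weighted_score(d1=4, d2=3, d3=2, total=19, w2=2, w3=3)
```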

      Changes to manuscript: The weighted scoring scheme is now described in greater detail under Materials and Methods in the “NGS preparation and analysis” section.

      The authors do not provide some of the standard "controls" often used to assess deep mutational scanning data. For example, one might expect that synonymous mutations are not categorized as allosterically dead using their methods (because they should still respond to ligand) and that most nonsense mutations are also not allosterically dead (because they should no longer repress GFP under either condition). In general, it is not clear how the authors validated the assay/confirmed that it is giving the expected results.

      As we state in response to question 12, we use chip-based oligonucleotides to build our DMS library, which allows us to pre-specify the exact codon that encodes a point mutation. We have no synonymous or nonsense mutations in our DMS library. Each protein mutation is encoded by a single unique codon. The only stop codon is at the 3′ end of the gene.

      The authors performed three replicates of the experiment, but reproducibility across replicates and noise in the assay is not presented/discussed.

      Changes to manuscript: A new supplementary table (Table 1) is now provided with the pairwise correlation coefficients between all replicates for each protein.
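
      As an illustration of how such a table could be computed, the sketch below calculates a correlation coefficient for every pair of replicates. It assumes Pearson correlation over per-variant values, which the response does not specify, and all names are hypothetical.

```python
import math
from itertools import combinations

# ASSUMPTION: Pearson correlation over per-variant values; the response does
# not state which coefficient was used. All names here are hypothetical.


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(replicates):
    """Map each pair of replicate names to its Pearson coefficient.

    replicates: dict of replicate name -> list of per-variant values,
    with variants in the same order in every list.
    """
    return {
        (a, b): pearson(replicates[a], replicates[b])
        for a, b in combinations(sorted(replicates), 2)
    }
```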

      In the analysis of long-range interactions, the authors assert that "hotspot interactions are more likely to be long-range than those of non-hotspots", but this was not accompanied by a statistical test (Figure 2 - figure supplement 1).

      In response to the reviewer's comment, we now include a paired t-test comparing non-hotspots and hotspots with long-range interactions in the main text.

      Changes to manuscript: In all four aTFs, hotspots constituted a higher fraction of LRIs than non-hotspots (Figure 2 – figure supplement 1; P = 0.07).

    1. Author Response

      Reviewer #1 (Public Review):

      In this study, the authors describe an elegant genetic screen for mutants that suppress defects of MCT1 deletions which are deficient in mitochondrial fatty acid synthesis. This screen identified many genes, including that for Sit4. In addition, genes for retrograde signaling factors (Rtg1, Rtg2 and Rtg3), proteins influencing proteasomal degradation (Rpn4, Ubc4) or ribosomal proteins (Rps17A, Rps29A) were found. From this mix of components, the authors selected Sit4 for further analysis. In the first part of the study, they analyzed the effect of Sit4 in context of MCT1 mutant suppression. This more specific part is very detailed and thorough, the experiments are well controlled and convincing. The second, more general part of the study focused on the effect of Sit4 on the level of the mitochondrial membrane potential. This part is of high general interest, but less well developed. Nevertheless, this study is very interesting as it shows for the first time that phosphate export from mitochondria is of general relevance for the membrane potential even in wild type cells (as long as they live from fermentation), that the Sit4 phosphatase is critical for this process and that the modulation of Sit4 activity influences processes relying on the membrane potential, such as the import of proteins into mitochondria. However, some aspects should be further clarified.

      1) It is not clear whether Sit4 is only relevant under fermentative conditions. Does Sit4 also influence the membrane potential in respiring cells? Fig. S2D shows the membrane potential in glucose and raffinose. Both carbon sources lead to fermentative growth. The authors should also test whether Sit4 levels influence the membrane potential when cells are grown under respirative conditions, such as in ethanol, lactate or glycerol. Even if deletions of Sit4 affect respiration, mutants with altered activity can be easily analyzed.

      sit4Δ cells fail to grow on nonfermentable media, as shown by us (Figure 2—figure supplement 1C) and others (Arndt et al., 1989; Dimmer et al., 2002; Jablonka et al., 2006). In our opinion, the exact reason is unclear, but there is an interesting observation that addition of aspartate can partially restore growth on ethanol (Jablonka et al., 2006). Although this sit4Δ defect has not been thoroughly investigated, an early study speculated that it could be related to the cAMP-PKA pathway (Sutton et al., 1991). This study pointed out genetic interactions of SIT4 with multiple genes in the cAMP-PKA pathway (Sutton et al., 1991). In addition, sit4Δ cells have phenotypes similar to those of cAMP-PKA null mutants, such as glycogen accumulation, caffeine resistance, and failure to grow on nonfermentable media (Sutton et al., 1991). Based on a literature search, we have not found sit4Δ mutants that can grow on nonfermentable media.

      2) The authors should give a name to the pathway shown in Fig. 4D. This would make it easier to follow the text in the results and the discussion. This pathway was proposed and characterized in the 90s by George Clark-Walker and others, but never carefully studied on a mechanistic level. Even if the flux through this pathway cannot be measured in this study, the regulatory role of Sit4 for this process is the most important aspect of this manuscript.

      We now refer to this mechanism as the mitochondrial ATP hydrolysis pathway.

      3) To further support their hypothesis, the authors should show that deletion of Pic1 or Atp1 wipes out the effect of a Sit4 deletion. In these petite-negative mutants, the phosphate export cycle cannot be carried out and thus, Sit4, should have no effect.

      The mitochondrial phosphate transport activity is electroneutral, as it also pumps a proton together with inorganic phosphate. The F1 subunit of the ATP synthase (Atp1 and Atp2) is suggested in many reports to be responsible for the ATP hydrolysis. We performed tetrad dissection to generate atp1Δ or atp2Δ in the pho85Δ background. After streaking single colonies onto a fresh plate, we noticed that atp1Δ mct1Δ and atp2Δ mct1Δ cells are lethal, and knocking out PHO85 rescued this synthetic lethality. It is not surprising that atp1Δ mct1Δ or atp2Δ mct1Δ cells are lethal, since the F1 subunit is important to generate a minimum of MMP in mct1Δ cells when the ETC is absent (i.e., rho0 cells). However, knocking out PHO85 can generate MMP independently of the F1 subunit of ATP synthase, as suggested by the viable atp1Δ mct1Δ pho85Δ and atp2Δ mct1Δ pho85Δ cells. In theory, many ATPases in the mitochondrial matrix could hydrolyze ATP for the ADP/ATP carrier to generate MMP. However, we do not currently know exactly which ATPase(s) is activated by phosphate starvation. These data are now included as Figure 5—figure supplement 1F-G.

      4) What is the relevance of Sit4 for the Hap complex which regulates OXPHOS gene expression in yeast? The supplemental table suggests that Hap4 is strongly influenced by Sit4. Is this downstream of the proposed role in phosphate metabolism or a parallel Sit4 activity? This is a crucial point that should be addressed experimentally.

      To investigate the role of the Hap complex in MMP generation in sit4Δ cells, we overexpressed and knocked out HAP4, the catalytic subunit of the Hap complex, separately in wild-type and sit4Δ cells. We confirmed the HAP4 overexpression by the enriched abundance of ETC complexes as shown in the BN-PAGE (Figure 2—figure supplement 1E). However, we did not observe any rescue of ETC or ATP synthase in mct1Δ cells when HAP4 was overexpressed. The enriched level of ETC complexes caused by HAP4 overexpression is not sufficient to rescue the MMP (Figure 2—figure supplement 1F).

      Next, we knocked out HAP4 in sit4Δ cells. Knocking out SIT4 could still increase MMP in hap4Δ cells with a much-reduced magnitude, which phenocopied ETC subunit and RPO41 deletion in sit4Δ cells (Figure 2—figure supplement 1G).

      In conclusion, the Hap complex is involved in the MMP increase when SIT4 is absent. However, it is not sufficient to increase MMP by overexpressing HAP4. The Hap complex discussion is now included in the manuscript, and the data is presented as Figure 2—figure supplement 1E-G.

      5) The authors use the accumulation of Ilv2 precursors as proxy for mitochondrial protein import efficiency. Ilv2 was reported before as a protein which, if import into mitochondria is slow, is deviated into the nucleus in order to be degraded (Shakya,..., Hughes. 2021, Elife). Is it possible that the accumulation of the precursor is the result of a reduced degradation of pre-Ilv2 in the nucleus rather than an impaired mitochondrial import? Since a number of components of the ubiquitin-proteasome system were identified with Sit4 in the same screen, a role of Sit4 in proteasomal degradation seems possible. This should be tested.

      We thank the reviewer for pointing out this potential caveat with our Ilv2-FLAG reporter. With a limited search and tests, we could not find another reporter that behaves like Ilv2-FLAG. The reason Ilv2-FLAG is a perfect reporter for this study is that in wild-type cells, Ilv2-FLAG is not 100% imported. Therefore, we could demonstrate that mitochondria with higher MMP import more efficiently. Unfortunately, all of the other mitochondrial proteins that we tested were efficiently imported in wild-type cells. To identify other suitable mitochondrial proteins that behave like Ilv2-FLAG, we would need to conduct a more comprehensive screen.

      To address the concern of the involvement of protein degradation in obscuring the interpretation of Ilv2-FLAG import, we performed two experiments. First, we measured the proteasomal activity in wild-type and our mutants using a commercial kit (Cayman). We did not observe a statistically significant difference in 20S proteasomal activity between wild-type and sit4Δ cells.

      In the second experiment, we reduced the MMP of sit4Δ cells using CCCP treatment and measured the Ilv2-FLAG import. We first treated sit4Δ cells with different doses of CCCP for six hours and measured their MMP. sit4Δ cells treated with 75 µM CCCP had comparable MMP to wild-type cells. When we treated sit4Δ cells with higher concentrations of CCCP, most of the cells did not survive after six hours. Next, we performed the Ilv2-FLAG import assay. We observed a similar level of unimported Ilv2-FLAG (marked with *) in sit4Δ cells treated with 75 µM CCCP. This result confirms that sit4Δ cells have a similar Ilv2-FLAG turnover mechanism and activity to wild-type cells, because when we lower the MMP in the sit4Δ background we observe a similar level of unimported Ilv2-FLAG. We thus feel confident in concluding that the Ilv2-FLAG import results are indeed an accurate proxy for MMP level. These data are now included as Figure 1—figure supplement 1H-J in the manuscript.

      Author response image 1.

      Reviewer #2 (Public Review):

      This study reports interesting findings on the influence of a conserved phosphatase on mitochondrial biogenesis and function. In its absence, many nucleus-encoded mitochondrial proteins, among them those involved in ATP generation, are expressed much better than in normal cells. In addition to providing a better understanding of the mechanisms that regulate mitochondrial function, this work may help in developing therapeutic strategies for diseases caused by mitochondrial dysfunction. However, there are a number of issues that need clarification.

      1) The rationale of the screening assay to identify genes required for the gene expression modifications observed in the mct1 mutant is not clear. Indeed, after crossing with the gene deletion library, the cells become heterozygous for the mct1 deletion and should no longer be deficient in mtFAS. Thank you for clarifying this and, if needed, adjusting figure S1D to indicate that the mated cells are heterozygous for the mct1 and xxx mutations.

      We updated the methods section and the graphic for the genetic screen to clarify these points within the SGA workflow overview. After we created the heterozygote by mating mct1Δ cells with the individual KO cells in the collection, these diploids underwent sporulation and selection for the desired double KO haploid. As a result, the luciferase assay was performed in haploid cells with MCT1 and one additional non-essential gene deleted.

      2) The tests shown in Fig. S1E should be repeated on individual subclones (at least 100) obtained after plating for single colonies a glucose culture of mct1 mutant, to determine the proportion of cells with functional (rho+) mtDNA in the mct1 glucose and raffinose cultures. With for instance a 50% proportion of rho- cells, this could substantially influence the results of the analyses made with these cells (including those aiming to evaluate the MMP).

      We agree that this would provide a more confident estimate for population-level characterization of these colonies. It is important to note that we randomly chose 10 individual subclones, and 100% of these colonies were verified to be rho+. This suggests the population has functional mtDNA, and thus we felt confident in the identity of our populations.

      3) The mitochondria area in mct1 cells (Fig.S1G) does not seem to be consistent with the tests in Fig. 1C. that indicate a diminished mitochondrial content in mct1 cells vs wild-type yeast. A better estimate (by WB for instance) of the mitochondrial content in the analyzed strains would enable to better evaluate MMP changes monitored with Mitotracker since the amount of mitochondria in cells correlate with the intensity of the fluorescence signal.

      As the reviewer pointed out, we quantified mitochondrial area based on the Tom70-GFP signal. This measurement is expressed as mitochondrial area over cell size. Cell size is an important parameter when measuring organelle size, as most organelles scale up and down with cell size. mct1Δ cells generally have a smaller cell size than WT cells. Therefore, the mitochondrial area of mct1Δ cells was not significantly different from that of WT cells when scaled to cell size. We believe this is the best method to compare mitochondrial area. As for quantifying MMP from these microscopy images, we measured the average MitoTracker Red fluorescence intensity of each mitochondrion defined by Tom70-GFP. This method inherently normalizes for mitochondrial area when quantifying MMP.
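
The two normalizations described in this response can be illustrated with a minimal sketch. It is not the authors' image-analysis pipeline: the function names, the pixel-count inputs, and the toy values are all hypothetical, and the mask is assumed to be a binary Tom70-GFP segmentation supplied by some upstream step.

```python
from statistics import mean


def area_fraction(mito_area_px, cell_area_px):
    """Mitochondrial area scaled to cell size (both given as pixel counts)."""
    if cell_area_px <= 0:
        raise ValueError("cell area must be positive")
    return mito_area_px / cell_area_px


def mean_masked_intensity(mitotracker, mask):
    """Average MitoTracker signal over the pixels inside the mitochondrial mask.

    Dividing the summed signal by the number of masked pixels removes the
    dependence on how much mitochondrial area the cell contains.
    """
    values = [
        value
        for intensity_row, mask_row in zip(mitotracker, mask)
        for value, inside in zip(intensity_row, mask_row)
        if inside
    ]
    if not values:
        raise ValueError("mask contains no mitochondrial pixels")
    return mean(values)


# Hypothetical numbers: a smaller cell with a proportionally smaller network
# gives the same area fraction as a larger cell.
wt_fraction = area_fraction(200, 1000)
small_cell_fraction = area_fraction(120, 600)

# Hypothetical 2x2 image: only the right-hand column lies inside the mask.
image = [[10, 50], [0, 70]]
mask = [[0, 1], [0, 1]]
per_pixel_signal = mean_masked_intensity(image, mask)
```

Under these assumptions, a cell with twice the mitochondrial area but the same per-pixel signal reports the same mean intensity, which is the property the response relies on.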

      4) Page 12: "These data demonstrate that loss of SIT4 results in a mitochondrial phenotype suggestive of an enhanced energetic state: higher membrane potential, hyper-tubulated morphology and more effective protein import." Furthermore, the sit4 mutant shows higher levels of OXPHOS complexes compared to WT yeast.

      Despite these beneficial effects on mitochondria, the sit4 deletion strain fails to grow on respiratory substrates. It would be good to know whether the authors have some explanation for this apparent contradiction.

      We agree that this was initially puzzling. We provide a more complete explanation above (see comments to reviewer #1 - major concern #1). Briefly, the growth deficiency of sit4Δ cells in non-fermentable media was reported and studied by multiple groups (Arndt et al., 1989; Dimmer et al., 2002; Jablonka et al., 2006). These studies seem to indicate that sit4Δ cells contain more ETC complexes and have a higher OCR but cannot respire on non-fermentable carbon sources. However, we do not think there is yet a clear explanation for this phenotype. One interesting observation reported is that the addition of aspartate partly restores the cells’ growth on ethanol (Jablonka et al., 2006). One early study speculates that this defect could be related to the cAMP-PKA pathway. Sutton et al. pointed out genetic interactions between sit4 and multiple genes in the cAMP-PKA pathway (Sutton et al., 1991). In addition, sit4Δ cells have phenotypes similar to those of cAMP-PKA null mutants, such as glycogen accumulation, caffeine resistance, and failure to grow on non-fermentable media. However, to keep this manuscript succinct, we opted to stay focused on MMP.

      Reviewer #3 (Public Review):

      In this study, the authors investigate the genetic and environmental causes of elevated Mitochondrial Membrane Potential (MMP) in yeast, and also some physiological effects correlated with increased MMP.

      The study begins with a reanalysis of transcriptional data from a yeast mutant lacking the gene MCT1, whose deletion has been shown to cause defects in mitochondrial fatty acid synthesis. The authors note that in raffinose, mct1del cells, unlike WT cells, fail to induce expression of many genes that code for subunits of the Electron Transport Chain (ETC) and ATP synthase. The deletion of MCT1 also causes induction of genes involved in acetyl-CoA production after exposure to raffinose. The authors therefore conduct a screen to identify mutants that suppress the induction of one of these acetyl-CoA genes, Cit2. They then validate the hits from this screen to see which of their suppressor mutants also reduce expression of four other genes induced in an mct1del strain. This yielded 17 genes that abolished induction of all 5 genes tested in an mct1del background during growth on raffinose.

      The authors chose to focus on one of these hits, the gene coding for the phosphatase SIT4 (related to human PP6) which also caused an increase in expression of two respiratory chain genes. The authors then investigated MMP and mitochondrial morphology in strains containing SIT4 and MCT1 deletions and surprisingly saw that sit4del cells had highly elevated MMP, more reticular mitochondria, and were able to fully import the acetolactate synthase protein Ilv2p and form ETC and ATP synthase complexes, even in cells with an mct1del background, rescuing the low MMP, fragmented mitochondria, low import of Ilv2 and an inability to form ETC and ATP synthase complexes phenotypes of the mct1del strain. Surprisingly, the authors find that even though MMP is high and ETC subunits are present in the sit4del mct1del double deletion strain, that strain has low oxygen consumption and cannot grow under respiratory conditions, indicating that the elevated MMP cannot come from fully functional ETC subunits. The authors also observe that deleting key subunits of ETC complex III (QCR2) and IV (COX5) strongly reduced the MMP of the sit4del mutant, which would suggest that the majority of the increase in MMP of the sit4del mutant was dependent on a partially functional ETC. The authors note that there was still an increase in MMP in the qcr2del sit4del and cox4del sit4del strains relative to qcr2del and cox4del strains, indicating that some part of the increase in MMP was not dependent on the ETC.

      The authors dismiss the possibility that the increase in MMP could have been through the reversal of ATP synthase because they observe that inhibition of ATP synthase with oligomycin led to an increase of MMP in sit4del cells, indicating that ATP synthase is operating in a forward direction in sit4del cells.

      Noting that genes for phosphate starvation are induced in sit4del cells, the authors investigate the effects of phosphate starvation on MMP. They found that phosphate starvation caused an increase in MMP and increased Ilv2p import even in the absence of a mitochondrial genome. They find that inhibition of the ADP/ATP carrier (AAC) with bongkrekic acid (BKA) abolishes the increase of MMP in response to phosphate starvation. They speculate that phosphate starvation causes an increase in MMP through the import and conversion of ATP to ADP and subsequent pumping of ADP and inorganic phosphate out of the mitochondria.

      They further show that MMP is also increased when the cyclin dependent kinase PHO85 which plays a role in phosphate signaling is deleted and argue that this indicates that it is not a decrease in phosphate which causes the increase in MMP under phosphate starvation, but rather the perception of a decrease in phosphate as signalled through PHO85. Unlike in the case of SIT4 deletion, the increase in MMP caused by the deletion of pho85 is abolished when MCT1 is deleted.

      Finally they show an increase in MMP in immortalized human cell lines following phosphate starvation and treatment with the phosphate transporter inhibitor phosphonoformic acid (PFA). They also show an increase in MMP in primary hepatocytes and in midgut cells of flies treated with PFA.

      The link between phosphate starvation and elevated MMP is an important and novel finding and the evidence is clear and compelling. Based on their experiments in various mammalian contexts, this link appears likely to be generalizable, and they propose and begin to test an interesting hypothesis for how MMP might occur in response to phosphate starvation in the absence of the Electron Transport Chain.

      The link between phosphate starvation and deletion of the conserved phosphatase SIT4 is also interesting and important, and while the authors' experiments and analysis suggest some connection between the two observations, that connection is still unclear.

      Major points

      MitoTracker is a great fluorescent dye, but it measures membrane potential only indirectly. There is a danger that when cells change growth rates or ion concentrations, or when the pH changes, all MMP-indicating dyes change in fluorescence: their signal is confounded. A change in phosphate levels can possibly do both, altering pH and ion concentrations. Because all conclusions of the manuscript are based on a change in MMP, it would be a great precaution to use a dye-independent measure of membrane potential and confirm at least some key results.

      Mitochondrial MMP does strongly influence amino acid metabolism, and indeed the SIT4 knockout has a quite striking amino acid profile, with histidine, lysine, arginine, and tyrosine being increased in concentration. http://ralser.charite.de/metabogenecards/Chr_04/YDL047W.html Could this amino acid profile support the conclusions of the authors? At least lysine and arginine are down in petites due to a lack of membrane potential and iron sulfur cluster export, and here they are up. Along these lines, according to the same data resource, the knock-outs CSR2, ASF1, SSN8, YLR0358 and MRPL25 share the same metabolic profile. Due to limited time I did not re-analyse the data provided by the authors, but it would be worth checking whether any of these genes came up in the authors' screens.

      We took the mutants within the same cluster as SIT4 shown in this paper from the deletion collection and measured their MMP. yrl358cΔ cells have a similarly high MMP to that observed in sit4Δ cells. However, this gene has an as yet undefined function. Beyond YRL358C, we did not observe similar MMP increases in other gene deletions from this panel, which does not support the notion that amino acids such as histidine, lysine, arginine, or tyrosine play a determining role in driving MMP.

      The media conditions and strain used in the suggested paper are very different from those used in our study. Instead of growing prototrophic cells in minimal media without any amino acids, we used auxotrophic yeast strains and grew them in media containing complete amino acids. So far, none of the other defects or signaling associated with SIT4 deletion could influence MMP as much as phosphate signaling. We interpret these data as supporting the hypothesis that the MMP observation in sit4Δ cells is connected with phosphate signaling, as illustrated by the second half of the story in our manuscript.

      Author response image 2.

      One important claim in the manuscript attempts to explain a mechanism for the MMP increase in response to phosphate starvation which is independent of the ETC and ATP synthase.

      It seems to me the only direct evidence to support this claim is that inhibition of the AAC with BKA stops the increase of mitotracker fluorescence in response to phosphate starvation in both WT and rho0 cells (Figs 4B and 4C). It would strengthen the paper if the authors could provide some orthogonal evidence.

      This comment is similar to the one raised by reviewer #1 - major concern #3. We refer the reviewer to our discussion and the new data above. Briefly, we do not think the F1 subunit is responsible for the ATP hydrolysis activity that generates MMP under phosphate-depleted conditions. We believe there are additional ATPase(s) in the mitochondrial matrix that can be coupled to the ADP/ATP carrier for MMP generation during phosphate starvation. However, we have not identified the relevant ATPase(s) at this point, and it is likely that multiple ATPases could contribute to this activity.

      Introduction/Discussion: The authors might want to make the reader of the article aware that the 'reversal' of the ATP synthase directionality, i.e. ATP hydrolysis by the ATP synthase as a mechanism to create a membrane potential (in petites), has always been a provocative idea, but one that thus far could never be fully substantiated. Indeed, some people who are very familiar with the topic are skeptical that this happens. For instance, Vowinckel et al 2021 (PMID: 34799698) measured precise carbon balances for petite cells and found no evidence for a futile cycle: petites grow slower, but accumulate the same biomass from glucose as petites that re-evolve a fast growth rate. Perhaps the manuscript could be updated accordingly.

      We thank the reviewer for pointing out this additional relevant study. We have rephrased the referenced sentence in the introduction. The MMP generation in phosphate starvation is independent of the F1 portion of ATP synthase. Therefore, our data neither support nor refute either of these arguments.

      In the introduction and conclusion there is discussion of MMP set points. In particular the authors state:

      "Critically, we find that cells often prioritize this MMP setpoint over other bioenergetic priorities, even in challenging environments, suggesting an important evolutionary benefit."

      This does not seem to be consistent with the central finding of the manuscript that MMP changes under phosphate starvation. MMP doesn't seem so much to have a 'set point' but rather be an important physiological variable that reacts to stimuli such as phosphate starvation.

      The reviewer raises a rational alternative hypothesis to the one that we have proposed. In reality, both of these are complete speculations to explain the data and we can’t think of any way to test the evolutionary basis for the mechanisms that we describe. We recognize that untested/untestable speculative arguments have limitations and there are viable alternative hypotheses. We have softened our language to ensure that it is clear that this is only a speculation.

      The authors suggest that deletion of Pho85 causes an increase in MMP because of cellular signaling. However, they also state in the conclusion:

      "Unlike phosphate starvation, the pho85Δ mutant has elevated intracellular phosphate concentrations. This suggests that the phosphate effect on MMP is likely to be elicited by cellular signaling downstream of phosphate sensing rather than some direct effect of environmental depletion of phosphate on mitochondrial energetics."

      The authors should cite the study that shows deletion of PHO85 causes increased intracellular phosphate concentrations. It also seems possible that the 'cellular signaling' that causes the increase in MMP could be a result of this increase in intracellular phosphate concentrations, which could constitute a direct effect of an environmental overload of phosphate on mitochondrial energetics.

      We now cite the literature that shows higher intracellular phosphate in pho85Δ cells (Gupta et al., 2019; Liu et al., 2017). Depleting phosphate in the media drastically reduced intracellular phosphate concentration, which is the opposite of the situation in pho85Δ cells. Nevertheless, we observed higher MMP in both situations. We concluded from these two observations that the increase in MMP is a response to the signaling activated by phosphate depletion rather than to the intracellular phosphate abundance.

      Related to this point, in the conclusion, the authors state:

      "We now show that intracellular signaling can lead to an increased MMP even beyond the wild-type level in the absence of mitochondrial genome."

      In sum, the data shows that signaling is important here- but signaling alone is only the message - not the biophysical process that creates a membrane potential. The authors then could revise this slightly.

      We have rephrased this sentence as suggested, which now reads “We now show that intracellular signaling triggers a process that can lead to an increased MMP even beyond the wild-type level in the absence of mitochondrial genome”.

      The authors state in the conclusion that

      "We first made the observation that deletion of the SIT4 gene, which encodes the yeast homologue of the mammalian PP6 protein phosphatase, normalized many of the defects caused by loss of mtFAS, including gene expression programs, ETC complex assembly, mitochondrial morphology, and especially MMP (Fig. 1)"

      The data shown, though, indicate that, in terms of MMP, deletion of SIT4 causes a huge increase (and a departure from normality) whether or not mct1 is present (Fig 1D).

      We changed the word “normalized” to “reversed”. In the discussion section, we also emphasized that many of these increases are independent of mitochondrial dysfunction induced by loss of mtFAS.

      The language "SIT4 is required for both the positive and negative transcriptional regulation elicited by mitochondrial dysfunction" feels strong. SIT4 seems to influence positive transcriptional regulation in response to mitochondrial dysfunction caused by MCT1 deletion (but may not be the only thing as there appears to be an increase in CIT2 expression in a sit4del background following a further deletion of MCT1). In terms of negative regulation, SIT4 deletion clearly affects the baseline, but MCT1 deletion still causes down regulation of both examples shown in Fig 1B, showing that negative transcriptional regulation can still occur in the absence of SIT4. The authors might consider showing fold change of expression as they do in later figures (Figs 4B and C) to help the reader evaluate the quantitative changes they demonstrate.

      We now display the fold change as suggested. This sentence now reads “These data suggest that SIT4 positively and negatively influences transcriptional regulation elicited by mitochondrial dysfunction”.

      The authors induce phosphate starvation by adding increasing amounts of potassium phosphate monobasic at a pH of 4.1 to phosphate dropout media supplemented with potassium. The authors did well to avoid confounding effects of removing potassium. The final pH of YNB is typically around 5.2. Is it possible that the authors are confounding a change in pH with phosphate starvation? One would expect the media in the phosphate starvation condition to have a higher pH than the phosphate replacement or control media. Is a change in pH possibly a confounding factor when interpreting phosphate starvation? Perhaps the authors could quantify the pH of the media they use for the experiment to understand how much of a factor that could be. One needs to be careful with MitoTracker and any other fluorescent dye when pH changes. Although it has constraints of its own, MitoLoc, as a protein rather than a small-molecule marker of MMP, might be a good complement.

      We followed the protocol used by many other studies that depleted phosphate in the media. The reason we and others adjusted the media without inorganic phosphate to a pH of 4.1 is that this is the pH of phosphate monobasic. From there, we could add phosphate monobasic to create +Pi media without changing the media pH. Therefore, media containing different concentrations of phosphate all have the exact same pH. We now emphasize in the manuscript that all media containing different levels of inorganic phosphate have the same pH, to eliminate this concern (see page 18).

      Even though all media have the same pH, we also provide complementary data using a parallel approach that measures MMP by assessing mitochondrial protein import, as demonstrated previously with Ilv2-FLAG, which relies on the same principle as MitoLoc.

      Reference

      Arndt, K. T., Styles, C. A., & Fink, G. R. (1989). A suppressor of a HIS4 transcriptional defect encodes a protein with homology to the catalytic subunit of protein phosphatases. Cell, 56(4), 527–537. https://doi.org/10.1016/0092-8674(89)90576-X

      Dimmer, K. S., Fritz, S., Fuchs, F., Messerschmitt, M., Weinbach, N., Neupert, W., & Westermann, B. (2002). Genetic basis of mitochondrial function and morphology in Saccharomyces cerevisiae. Molecular Biology of the Cell, 13(3), 847–853. https://doi.org/10.1091/mbc.01-12-0588

      Gupta, R., Walvekar, A. S., Liang, S., Rashida, Z., Shah, P., & Laxman, S. (2019). A tRNA modification balances carbon and nitrogen metabolism by regulating phosphate homeostasis. ELife, 8, e44795. https://doi.org/10.7554/eLife.44795

      Jablonka, W., Guzmán, S., Ramírez, J., & Montero-Lomelí, M. (2006). Deviation of carbohydrate metabolism by the SIT4 phosphatase in Saccharomyces cerevisiae. Biochimica et Biophysica Acta (BBA) - General Subjects, 1760(8), 1281–1291. https://doi.org/10.1016/j.bbagen.2006.02.014

      Liu, N.-N., Flanagan, P. R., Zeng, J., Jani, N. M., Cardenas, M. E., Moran, G. P., & Köhler, J. R. (2017). Phosphate is the third nutrient monitored by TOR in Candida albicans and provides a target for fungal-specific indirect TOR inhibition. Proceedings of the National Academy of Sciences, 114(24), 6346–6351. https://doi.org/10.1073/pnas.1617799114

      Sutton, A., Immanuel, D., & Arndt, K. T. (1991). The SIT4 protein phosphatase functions in late G1 for progression into S phase. Molecular and Cellular Biology, 11(4), 2133–2148.

    1. Author Response:

      Reviewer #1 (Public Review):

      Cell surface proteins are of vital interest in the functions and interactions of cells and their neighbors. In addition, cells manufacture and secrete small membrane vesicles that appear to represent a subset of the cell surface protein composition.

      Various techniques have been developed to allow the molecular definition of many cell surface proteins, but most rely on the special chemistry of amino acid residues in the parts of membrane proteins exposed to the cell exterior.

      In this report Kirkemo et al. have devised a method that more comprehensively samples the cell surface protein composition by relying on the membrane insertion or protein glycan adhesion of an enzyme that attaches a biotin group to a nearest neighbor cellular protein. The result is a more complex set of proteins and distinctive differences between normal and myc oncogene tumor cells and their secreted extracellular vesicle counterparts. These results may be applied to the identification of unique cell surface determinants in tumor cells that could be targets for immune or drug therapy. The results may be strengthened by a more thorough evaluation of the different EV membrane species represented in the broad collection of EVs used in this investigation.

      We thank the reviewer for recognizing the importance of the work outlined in the manuscript. We have addressed the necessary improvements in the essential revisions section above.

      Reviewer #2 (Public Review):

      This paper describes two methods for labeling cell-surface proteins. Both methods involve tethering an enzyme to the membrane surface to probe the proteins present on cells and exosomes. Two different enzyme constructs are used: a single strand lipidated DNA inserted into the membrane that enables binding of an enzyme conjugated to a complementary DNA strand (DNA-APEX2) or a glycan-targeting binding group conjugated to horseradish peroxidase (WGA-HRP). Both tethered enzymes label proteins on the cell surface using a biotin substrate via a radical mechanism. The method provides significantly enhanced labeling efficiency and is much faster than traditional chemical labeling methods and methods that employ soluble enzymes. The authors comprehensively analyze the labeled proteins using mass spectrometry and find multiple proteins that were previously undetectable with chemical methods and soluble enzymes. Furthermore, they compare the labeling of both cells and the exosomes that are formed from the cells and characterize both up- and down-regulated proteins related to cancer development that may provide a mechanistic underpinning.

      Overall, the method is novel and should enable the discovery of many low-abundance cell-surface proteins through more efficient labeling. The DNA-APEX2 method will only be accessible to more sophisticated laboratories that can carry out the protocols, but the WGA-HRP method employs a readily available commercial product and gives equivalent, perhaps even better, results. In addition, the method cannot discriminate between proteins that are genuinely expressed on the cell and those that are non-specifically bound to the cell surface.

      The authors describe the approach and identify two unique proteins on the surface of prostate cell lines.

      Strengths:

      Good introduction with appropriate citations of relevant literature. Much higher labeling efficiency and faster than chemical methods and soluble enzyme methods. Ability to detect low-abundance proteins not accessible with previous labeling methods.

      Weaknesses: The DNA-APEX2 method requires specialized reagents and protocols that are much more challenging for a typical laboratory to carry out than conventional chemical labeling methods.

      The claims and findings are sound. The finding of novel proteins and the quantitative measurement of protein up- and down-regulation are important. The concern about non-specifically bound proteins could be addressed by looking at whether the detected proteins have a transmembrane region that would enable them to localize in the cell membrane.

      We thank the reviewer for recognizing the strengths and importance of this work. We also thank the reviewer for mentioning the issue of non-specifically bound proteins. As addressed above in the essential revisions sections, we believe that any low affinity, non-specific binding proteins are likely removed in the multiple wash/centrifugation steps on cells or the multiple centrifugation steps and sucrose gradient purification on EVs. Given the likelihood for removal of non-specific binders, we believe that the secreted proteins identified are likely high affinity interactions and their differential expression on either cells or EVs play an important part in the downstream biology of both sample types. However, the previous data presentation did not clarify which proteins pertained to the transmembrane plasma membrane proteome versus secreted protein forms. For further clarity in the data presentation (Figure 3D, 4D, 5D), we have bolded proteins that are also found in the SURFY database that only includes surface annotated proteins with a predicted transmembrane domain (Bausch-Fluck et al., The in silico human surfaceome. PNAS. 2018). We have also italicized proteins that are annotated to be secreted from the cell to the extracellular space (Uniprot classification). We have updated the text and caption as shown below:

      New Figure 3:

      Figure 3. WGA-HRP identifies a number of enriched markers on Myc-driven prostate cancer cells. (A) Overall scheme for biotin labeling, and label-free quantification (LFQ) by LC-MS/MS for RWPE-1 Control and Myc over-expression cells. (B) Microscopy image depicting morphological differences between RWPE-1 Control and RWPE-1 Myc cells after 3 days in culture. (C) Volcano plot depicting the LFQ comparison of RWPE-1 Control and Myc labeled cells. Red labels indicate upregulation in the RWPE-1 Control cells over Myc cells and green labels indicate upregulation in the RWPE-1 Myc cells over Control cells. All colored proteins are 2-fold enriched in either dataset between four replicates (two technical, two biological, p<0.05). (D) Heatmap of the 15 most upregulated transmembrane (bold) or secreted (italics) proteins in RWPE-1 Control and Myc cells. Scale indicates intensity, defined as (LFQ Area - Mean LFQ Area)/standard deviation. Extracellular proteins with annotated transmembrane domains are bolded and annotated secreted proteins are italicized. (E) Table indicating fold-change of most differentially regulated proteins by LC-MS/MS for RWPE-1 Control and Myc cells. (F) Upregulated proteins in RWPE-1 Myc cells (Myc, ANPEP, Vimentin, and FN1) are confirmed by western blot. (G) Upregulated surface proteins in RWPE-1 Myc cells (Vimentin, ANPEP, FN1) are detected by immunofluorescence microscopy. The downregulated protein HLA-B by Myc over-expression was also detected by immunofluorescence microscopy. All western blot images and microscopy images are representative of two biological replicates. Mass spectrometry data is based on two biological and two technical replicates (N = 4).
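
The heatmap scale and the volcano-plot coloring rule stated in this caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the caption does not say whether the standard deviation is the sample or population form (the sample form is assumed here), and the p-value is taken as an input rather than computed.

```python
from math import log2
from statistics import mean, stdev


def row_zscores(lfq_areas):
    """Scale one protein's LFQ areas as (LFQ Area - Mean LFQ Area) / SD.

    Assumption: SD is the sample standard deviation across the samples
    shown in the heatmap row.
    """
    mu = mean(lfq_areas)
    sd = stdev(lfq_areas)
    if sd == 0:
        # A protein with identical areas in every sample has no spread to scale.
        return [0.0 for _ in lfq_areas]
    return [(area - mu) / sd for area in lfq_areas]


def is_enriched(mean_area_a, mean_area_b, p_value, min_fold=2.0, alpha=0.05):
    """True when a protein is at least min_fold different in either direction
    between two conditions and the supplied p-value is below alpha."""
    log_fold_change = log2(mean_area_a / mean_area_b)
    return abs(log_fold_change) >= log2(min_fold) and p_value < alpha


# Hypothetical values for illustration only.
scaled = row_zscores([1.0, 2.0, 3.0])
colored = is_enriched(400.0, 100.0, p_value=0.01)
not_colored = is_enriched(150.0, 100.0, p_value=0.01)
```

Working on the log2 scale treats a 2-fold increase and a 2-fold decrease symmetrically, which matches the caption's "2-fold enriched in either dataset".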

      New Figure 4:

      Figure 4. WGA-HRP identifies a number of enriched markers on Myc-driven prostate cancer EVs. (A) Workflow for small EV isolation from cultured cells. (B) Labeled proteins indicating canonical exosome markers (ExoCarta Top 100 List) detected after performing label-free quantification (LFQ) from whole EV lysate. The proteins are graphed from least abundant to most abundant. (C) Workflow of exosome labeling and preparation for mass spectrometry. (D) Heatmap of the 15 most upregulated proteins in RWPE-1 Control or Myc EVs. Scale indicates intensity, defined as (LFQ Area - Mean LFQ Area)/SD. Extracellular proteins with annotated transmembrane domains are bolded and annotated secreted proteins are italicized. (E) Table indicating fold-change of most differentially regulated proteins by LC-MS/MS for RWPE-1 Control and Myc cells. (F) Upregulated proteins in RWPE-1 Myc EVs (ANPEP and FN1) are confirmed by western blot. Mass spectrometry data is based on two biological and two technical replicates (N = 4). Due to limited sample yield, one replicate was performed for the EV western blot.

      New Figure 5:

      Figure 5. WGA-HRP identifies a number of EV-specific markers that are present regardless of oncogene status. (A) Matrix depicting samples analyzed during LFQ comparison: Control and Myc cells, as well as Control and Myc EVs. (B) Principal component analysis (PCA) of all four groups analyzed by LFQ. Component 1 (50.4%) and component 2 (15.8%) are depicted. (C) Functional annotation clustering was performed using DAVID Bioinformatics Resource 6.8 to classify the major constituents of component 1 in the PCA. (D) Heatmap of the 25 most upregulated proteins in RWPE-1 cells or EVs. Proteins are listed in decreasing order of expression with the most highly expressed proteins in EVs on the far left and the most highly expressed proteins in cells on the far right. Scale indicates intensity, defined as (LFQ Area - Mean LFQ Area)/SD. Extracellular proteins with annotated transmembrane domains are bolded and annotated secreted proteins are italicized. (E) Table indicating fold-change of most differentially regulated proteins by LC-MS/MS for RWPE-1 EVs compared to parent cells. (F) Western blot showing the EV-specific markers ITIH4, IGSF8, and MFGE8. Mass spectrometry data is based on two biological and two technical replicates (N = 4). Due to limited sample yield, one replicate was performed for the EV western blot.

      Authors mention time-sensitive changes but it is unclear how this method would enable one to obtain this kind of data. How would this be accomplished? The statement "Due to the rapid nature of peroxidase enzymes (1-2 min), our approaches enable kinetic experiments to capture rapid changes, such as binding, internalization, and shuttling events." Yes, it is faster, but not sure I can think of an experiment that would enable one to capture such events.

      We thank the reviewer for this comment and for giving us an opportunity to elaborate on the types of experiments enabled by this new method. A previous study (Y. Li et al. Rapid Enzyme-Mediated Biotinylation for Cell Surface Proteome Profiling. Anal. Chem. 2021) showed that labeling the cell surface with soluble HRP allowed the researchers to detect immediate surface protein changes in response to insulin treatment. They demonstrated differential surfaceome profiles at 5 minutes vs 2 hours following treatment with insulin. Only methods utilizing these rapid labeling enzymes could allow for this type of resolution. A few other biological settings that experience rapid cell surface changes are: response to drug treatment, T-cell activation and synapse formation (S. Valitutti et al. The space and time frames of T cell activation at the immunological synapse. FEBS Letters. 2010), and GPCR activation (T. Gupte et al. Minute-scale persistence of a GPCR conformation state triggered by non-cognate G protein interactions primes signaling. Nat. Commun. 2019). We also believe the method would be useful for post-translational processes where proteins are rapidly shuttling to the cell surface. We have updated the discussion to elaborate on these types of experiments.

      "Due to the fast kinetics of peroxidase enzymes (1-2 min), our approaches could enable kinetic experiments to capture rapid post-translational trafficking of surface proteins, such as response to insulin, certain drug treatments, T-cell activation and synapse formation, and GPCR activation."

      The authors do not have any way to differentiate between proteins expressed by cells and presented on their membranes from proteins that non-specifically bind to the membrane surface. Non-specific binding (NSB) is not addressed. Proteins can non-specifically bind to the cell or EV surface. The results are obtained by comparisons (cells vs exosomes, controls vs cancer cells), which is fine because it means that what is being measured is differentially expressed, so even NSB proteins may be up- and down-regulated. But the proteins identified need to be confirmed. For example, are all the proteins being detected transmembrane proteins that are known to be associated with the membrane?

      As mentioned above, we utilized the most rigorous informatics analysis available (Uniprot and SURFY) to annotate the proteins we find as having a signal sequence and/or TM domain. Data shown in heatmaps are based off of significance (p < 0.05) across all four replicates, which supports that any secreted proteins present are likely due to actual biological differences between oncogenic status and/or sample origin (i.e. EV vs cell). We have addressed this point in a previous comment above.

      The term "extracellular vesicles" (EVs) might be more appropriate than "exosomes" to describe the studied preparation.

      As we describe above in response to earlier comments, we have systematically changed from using exosomes to small extracellular vesicles and better defined the isolation procedure that we used in the methods section.

      Reviewer #3 (Public Review):

      The article by Kirkemo et al explores approaches to analyse the surface proteome of cells or cell-derived extracellular vesicles (EVs, called here exosomes, but the more generic term "extracellular vesicles" would be more appropriate because the used procedure leads to co-isolation of vesicles of different origin), using tools to tether proximity-biotinylation enzymes to membranes. The authors determine the best conditions for surface labeling of cells, and demonstrate that tethering the enzymes (APEX or HRP) increases the number of proteins detected by mass-spectrometry. They further use one of the two approaches (where HRP binds to glycans), to analyse the biotinylated proteome of two variants of a prostate cancer cell line, and the corresponding EVs. The approaches are interesting, but their benefit for analysis of cells or EVs is not very strongly supported by the data.

      First, the authors honestly show (fig2-suppl figures) that only 35% of the proteins identified after biotinylation with their preferred tool actually correspond to annotated surface proteins. This is only slightly better than results obtained with a non-tethered sulfo-NHS-approach (30%).

      We thank the reviewer for this comment. The reason we utilize membrane protein enrichment methods is that membrane protein abundance is low compared to cytosolic proteins and their identification can be overwhelmed by cytosolic contaminants. Nonetheless, despite our best efforts to limit labeling to the membrane proteins, cytosolic proteins can carry over. Thus, we utilize informatics methods to identify the proteins that are annotated to be membrane associated. The Uniprot GOCC (Gene Ontology Cellular Component) Plasma Membrane database is the most inclusive of membrane proteins, requiring only that they contain a signal sequence, transmembrane domain, GPI anchor, or other membrane-associated motif, yielding a total of 5,746 proteins. This will include organelle membrane proteins. It is known that proteins can traffic from the internal organelles to the cell surface, so these can be bona fide cell surface proteins too. To increase the informatics stringency for membrane proteins, we have now applied a new database aggregated from work by the Wollscheid lab, called SURFY (Bausch-Fluck et al., The in silico human surfaceome. PNAS. 2018). This is a machine learning method trained on 735 high confidence membrane proteins from the Cell Surface Protein Atlas (CSPA). SURFY predicts a total of 2,886 cell surface proteins. When we filter our data using SURFY for proteins, peptides and label free quantitation (LFQ) area for three methods, we find that the difference between NHS-Biotin and WGA-HRP expands considerably (see new Figure 3-Supplemental Figure 1 below). We observe these differences when the datasets are searched with either the GOCC Plasma Membrane database or the entire human Uniprot database. The difference is especially large for LFQ analysis, which quantitatively scores peptide intensity as opposed to simply counting the number of hits, as in the protein and peptide analyses. Cytosolic carryover is the major disadvantage of NHS-Biotin, which suppresses signal strength and is reflected in the lower LFQ values (24% for NHS-biotin compared to 40% for WGA-HRP). We have updated the main text and supplemental figure below:

      "Both WGA-HRP and biocytin hydrazide had similar levels of cell surface enrichment on the peptide and protein level when cross-referenced with the SURFY curated database for extracellular surface proteins with a predicted transmembrane domain (Figure 3 - Figure supplement 1A). Sulfo-NHS-LC-LC-biotin and whole cell lysis returned the lowest percentage of cell surface enrichment, suggesting a larger portion of the total sulfo-NHS-LC-LC-biotin protein identifications were of intracellular origin, despite the use of the cell-impermeable format. These same enrichment levels were seen when the datasets were searched with the curated GOCC-PM database, as well as the Uniprot entire human proteome database (Figure 3 - Figure supplement 1B). Importantly, of the proteins quantified across all four conditions, biocytin hydrazide and WGA-HRP returned higher overall intensity values for SURFY-specified proteins than either sulfo-NHS-LC-LC-biotin or whole cell lysis. Importantly, although biocytin hydrazide shows slightly higher cell surface enrichment compared to WGA-HRP, we were unable to perform the comparative analysis at 500,000 cells--instead requiring 1.5 million--as the protocol yielded too few cells for analysis."

      Figure 3-Figure Supplement 1. Comparison of surface enrichment between replicates for different mass spectrometry methods. (A) The top three methods (NHS-Biotin, Biocytin Hydrazide, and WGA-HRP) were compared for their ability to enrich cell surface proteins on 1.5 M RWPE-1 Control cells by LC-MS/MS after being searched with the Uniprot GOCC Plasma Membrane database. Shown are enrichment levels on the protein, peptide, and average MS1 intensity of top three peptides (LFQ area) levels. (B) The top three methods (NHS-Biotin, Biocytin Hydrazide, and WGA-HRP) were compared for their ability to enrich cell surface proteins on 1.5 M RWPE-1 Control cells by LC-MS/MS after being searched with the entire human Uniprot database. Shown are enrichment levels on the protein, peptide, and average MS1 intensity of top three peptides (LFQ area) levels. Proteins or peptides detected from cell surface annotated proteins (determined by the SURFY database) were divided by the total number of proteins or peptides detected. LFQ areas corresponding to cell surface annotated proteins (SURFY) were divided by the total area sum intensity for each sample. The corresponding percentages for two biological replicates were plotted.
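
      The enrichment percentages described in this legend can be illustrated with a minimal sketch. It is not the analysis pipeline used for the manuscript: the record layout (`id`, `n_peptides`, `lfq_area`), the function name, and the toy values are assumptions made for illustration, and `surface_ids` stands in for a set of SURFY-annotated accessions.

```python
def surface_enrichment(proteins, surface_ids):
    """Percent of proteins, peptides, and LFQ area attributable to surface proteins.

    proteins    -- list of dicts with hypothetical keys 'id', 'n_peptides', 'lfq_area'
    surface_ids -- set of protein ids annotated as cell surface (e.g. from SURFY)
    """
    surface = [p for p in proteins if p["id"] in surface_ids]

    def pct(part, whole):
        # Guard against empty samples so the function never divides by zero.
        return 100.0 * part / whole if whole else 0.0

    return {
        "protein_pct": pct(len(surface), len(proteins)),
        "peptide_pct": pct(sum(p["n_peptides"] for p in surface),
                           sum(p["n_peptides"] for p in proteins)),
        "lfq_area_pct": pct(sum(p["lfq_area"] for p in surface),
                            sum(p["lfq_area"] for p in proteins)),
    }


# Toy sample: one surface-annotated protein and one cytosolic carry-over protein.
toy_sample = [
    {"id": "P_SURFACE", "n_peptides": 3, "lfq_area": 60.0},
    {"id": "P_CYTOSOL", "n_peptides": 1, "lfq_area": 40.0},
]
print(surface_enrichment(toy_sample, {"P_SURFACE"}))
```

      Because the LFQ percentage weights each protein by its intensity rather than counting it once, a few abundant cytosolic contaminants can lower it sharply even when the protein-level percentage looks similar between methods.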

      There are additional advantages to WGA-HRP over NHS-biotin. These include: (i) labeling time is 2 min versus 30 min, which would afford higher kinetic resolution as needed, and (ii) the NHS-biotin labels lysines, which hinders tryptic cleavage and downstream peptide analysis, whereas the WGA-HRP labels tyrosines, eliminating impacts on tryptic patterns. WGA-HRP is slightly below biocytin hydrazide in peptide and protein ID and somewhat more by LFQ. However, there are significant advantages over biocytin hydrazide: (i) sample size for WGA-HRP can be reduced by a factor of 3-5 because of cell loss during the multiple washing steps after periodate oxidation and hydrazide labeling, (ii) the time of labeling is dramatically reduced from 3 hr for hydrazide to 2 min for WGA-HRP, and (iii) the HRP enzyme has a large labeling diameter (20-40 nm, but also reported up to 200 nm) and can label non-glycosylated membrane proteins as opposed to biocytin hydrazide that only labels glycosylated proteins. The hydrazide method is the current standard for membrane protein enrichment, and we feel that the WGA-HRP will compete especially when cell sample size is limited or requires special handling. In the case of EVs, we were not able to perform hydrazide labeling due to the two-step process and small sample size.

      Indeed the list of identified proteins in figures 4 and 5 include several proteins whose expected subcellular location is internal, not surface exposed, and whose location in EVs should also be inside (non-exhaustively: SDCBP = syntenin, PDCD6IP = Alix, ARRDC1, VPS37B, NUP35 = nucleopore protein)…

      We thank the reviewer for this comment. We have elaborated on this point in a number of response paragraphs above. The proteins that the reviewer points out are annotated as “plasma membrane” in the very inclusive GOCC plasma membrane database. However, this means that they may also spend time in other locations in the cell or reside on organelle membranes. We have done further analysis to remove any intracellular membrane residing proteins that are included in the GOCC plasma membrane database, including the five proteins mentioned above. We also have further highlighted proteins that appear in the SURFY database, as discussed above and in our response to Reviewer 2’s comment. To increase stringency, we have bolded proteins that are found in the more selective SURFY database and italicized secreted proteins. Due to our new analysis and data presentation, it is more clear which markers are bona fide extracellular resident membrane proteins. We have updated the Figures and Figure legends as mentioned above, as well as added this statement in the Data Processing and Analysis methods:

      "Additionally, to not miss any key surface markers such as secreted proteins or anchored proteins without a transmembrane domain, we chose to initially avoid searching with a more stringent protein list, such as the curated SURFY database. However, following the analysis, we bolded proteins found in the SURFY database and italicized proteins known to be secreted (Uniprot)."

      The membrane proteins identified as different between the control and Myc-overexpressing cells or their EVs, would have been identified as well by a regular proteomic analysis.

      To directly compare surfaceomes of EVs to cells, we are compelled to use the same proteomic method. For parental cell surfaceomic analysis, a membrane enrichment method is required due to the high levels of cytosolic proteins that swamp out signal from membrane proteins. Although EVs have a higher proportion of membrane to cytosol, whole EV proteomics would still have significant cytosolic contamination.

      Second, the title highlights the benefit of the technique for small-scale samples: this is demonstrated for cells (figures 1-2), but not for EVs: no clear quantitative indication of amount of material used is provided for EV samples. Furthermore, no comparison with other biotinylation technics such as sulfo-NHS is provided for EVs/exosomes. Therefore, it is difficult to infer the benefit of this technic applied to the analysis of EVs/exosomes.

      We appreciate the reviewer for this comment. We have updated the methods as mentioned above in our response to the Essential Revisions. In brief, the yield of EVs post-sucrose gradient isolation was 3-5 µg of protein from 16x15 cm2 plates of cells, totaling 240 mL of media. Since we had previously demonstrated that our method was superior to sulfo-NHS for enriching surface proteins on cells, we proceeded to use the WGA-HRP for the EV labeling experiments.

      In addition, the WGA-based tethering approach, which is the only one used for the comparative analysis of figures 4 and 5, possibly induces a bias towards identification of proteins with a particular glycan signature: a novelty would possibly have come from a comparison of this approach with the other initially evaluated, the DNA-APEX one, where tethering is induced by lipid moieties, thus should not depend on glycans. The authors may have then identified by LC-MS/MS specific glycan-associated versus non-glycan-associated proteins in the cells or EVs membranes. Also ideally, the authors should have compared the 4 combinations of the 2 enzymes (APEX and HRP) and 2 tethers (lipid-bound DNA and WGA) to identify the bias introduced by each one.

      We thank the reviewer for this comment. We performed analysis to determine whether there was a bias towards Uniprot annotated “Glyco” vs “Non-Glyco” surface proteins within the SURFY database identified across the WGA-HRP, APEX2-DNA, APEX2, and HRP labeling methods. We performed this analysis by measuring the total LFQ area detected for each category (glycoprotein vs non-glycoprotein) and dividing that by the total LFQ area found across all proteins detected in the sample. We found similar normalized areas of non-glyco surface proteins between WGA-HRP and APEX2-DNA suggesting there is not a bias against non-glycosylated proteins in the WGA-HRP sample. There were slightly elevated levels of Glycoproteins in the WGA-HRP sample over APEX2-DNA. It is not surprising to us that there is little bias because the free-radicals generated by biotin-tyramide can label over tens of nanometers and thus can label not just the protein they are attached to, but neighbors also, regardless of glycosylation status. We have added this as Figure 2-Supplement 3, and amended the text in the manuscript below in purple.
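
      The normalization described in this response (LFQ area per category divided by the total LFQ area of the sample) can be sketched as follows. This is an illustration under stated assumptions, not the code used for the manuscript: the keys `lfq_area`, `in_surfy`, and `is_glyco`, the function name, and the toy values are all hypothetical.

```python
def glyco_area_fractions(proteins):
    """Fraction of total LFQ area from SURFY glyco- and non-glycoproteins.

    proteins -- list of dicts with hypothetical keys:
                'lfq_area' (float), 'in_surfy' (bool), 'is_glyco' (bool)
    """
    total = sum(p["lfq_area"] for p in proteins)
    if total == 0:
        return {"glyco": 0.0, "non_glyco": 0.0}

    glyco = sum(p["lfq_area"] for p in proteins
                if p["in_surfy"] and p["is_glyco"])
    non_glyco = sum(p["lfq_area"] for p in proteins
                    if p["in_surfy"] and not p["is_glyco"])
    # Proteins outside SURFY contribute to the denominator only.
    return {"glyco": glyco / total, "non_glyco": non_glyco / total}


# Toy sample: values are illustrative, not measured data.
toy_sample = [
    {"lfq_area": 50.0, "in_surfy": True, "is_glyco": True},
    {"lfq_area": 30.0, "in_surfy": True, "is_glyco": False},
    {"lfq_area": 20.0, "in_surfy": False, "is_glyco": False},
]
print(glyco_area_fractions(toy_sample))
```

      Comparing the `non_glyco` fraction between a glycan-tethered and a glycan-agnostic sample is the test of bias: a glycan bias would show up as a lower non-glyco fraction in the glycan-tethered sample.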

      Figure 2 – Figure Supplement 3: Comparison of enrichment of Glyco- vs Non-Glyco-proteins. (A) TIC area of Uniprot annotated Glycoproteins compared to Non-Glycoproteins in the SURFY database for each labeling method compared to total TIC area. There was no significant difference in the detection of Non-Glycoproteins between WGA-HRP and APEX2-DNA and only a slightly higher detection of Glycoproteins in the WGA-HRP sample over APEX2-DNA.

      "As the mode of tethering WGA-HRP involves GlcNAc and sialic acid glycans, we wanted to determine whether there was a bias towards Uniprot annotated 'Glycoprotein' vs 'Non-Glycoprotein' surface proteins identified across the WGA-HRP, APEX2-DNA, APEX2, and HRP labeling methods. We looked specifically at surface proteins found in the SURFY database, which is the most restrictive surface database and requires that proteins have a predicted transmembrane domain (Bausch-Fluck et al., The in silico human surfaceome. PNAS. 2018). We performed this analysis by measuring the average MS1 intensity across the top three peptides (area) for SURFY glycoproteins and non-glycoproteins for each sample and dividing that by the total LFQ area found across all GOCC annotated membrane proteins detected in each sample. We found similar normalized areas of non-glyco surface proteins across all samples (Figure 2 - Figure supplement 4). If a bias existed towards glycosylated proteins in WGA-HRP compared to the glycan agnostic APEX2-DNA sample, then we would have seen a larger percentage of non-glycosylated surface proteins identified in APEX2-DNA over WGA-HRP. Due to the large labeling radius of the HRP enzyme, we find it unsurprising that the WGA-HRP method is able to capture non-glycosylated proteins on the surface to the same degree (Rees et al. Selective Proteomic Proximity Labeling Assay SPPLAT. Current Protocols in Protein Science. 2015). There is a slight increase in the area percentage of glycoproteins detected in the WGA-HRP compared to the APEX2-DNA sample but this is likely due to the fact that a greater number of surface proteins in general are detected with WGA-HRP."

      As presented the article is thus an interesting technical description, which does not convince the reader of its benefit to use for further proteomic analyses of EVs or cells. Such info is of course interesting to share with other scientists as a sort of "negative" or "neutral" result. Maybe a novelty of the presented work is the differential proteome analysis of surface enriched EV/cell proteins in control versus myc-expressing cells. Such analyses of EVs from different derivatives of a tumor cell line have been performed before, for instance comparing cells with different K-Ras mutations (Demory-Beckler, Mol Cell proteomics 2013 # 23161513). However, here the authors compare also cells and EVs, and find possibly interesting discrepancies in the upregulated proteins. These results could probably be exploited more extensively. For instance, authors could give clearer info (lists) on the proteins differentially regulated in the different comparisons: in EVs from both cells, in EVs vs cells, in both cells.

      We appreciate the reviewer for this critique and have updated the manuscript accordingly. We have changed the title to “Cell surface tethered promiscuous biotinylators enable small-scale comparative surface proteomic analysis of human extracellular vesicles and cells” to more accurately depict the focus of our manuscript which, as the reviewer highlighted, is that this technology allows for comparative analysis between the surfaceomes of cells vs EVs. We appreciate the fine work from the Coffey lab on whole EV analysis of KRAS transformed cells. They identified a mix of surface and cytosolic proteins that change in EVs from the transformed cells, whereas our data focuses specifically on the surfaceome differences in Myc transformed and non-transformed cells and corresponding small EVs. We believe this makes important contributions to the field as well.

      To further address the reviewer’s suggestions, we additionally have significantly reorganized the figures to better display the differentially regulated proteins. We have removed the volcano plots and instead included heatmaps with the top 30 (Figure 3 and Figure 4) and top 50 (Figure 5) differentially regulated proteins across cells and EVs. We have also updated the lists of proteins in the supplemental source tables section. See responses to Reviewer 2 above for the updates to Figures 3-5. We have additionally included supplemental figures with lists of differentially upregulated proteins in the EV and Cell samples, which are shown below:

      Figure 3 – Supplement 3: List of proteins comparing enriched targets (>2-fold) in Myc cells versus Control cells. Targets that were found enriched (Myc/Control) in the Control cells (left) and Myc cells (right). The fold-change between Myc cells and Control cells is listed in the column to the right of the gene name.

      Figure 4 – Supplement 1: List of proteins comparing enriched targets (>1.5-fold) in Myc EVs versus Control EVs. Targets that were found enriched (Myc/Control) in the Control EVs (left) and Myc EVs (right). The fold-change between Myc EVs and Control EVs is listed in the column to the right of the gene name.

      Figure 4 – Figure Supplement 2: Venn diagram comparing enriched targets (>2-fold) in Cells and EVs. (A) Targets that were found enriched in the Control EVs (purple) and Control cells (blue) when each is separately compared to Myc EVs and Myc cells, respectively. The 5 overlapping enriched targets in common between Control cells and Control EVs are listed in the center. (B) Targets that were found enriched in the Myc EVs (purple) and Myc cells (blue) when each is separately compared to Control EVs and Control cells, respectively. The 12 overlapping enriched targets in common between Myc cells and Myc EVs are listed in the center.

      Figure 5 - Supplement 1: List of proteins comparing enriched targets (>2-fold) in Control EVs versus Control cells and Myc EVs versus Myc cells. (A) Targets that were found enriched (EV/cell) in the Control samples are listed. The fold-change values between Control EVs and Control cells are listed in the column to the right of the gene name. (B) Targets that were found enriched (EV/cell) in the Myc samples are listed. The fold-change values between Myc EVs and Myc cells are listed in the column to the right of the gene name.
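
      The enriched-target lists and the Venn overlaps described in these legends amount to a fold-change filter followed by a set intersection. The sketch below is an illustration only: the function names, the placeholder gene labels (`GENE_1` etc.), and the intensity values are hypothetical, and the default threshold of 2.0 simply mirrors the ">2-fold" cutoff stated in the legends.

```python
def enriched(numerator, denominator, threshold=2.0):
    """Return {gene: fold_change} where numerator/denominator exceeds threshold.

    numerator, denominator -- dicts mapping gene name to a positive intensity
    """
    hits = {}
    for gene, num in numerator.items():
        den = denominator.get(gene)
        if den:  # skip genes that are missing or zero in the comparison sample
            fold_change = num / den
            if fold_change > threshold:
                hits[gene] = fold_change
    return hits


def shared_targets(hits_a, hits_b):
    """Genes enriched in both comparisons (the centre of the Venn diagram)."""
    return sorted(set(hits_a) & set(hits_b))


# Placeholder intensities, not measured data.
myc_cells = {"GENE_1": 8.0, "GENE_2": 3.0, "GENE_3": 1.0}
ctrl_cells = {"GENE_1": 2.0, "GENE_2": 2.0, "GENE_3": 1.0}
myc_evs = {"GENE_1": 6.0, "GENE_2": 9.0}
ctrl_evs = {"GENE_1": 2.0, "GENE_2": 3.0}

cell_hits = enriched(myc_cells, ctrl_cells)
ev_hits = enriched(myc_evs, ctrl_evs)
print(shared_targets(cell_hits, ev_hits))
```

      Swapping the numerator and denominator gives the targets enriched in the Control samples, and passing `threshold=1.5` reproduces the looser cutoff used for the EV comparison.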

    1. Author Response:

      Reviewer #1:

      Charpentier et al. use facial recognition technology to show that mothers in a group of mandrills lead their offspring to associate with phenotypically similar offspring. Mandrills are a species of primate that live in large, matrilineal troops, with a single, dominant male that fathers the majority of the offspring. Male breeder turnover and extra-pair mating by females can lead to variation in relatedness between group members and the potential for kin-selected benefits from preferentially cooperating with closer relatives within the group. The authors argue that the strategy of influencing the social network of their offspring could be favoured by "second-order kin selection", a mechanism by which inclusive fitness benefits are accrued to female actors through kin-selected benefits to their offspring. This interpretation is supported by a theoretical model.

      The paper highlights a previously unappreciated mechanism for favouring association between non-kin in social groups and also contributes a nice insight into the complexity of social interactions in a relatively understudied wild primate species. The conclusions are strengthened by data showing associations between mothers were not influenced by the facial similarity of their offspring -- this suggests that mothers are making decisions based on the appearance of offspring and not their mothers.

      Some remaining questions regarding the strength of the authors' interpretation exist: Given the challenges of studying mandrills in the field, the fact that the study reports data from a single group is understandable but potential issues remain with the independence of data points. There may be an additional issue arising from the fact that this troop is semi-captive.

      The study group is not semi-captive. Instead, it originated from two release events of a few captive individuals into the wild (in 2002 and 2006). The population is now composed of more than 250 individuals and all of them, except for 7 founder females (<3%), were born in the wild. In addition, the study group is not fed and only occasionally wanders into a fenced protected area. The fences of the park do not represent a boundary for mandrills and most of the time (ca. 80% of days), the study group ranges outside the park. We have clarified this misunderstanding.

      Regarding the independence of data points, we would be grateful if this reviewer could clarify their concern. As a tentative response, we indeed have access to a single (although large) study group, but that is unfortunately often the case when studying primates or other large mammals. Regarding our study questions, we have clearly demonstrated increased nepotism among paternally related mandrills in two different social groups (Charpentier et al. 2007: semi-captive mandrills; Charpentier et al. 2020: wild mandrills). More generally, we do not see any parsimonious explanation for why the studied mandrills would behave differently from other wild mandrill groups, or would have experienced selective pressures that shaped their genetic structure and social organization differently.

      The number of genotyped offspring is relatively small (n = 15) and paternity is inferred from the identity of the dominant male. However, the authors also refer to the fact that it's normal for female mandrills to mate with several males during ovulation.

      Indeed, both sexes mate promiscuously during the mating season. We have very recently (June 2022) obtained new genetic profiles for a subset of the study infants (it took two years to obtain these data). We have now increased our sample size of infants with a known father, from 15 to 32. With these new data, we were able to distinguish between four categories of infant-infant dyads: those sharing the same father (PHS), those not sharing the same father (not PHS), those conceived during the same alpha male tenure, and those that were not (both infants with unknown dads). The graph below shows the average facial distance among individuals for each of these four categories. It shows that infants conceived during the same alpha male tenure are significantly more similar to each other than infants sired by different fathers or during the tenure of different alpha males, but they are also significantly less similar to each other than infants born to the same father (the four categories are all significantly different from each other, except when comparing infants born to different fathers with those conceived during different alpha male tenures). As suggested by this reviewer, the fact that females mate predominantly with the alpha male, but to some extent also with other males, likely explains the difference between “same father” and “same alpha male tenure”. Importantly, however, considering all infants conceived during the same alpha male tenure as “PHS” is highly conservative. It is thus likely that knowing the paternity of every infant would produce even clearer effects (and indeed, increasing the data set from 15 to 32 strengthened this result). We have now updated this result (first model) based on this new sample.

      What evidence is there to support a beneficial effect of nepotism in this species?

      In mandrills, females who affiliate more (groom more/associate more) with their groupmates (kin or non-kin) during juvenility also reproduce 1 year earlier than those females that are poorly socially integrated (Charpentier et al. 2012). These results are similar to what is known in many mammalian species (see for review Snyder-Mackler et al. 2020). The positive effects of a rich social life are generally triggered by all group members, not only close kin. Nevertheless, if beneficial social relationships impact the direct fitness of individuals, as reported in mandrills and other species, then kin selection theory predicts that these effects should further translate into indirect fitness benefits.

      We have now added this relevant reference (Charpentier et al. 2012) in the revised version of our manuscript and present the results of this early study on mandrills.

      What form could nepotism take and does it necessarily have to involve full sibs?

      We are unsure why this reviewer mentions full-sibs here. For this reviewer's information, of the 2,556 study dyads (model 1 on the impact of maternal and paternal origins on facial distance), only one dyad was a full-sib pair. Full-sibs are therefore very rare in the study population due to male migration patterns and generally short alpha male tenures.

      If a female did not associate with offspring as shown here, would nepotistic interactions simply arise between her offspring and offspring that were less facially similar?

      We guess that facial similarity would not be a predictor of spatial association anymore. Indeed, we think that young mandrills do not use self-referent phenotype matching, precluding the self-evaluation of those infants that look like them. However, as stated below, we cannot fully exclude the possibility that other social partners, such as fathers, may also influence infant-infant relationships, although we think that this alternative mechanism is less parsimonious than the one we propose and test.

      Reviewer #2:

      This paper uses data on patterns of spatial association and facial similarity in mandrills to develop a new hypothesis for the evolution of kin recognition based on facial cues. Previous work on this system has shown that, among females, paternal half-sibs resemble each other visually more than maternal half-sisters do. The authors hypothesise that this paternally inherited facial similarity provides opportunities for kin selection, but it is unclear how offspring themselves could recognise kin using phenotype matching since they are unable to see their own face. One answer to this puzzle is that third parties -- mothers -- may promote social interactions between their own offspring and other offspring that resemble them since these other offspring are likely to share the same father. In support of this hypothesis, the authors find that mothers and offspring show spatial proximity to infants that are facially more similar than average. They also use an analytical evolutionary model to confirm the logic of this hypothesis. The model shows that mothers can gain inclusive fitness benefits by encouraging reciprocal social interaction among their offspring and other paternally-related offspring. They term this idea 'second-order' kin selection and identify a range of other circumstances in which it might play an important role in shaping the evolution of social behaviour.

      The main strengths of the paper are the interesting mandrill data and the cutting-edge methods used to analyse facial similarity, which have stimulated the development of a theoretically interesting hypothesis about the evolution of facially based kin recognition. The theoretical model enhances the generality and rigour of the work. The paper will be of wide interest and the concept of second-order kin selection may be applicable to other social circumstances, such as interactions among in-laws in close-knit family groups. Thus, I can see that this paper will be a stimulus for future work.

      We are grateful for these positive comments.

      The data are, I think, rather overinterpreted in terms of the degree to which they support the hypothesis. The spatial proximity data are interesting, but on their own, they are not definitive support for the hypothesis or model. A more critical approach to the hypothesis, clearly setting out the limitations of the data, and what tests in future could be used to falsify the hypothesis or model, would make for a stronger paper.

      We agree with this general comment and have addressed it by (1) adding a model on grooming relationships between females and infants, (2) toning down our interpretation throughout the manuscript, and (3) proposing future directions of research.

      Overall the authors have presented data that support a fascinating new mechanism by which natural selection can influence social interactions among the members of family groups, in potentially surprising ways. I also find it remarkable that 60 years after the development of the kin selection theory new implications of this theory are still being uncovered. The concept of second-order kin selection may prove important in understanding the evolution of social organisation and behaviour in species that live in groups containing a mixture of kin and non-kin, such as many primates and of course humans.

      We are grateful to this reviewer for this very positive comment. We fully agree that, 60 years after kin selection theory emerged, we are still discovering further implications!

      Reviewer #3:

      This is a very interesting and impressive manuscript. It is complex in its multiple components, and in some ways that makes it a difficult manuscript to evaluate. There is a lot in it, including empirical analyses of a face dataset and of behavioral association data, combined with a theoretical model.

      We are very grateful for this positive comment and are glad that you liked our manuscript.

      The three main findings are: 1) Paternal siblings look alike (similar to, and building on, a recent manuscript the authors published elsewhere); 2) Infants that are more facially similar tend to associate; and 3) mothers tend to be found in association with other unrelated infants that look more like their own infants. Such results are interesting, and indeed one potential interpretation, perhaps even the most likely, is that mothers are behaving in such a way that promotes association between their own infants and the paternal kin of their infants.

      Nonetheless, the evidence provided is logically only consistent with the authors' hypothesis, rather than being strong direct evidence for it. As such, the current framing and indeed the title, "Primate mothers promote proximity between their offspring and infants who look like them", are both problematic. (In addition, the title should be about mandrills, not "primates", since this manuscript does not provide evidence from any other species.) The evidence provided is consistent with the hypothesis, but also consistent with other potential hypotheses. The evidence given to dismiss other potential hypotheses is not strong, and rests on the fact that many males are not around all year to influence things, and that "males that were present during a given reproductive cycle are not responsible for maintaining proximity with either infants or their mothers (MJEC and BRT, pers. obs.)".

      We agree with this comment. After examining several alternative mechanisms in the light of the natural history of mandrills, we are confident that the proposed mechanism is at work in that species, although we cannot firmly exclude some of these alternatives. To address this comment, we have changed the title of our manuscript, which now reads “Mandrill mothers associate with infants who look like their own offspring using phenotype matching”. We have also included an additional model on grooming relationships (see response to R1) and have toned down the interpretation of our results throughout our revised manuscript. Finally, we have further discussed alternative scenarios, in particular the one involving fathers (see details above).

      My opinion is that these are really interesting analyses and data, which are being somewhat undermined by the insistence that only one hypothesis can explain the observed association patterns. It could easily be presented differently, as a demonstration that paternal siblings look alike and that they associate. The authors could then go on to explore different possible explanations for this using their association data, make the case that maternal behavior is the most plausible (but not the only) explanation, and present their model of how such behavior could bring fitness benefits.

      In my view, such a presentation would be both more cautious and more appropriate, without in any way reducing the impact or importance of the data. In the current iteration, I think there are issues because the data do not provide sufficient support for the surety of the title and conclusion, as presented.

      We think that the current organization of our manuscript was not that different from the one proposed here and follows a reasoning already proposed in a former manuscript (Charpentier et al. 2020). Indeed, we first start by reminding the reader what we already know from that previous study: paternal siblings look alike and they associate. We then go on to explore different mechanisms. That being said, and as suggested, we have been more cautious in interpreting our results, which are indeed only correlative.

    1. Author Response

      Reviewer #1 (Public Review):

      In this work George et al. describe RatInABox, a software system for generating surrogate locomotion trajectories and neural data to simulate the effects of a rodent moving about an arena. This work is aimed at researchers that study rodent navigation and its neural machinery.

      Strengths:

      • The software contains several helpful features. It has the ability to import existing movement traces and interpolate data with lower sampling rates. It allows varying the degree to which rodents stay near the walls of the arena. It appears to be able to simulate place cells, grid cells, and some other features.

      • The architecture seems fine and the code is in a language that will be accessible to many labs.

      • There is convincing validation of velocity statistics. There are examples shown of position data, which seem to generally match between data and simulation.

      Weaknesses:

      • There is little analysis of position statistics. I am not sure this is needed, but the software might end up more powerful and the paper higher impact if some position analysis was done. Based on the traces shown, it seems possible that some additional parameters might be needed to simulate position/occupancy traces whose statistics match the data.

      Thank you for this suggestion. We have added a new panel to figure 2 showing a histogram of the time the agent spends at positions of increasing distance from the nearest wall. As you can see, RatInABox is a good fit to the real locomotion data: positions very near the wall are under-explored (in the real data this is probably because whiskers and physical body size block positions very close to the wall) and positions just away from but close to the wall are slightly over explored (an effect known as thigmotaxis, already discussed in the manuscript).

      As you correctly suspected, fitting this warranted a new parameter which controls the strength of the wall repulsion; we call this “wall_repel_strength”. The motion model hasn’t changed mathematically: all we did was take a parameter which was originally a fixed constant of 1, unavailable to the user, and make it a variable which can be changed (see methods section 6.1.3 for maths). The curves fit best when wall_repel_strength ~= 2. Methods and parameters table have been updated accordingly. See Fig. 2e.
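
      The occupancy statistic described above (time spent at increasing distance from the nearest wall) can be sketched as follows. This is a minimal illustration, not the analysis code behind Fig. 2e: it assumes a square arena of side 1 m, position samples taken at a constant timestep (so that the fraction of samples equals the fraction of time), and placeholder uniformly random positions in place of a real trajectory. All function names are hypothetical and are not part of the RatInABox API.

```python
import random


def distance_to_nearest_wall(x, y, side=1.0):
    """Distance from (x, y) to the closest wall of a square arena [0, side]^2."""
    return min(x, side - x, y, side - y)


def wall_distance_histogram(positions, side=1.0, n_bins=10):
    """Fraction of samples falling in each distance-from-wall bin."""
    max_dist = side / 2  # the centre of the square is the farthest point from any wall
    counts = [0] * n_bins
    for x, y in positions:
        d = distance_to_nearest_wall(x, y, side)
        # Clamp the index so that d == max_dist lands in the last bin.
        idx = min(int(d / max_dist * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(positions)
    return [c / total for c in counts]


# Placeholder data (assumption): uniformly random positions stand in for a trajectory.
rng = random.Random(0)
positions = [(rng.random(), rng.random()) for _ in range(10_000)]
hist = wall_distance_histogram(positions)
```

      Even uniformly distributed positions give more samples in the bins nearest the wall, simply because those bins cover more area in a square. A comparison between simulated and real trajectories is therefore best made by overlaying the two histograms rather than reading either one in isolation.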

      • The overall impact of this work is somewhat limited. It is not completely clear how many labs might use this, or have a need for it. The introduction could have provided more specificity about examples of past work that would have been better done with this tool.

      At the point of publication we, like yourself, also didn’t know to what extent there would be a market for this toolkit; however, we were pleased to find that there was. In its initial 11 months RatInABox has accumulated a growing, global user base, over 120 stars on GitHub and north of 17,000 downloads through PyPI. We have accumulated a list of testimonials[5] from users of the package vouching for its utility and ease of use, four of which are abridged below. These testimonials come from a diverse group of 9 researchers spanning 6 countries across 4 continents and varying career stages from pre-doctoral researchers with little computational exposure to tenured PIs. Finally, not only does the community use RatInABox, they are also building it: at the time of writing RatInABox has logged 20 GitHub “Issues” and 28 “pull requests” from external users (i.e. those who aren’t authors on this manuscript) ranging from small discussions and bug-fixes to significant new features, demos and wrappers.

      Abridged testimonials:

      ● “As a medical graduate from Pakistan with little computational background…I found RatInABox to be a great learning and teaching tool, particularly for those who are underprivileged and new to computational neuroscience.” - Muhammad Kaleem, King Edward Medical University, Pakistan

      ● “RatInABox has been critical to the progress of my postdoctoral work. I believe it has the strong potential to become a cornerstone tool for realistic behavioural and neuronal modelling” - Dr. Colleen Gillon, Imperial College London, UK

      ● “As a student studying mathematics at the University of Ghana, I would recommend RatInABox to anyone looking to learn or teach concepts in computational neuroscience.” - Kojo Nketia, University of Ghana, Ghana

      ● “RatInABox has established a new foundation and common space for advances in cognitive mapping research.” - Dr. Quinn Lee, McGill, Canada

      The introduction continues to include the following sentence highlighting examples of past work which relied on generating artificial movement and/or neural data and which, by implication, could have been done better (or at least accelerated and standardised) using our toolbox.

      “Indeed, many past[13, 14, 15] and recent[16, 17, 18, 19, 6, 20, 21] models have relied on artificially generated movement trajectories and neural data.”

      • Presentation: Some discussion of case studies in Introduction might address the above point on impact. It would be useful to have more discussion of how general the software is, and why the current feature set was chosen. For example, how well does RatInABox deal with environments of arbitrary shape? T-mazes? It might help illustrate the tool's generality to move some of the examples in supplementary figure to main text - or just summarize them in a main text figure/panel.

      Thank you for this question. Since the initial submission of this manuscript RatInABox has been upgraded and environments have become substantially more “general”. Environments can now be of arbitrary shape (including T-mazes), boundaries can be curved, they can contain holes and can also contain objects (0-dimensional points which act as visual cues). A few examples are showcased in the updated figure 1 panel e.

      To further illustrate the tool's generality beyond the structure of the environment we continue to summarise the reinforcement learning example (Fig. 3e) and neural decoding example in section 3.1. In addition to this we have added three new panels into figure 3 highlighting new features which, we hope you will agree, make RatInABox significantly more powerful and general and satisfy your suggestion of clarifying utility and generality in the manuscript directly.

      On the topic of generality, we wrote the manuscript in such a way as to demonstrate the rich variety of ways RatInABox can be used without providing an exhaustive list of potential applications. For example, RatInABox can be used to study neural decoding and it can be used to study reinforcement learning, but not because it was purpose built with these use-cases in mind. Rather, it is because it contains a set of core tools designed to support spatial navigation and neural representations in general. For this reason we would rather keep the demonstrative examples as supplements and implement your suggestion of further raising attention to the large array of tutorials and demos provided on the GitHub repository by modifying the final paragraph of section 3.1 to read:

      “Additional tutorials, not described here but available online, demonstrate how RatInABox can be used to model splitter cells, conjunctive grid cells, biologically plausible path integration, successor features, deep actor-critic RL, whisker cells and more. Despite including these examples we stress that they are not exhaustive. RatInABox provides the framework and primitive classes/functions from which highly advanced simulations such as these can be built.”

      Reviewer #3 (Public Review):

      George et al. present a convincing new Python toolbox that allows researchers to generate synthetic behavior and neural data specifically focusing on hippocampal functional cell types (place cells, grid cells, boundary vector cells, head direction cells). This is highly useful for theory-driven research where synthetic benchmarks should be used. Beyond just navigation, it can be highly useful for novel tool development that requires jointly modeling behavior and neural data. The code is well organized and written and it was easy for us to test.

      We have a few constructive points that they might want to consider.

      • Right now the code only supports X,Y movements, but Z is also critical and opens new questions in 3D coding of space (such as grid cells in bats, etc). Many animals effectively navigate in 2D, as a whole, but they certainly make a large number of 3D head movements, and modeling this will become increasingly important and the authors should consider how to support this.

      Agents now have a dedicated head direction variable (before head direction was just assumed to be the normalised velocity vector). By default this just smoothes and normalises the velocity but, in theory, could be accessed and used to model more complex head direction dynamics. This is described in the updated methods section.

      In general, we try to tread a careful line. For example, we embrace certain aspects of physical and biological realism (e.g. modelling environments as continuous, or fitting motion to real behaviour) and avoid others (such as the biophysics/biochemistry of individual neurons, or the mechanical complexities of joint/muscle modelling). It is hard to decide where to draw the line, but we have a few guiding principles:

      1. RatInABox is most well suited for normative modelling and neuroAI-style probing questions at the level of behaviour and representations. We consciously avoid unnecessary complexities that do not directly contribute to these domains.

      2. Compute: To best accelerate research we think the package should remain fast and lightweight. Certain features are ignored if computational cost outweighs their benefit.

      3. Users: If, and as, users require complexities e.g. 3D head movements, we will consider adding them to the code base.

      For now we believe proper 3D motion is out of scope for RatInABox. Calculating motion near walls is already surprisingly complex and to do this in 3D would be challenging. Furthermore, all cell classes would need to be rewritten too. This would be a large undertaking, probably requiring rewriting the package from scratch or making a new package RatInABox3D (BatInABox?) altogether, something which we don’t intend to undertake right now. One option: if users really need 3D trajectory data, they could quite straightforwardly simulate a 2D Environment (X,Y) and a 1D Environment (Z) independently. With this method (X,Y) and (Z) motion would be entirely independent, which is of course unrealistic but, depending on the use case, may well be sufficient.

      Alternatively, as you said that many agents effectively navigate in 2D but show complex 3D head and other body movements, RatInABox could interface with and feed data downstream to other software (for example Mujoco[11]) which specialises in joint/muscle modelling. This would be a very legitimate use-case for RatInABox.

      We’ve flagged all of these assumptions and limitations in a new body of text added to the discussion:

      “Our package is not the first to model neural data[37, 38, 39] or spatial behaviour[40, 41], yet it distinguishes itself by integrating these two aspects within a unified, lightweight framework. The modelling approach employed by RatInABox involves certain assumptions:

      1. It does not engage in the detailed exploration of biophysical[37, 39] or biochemical[38] aspects of neural modelling, nor does it delve into the mechanical intricacies of joint and muscle modelling[40, 41]. While these elements are crucial in specific scenarios, they demand substantial computational resources and become less pertinent in studies focused on higher-level questions about behaviour and neural representations.

      2. A focus of our package is modelling experimental paradigms commonly used to study spatially modulated neural activity and behaviour in rodents. Consequently, environments are currently restricted to being two-dimensional and planar, precluding the exploration of three-dimensional settings. However, in principle, these limitations can be relaxed in the future.

      3. RatInABox avoids the oversimplifications commonly found in discrete modelling, predominant in reinforcement learning[22, 23], which we believe impede its relevance to neuroscience.

      4. Currently, inputs from different sensory modalities, such as vision or olfaction, are not explicitly considered. Instead, sensory input is represented implicitly through efficient allocentric or egocentric representations. If necessary, one could use the RatInABox API in conjunction with a third-party computer graphics engine to circumvent this limitation.

      5. Finally, focus has been given to generating synthetic data from steady-state systems. Hence, by default, agents and neurons do not explicitly include learning, plasticity or adaptation. Nevertheless we have shown that a minimal set of features such as parameterised function-approximator neurons and policy control enable a variety of experience-driven changes in behaviour and cell responses[42, 43] to be modelled within the framework.”

      • What about other environments that are not "Boxes" as in the name - can the environment only be a Box, what about a circular environment? Or Bat flight? This also has implications for the velocity of the agent, etc. What are the parameters for the motion model to simulate a bat, which likely has a higher velocity than a rat?

      Thank you for this question. Since the initial submission of this manuscript RatInABox has been upgraded and environments have become substantially more “general”. Environments can now be of arbitrary shape (including circular), boundaries can be curved, they can contain holes and can also contain objects (0-dimensional points which act as visual cues). A few examples are showcased in the updated figure 1 panel e.

      Whilst we don’t know the exact parameters for bat flight, users could fairly straightforwardly figure these out themselves and set them using the motion parameters as shown in the table below. We would guess that bats have a higher average speed (speed_mean) and a longer decoherence time due to increased inertia (speed_coherence_time), so the following code might roughly simulate a bat flying around in a 10 x 10 m environment. Author response image 1 shows all Agent parameters which can be set to vary the random motion model.

      Author response image 1.

      • Semi-related, the name suggests limitations: why Rat? Why not Agent? (But it's a personal choice)

      We came up with the name “RatInABox” when we developed this software to study hippocampal representations of an artificial rat moving around a closed 2D world (a box). We also fitted the random motion model to open-field exploration data from rats. You’re right that it is not limited to rodents but for better or for worse it’s probably too late for a rebrand!

      • A future extension (or now) could be the ability to interface with common trajectory estimation tools; for example, taking in the (X, Y, (Z), time) outputs of animal pose estimation tools (like DeepLabCut or such) would also allow experimentalists to generate neural synthetic data from other sources of real-behavior.

      This is actually already possible via our “Agent.import_trajectory()” method. Users can pass an array of time stamps and an array of positions into the Agent class, which will be loaded and smoothly interpolated along, as shown here in Fig. 3a or demonstrated in these two new papers[9,10] which used RatInABox by loading in behavioural trajectories.

      • What if a place cell is not encoding place but is influenced by reward or encodes a more abstract concept? Should a PlaceCell class inherit from an AbstractPlaceCell class, which could be used for encoding more conceptual spaces? How could their tool support this?

      In fact PlaceCells already inherit from a more abstract class (Neurons) which contains basic infrastructure for initialisation, saving data, and plotting data etc. We prefer the solution that users can write their own cell classes which inherit from Neurons (or PlaceCells if they wish). Then, users need only write a new get_state() method which can be as simple or as complicated as they like. Here are two examples we’ve already made which can be found on the GitHub:

      Author response image 2.

      Phase precession: PhasePrecessingPlaceCells(PlaceCells)[12] inherit from PlaceCells and modulate their firing rate by multiplying it by a phase dependent factor causing them to “phase precess”.

      Splitter cells: Perhaps users wish to model PlaceCells that are modulated by the recent history of the Agent, for example which arm of a figure-8 maze it just came down. This is observed in hippocampal “splitter cells”. In this demo[1] SplitterCells(PlaceCells) inherit from PlaceCells and modulate their firing rate according to which arm was last travelled along.
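
      The inheritance pattern described above (subclass an existing cell class and override only get_state()) can be illustrated with a self-contained toy. Note that the classes below are simplified stand-ins written for this sketch and are not the RatInABox classes: the real Neurons and PlaceCells classes have different constructors and a richer get_state() signature. The RewardModulatedPlaceCells class, its reward_pos and gain parameters, and the Gaussian tuning are all assumptions made for illustration.

```python
import math


class Neurons:
    """Toy stand-in for a generic base class (NOT the RatInABox class)."""

    def __init__(self, n):
        self.n = n

    def get_state(self, pos):
        raise NotImplementedError("subclasses define how firing rates are computed")


class PlaceCells(Neurons):
    """Toy place cells: Gaussian tuning around fixed centres in 2D."""

    def __init__(self, centres, width=0.2):
        super().__init__(n=len(centres))
        self.centres = centres
        self.width = width

    def get_state(self, pos):
        return [
            math.exp(-((pos[0] - cx) ** 2 + (pos[1] - cy) ** 2) / (2 * self.width**2))
            for cx, cy in self.centres
        ]


class RewardModulatedPlaceCells(PlaceCells):
    """Hypothetical subclass: scales place-cell rates up near a reward site."""

    def __init__(self, centres, reward_pos, gain=2.0, width=0.2):
        super().__init__(centres, width)
        self.reward_pos = reward_pos
        self.gain = gain

    def get_state(self, pos):
        base = super().get_state(pos)  # reuse the parent's spatial tuning
        d2 = (pos[0] - self.reward_pos[0]) ** 2 + (pos[1] - self.reward_pos[1]) ** 2
        proximity = math.exp(-d2 / (2 * self.width**2))  # 1 at the reward, ~0 far away
        factor = 1 + (self.gain - 1) * proximity
        return [factor * rate for rate in base]


cells = RewardModulatedPlaceCells([(0.5, 0.5), (0.2, 0.8)], reward_pos=(0.5, 0.5))
rates_at_reward = cells.get_state((0.5, 0.5))
```

      Only get_state() differs between parent and child here; everything else is inherited, which is the design choice the response argues for.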

      • This is a bit odd in the Discussion: "If there is a small contribution you would like to make, please open a pull request. If there is a larger contribution you are considering, please contact the corresponding author3" This should be left to the repo contribution guide, which ideally shows people how to contribute and your expectations (code formatting guide, how to use git, etc). Also this can be very off-putting to new contributors: what is small? What is big? We suggest using more inclusive language.

      We’ve removed this line and left it to the GitHub repository to describe how contributions can be made.

      • Could you expand on the run time for BoundaryVectorCells, namely, for how long of an exploration period? We found it was on the order of 1 min to simulate 30 min of exploration (which is of course fast, but mentioning relative times would be useful).

      Absolutely. How long it takes to simulate BoundaryVectorCells will depend on the discretisation timestep and how many neurons you simulate. Assuming you used the default values (dt = 0.1, n = 10) then the motion model should dominate compute time. This is evident from our analysis in Figure 3f which shows that the update time for n = 100 BVCs is on par with the update time for the random motion model, therefore for only n = 10 BVCs, the motion model should dominate compute time.

      So how long should this take? Fig. 3f shows the motion model takes ~10^-3 s per update. One hour of simulation at dt = 0.1 s is 3600/dt = 36,000 updates, which would therefore take about 36,000 × 10^-3 s = 36 seconds. So your estimate of 1 minute seems to be in the right ballpark and consistent with the data we show in the paper.
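
      The back-of-envelope estimate above can be reproduced in a few lines. This sketch assumes that the motion model dominates compute time, that each update costs roughly 10^-3 s (the approximate value read off Fig. 3f), and that dt = 0.1 s; the function name is illustrative only.

```python
def estimated_runtime_s(sim_duration_s, dt, time_per_update_s):
    """Wall-clock estimate: number of updates times the cost of one update."""
    n_updates = round(sim_duration_s / dt)
    return n_updates * time_per_update_s


# One hour of simulated time at dt = 0.1 s and ~1e-3 s per motion-model update.
one_hour_s = estimated_runtime_s(sim_duration_s=3600, dt=0.1, time_per_update_s=1e-3)
print(f"Estimated wall-clock time: {one_hour_s:.0f} s")
```

      Because the estimate is linear in both the simulated duration and 1/dt, halving dt or doubling the duration doubles the predicted run time.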

      Interestingly this corroborates the results in a new inset panel where we calculated the total time for cell and motion model updates for a PlaceCell population of increasing size (from n = 10 to 1,000,000 cells). It shows that the motion model dominates compute time up to approximately n = 1000 PlaceCells (for BoundaryVectorCells it’s probably closer to n = 100) beyond which cell updates dominate and the time scales linearly.

      These are useful and non-trivial insights as they tell us that the RatInABox neuron models are quite efficient relative to the RatInABox random motion model (something we hope to optimise further down the line). We’ve added the following sentence to the results:

      “Our testing (Fig. 3f, inset) reveals that the combined time for updating the motion model and a population of PlaceCells scales sublinearly O(1) for small populations n < 1000 where updating the random motion model dominates compute time, and linearly for large populations n > 1000. PlaceCells, BoundaryVectorCells and the Agent motion model update times will be additionally affected by the number of walls/barriers in the Environment. 1D simulations are significantly quicker than 2D simulations due to the reduced computational load of the 1D geometry.”

      And this sentence to section 2:

      “RatInABox is fundamentally continuous in space and time. Position and velocity are never discretised but are instead stored as continuous values and used to determine cell activity online, as exploration occurs. This differs from other models which are either discrete (e.g. “gridworld” or Markov decision processes) or approximate continuous rate maps using a cached list of rates precalculated on a discretised grid of locations. Modelling time and space continuously more accurately reflects real-world physics, making simulations smooth and amenable to fast or dynamic neural processes which are not well accommodated by discretised motion simulators. Despite this, RatInABox is still fast; to simulate 100 PlaceCells for 10 minutes of random 2D motion (dt = 0.1 s) it takes about 2 seconds on a consumer grade CPU laptop (or 7 seconds for BoundaryVectorCells).”

      Whilst this would be very interesting it would likely represent quite a significant edit, requiring rewriting of almost all the geometry-handling code. We’re happy to consider changes like these according to (i) how simple they will be to implement, (ii) how disruptive they will be to the existing API, (iii) how many users would benefit from the change. If many users of the package request this we will consider ways to support it.

      • In general, the set of default parameters might want to be included in the main text (vs in the supplement).

      We also considered this but decided to leave them in the methods for now. The exact values of these parameters are subject to change in future versions of the software. Also, we’d prefer for the main text to provide a low-detail, high-level description of the software and the methods to provide a place for keen readers to dive into the mathematical and coding specifics.

      • It still says you can only simulate 4 velocity or head directions, which might be limiting.

      Thanks for catching this. This constraint has been relaxed. Users can now simulate an arbitrary number of head direction cells with arbitrary tuning directions and tuning widths. The methods have been adjusted to reflect this (see section 6.3.4).

      • The code license should be mentioned in the Methods.

      We have added the following section to the methods:

      6.6 License. RatInABox is currently distributed under an MIT License, meaning users are permitted to use, copy, modify, merge, publish, distribute, sublicense and sell copies of the software.

    1. Author Response:

      Reviewer #1:

      The largest concern with the manuscript is its use of resting-state recordings in Parkinson's Disease patients on and off levodopa, which the authors interpret as indicative of changes in dopamine levels in the brain but not indicative of altered movement and other neural functions. For example, when patients are off medication, their UPDRS scores are elevated, indicating they likely have spontaneous movements or motor abnormalities that will likely produce changed activations in MEG and LFP during "rest". Authors must address whether it is possible to study a true "resting state" in unmedicated patients with severe PD. At minimum this concern must be discussed in the manuscript.

      We agree that Parkinson’s disease can lead to unwanted movements such as tremor as well as hyperkinesias. This would of course be a deviation from a resting state in healthy subjects. However, such movements are part of the disease and occur involuntarily. The main tremor in Parkinson’s disease is a rest tremor and, as the name already suggests, it occurs while not doing anything. Therefore, such movements can arguably be considered part of the resting state of Parkinson’s disease. Resting state activity with and without medication is therefore still representative of changes in brain activity in Parkinson’s patients and indicative of alterations due to medication.

      To further investigate the effect of movement in our patients, we subdivided the UPDRS part 3 score into tremor and non-tremor subscores. For the tremor subscore we took the mean of items 15 and 17 of the UPDRS, whereas for the non-tremor subscore items 1, 2, 3, 9, 10, 12, 13, and 14 were averaged. Following Spiegel et al., 2007, we classified patients as akinetic-rigid (non-tremor score at least twice the tremor score), tremor-dominant (tremor score at least twice as large as the non-tremor score), and mixed type (for the remaining scores). Of the 17 patients, 1 was tremor-dominant and 1 was classified as mixed type (his/her non-tremor score was greater than the tremor score). None of our patients exhibited hyperkinesias during the recording. To rule out the possibility that our results are driven by tremor-related movement, we re-ran the HMM without the tremor-dominant and the mixed-type patient (see Figure R1 of the response letter).
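
      The subtype rule described above can be written out explicitly. This is a minimal sketch of the classification as stated in this response, not the authors' analysis code: item scores are assumed to be supplied as a dictionary keyed by the item numbers quoted above, the example scores are invented, and the handling of the case where both subscores are zero (returned here as akinetic-rigid because that test comes first) is an arbitrary assumption.

```python
from statistics import mean

TREMOR_ITEMS = (15, 17)
NON_TREMOR_ITEMS = (1, 2, 3, 9, 10, 12, 13, 14)


def classify_subtype(item_scores):
    """Classify a patient from per-item scores using the two-fold ratio rule."""
    tremor = mean(item_scores[i] for i in TREMOR_ITEMS)
    non_tremor = mean(item_scores[i] for i in NON_TREMOR_ITEMS)
    if non_tremor >= 2 * tremor:
        return "akinetic-rigid"
    if tremor >= 2 * non_tremor:
        return "tremor-dominant"
    return "mixed"


# Invented example scores (assumption): non-tremor items score 2, tremor items score 0.
example = {i: 2 for i in NON_TREMOR_ITEMS}
example.update({i: 0 for i in TREMOR_ITEMS})
subtype = classify_subtype(example)
```

      Under this rule a patient whose non-tremor subscore exceeds the tremor subscore by less than a factor of two is still labelled mixed, which matches the description of the mixed-type patient above.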

      ON medication results for all HMM states remained the same. OFF medication results for the Ctx-Ctx and STN-STN state remained the same as well. The Ctx-STN state OFF medication was split into two states: sensorimotor-STN connectivity was captured in one state and all other types of Ctx-STN connections were captured in another state (see Figure R1 of the response letter). The important point is that the biological conclusions stand across these solutions. Regardless, both with and without the two subjects a stable covariance matrix entailing sensorimotor-STN connectivity was determined, which is the main finding for the Ctx-STN state OFF medication.

      We therefore discuss this issue now within the limitation section (page 20):

      “Both motor impairment and motor improvement can cause movement during the resting state in PD. While such movement is a deviation from a resting state in healthy subjects, such movements are part of the disease and occur unwillingly. Therefore, such movements can arguably be considered part of the resting state of Parkinson’s disease. None of the patients in our cohort experienced hyperkinesia during the recording. All patients except for two were of the akinetic-rigid subtype. We verified that tremor movement is not driving our results. Recalculating the HMM states without these 2 subjects, even though it slightly changed some particular aspects of the HMM solution, did not materially affect the conclusions.”

      Figure R1: States obtained after removing one tremor dominant and one mixed type patient from analysis. Panel C shows the split OFF medication cortico-STN state. Most of the cortico-STN connectivity is captured by the state shown in the top row (Figure 1 C OFF). Only the motor-STN connectivity in the alpha and beta band (along with a medial frontal-STN connection in the alpha band) is captured separately by the states labeled “OFF Split” (Figure 1 C OFF SPLIT).

      This reviewer was unclear on why increased "communication" in the medial OFC in delta and theta was interpreted as a pathological state indicating deteriorated frontal executive function. Given that the authors provide no evidence of poor executive function in the patients studied, the authors must at least provide evidence from other studies linking this feature with impaired executive function.

      If we understand the comment correctly it refers to the statement in the abstract “Dopaminergic medication led to communication within the medial and orbitofrontal cortex in the delta/theta frequency range. This is in line with deteriorated frontal executive functioning as a side effect of dopamine treatment in Parkinson’s disease”

      This statement is based on the dopamine overdose hypothesis reported in the Parkinson’s disease (PD) literature (Cools 2001; Kelly et al. 2009; MacDonald and Monchi 2011; Vaillancourt et al. 2013). We have elaborated upon the dopamine overdose hypothesis in the discussion on page 16. In short, dopaminergic neurons are primarily lost from the substantia nigra in PD, which causes a higher dopamine depletion in the dorsal striatal circuitry than within the ventral striatal circuits (Kelly et al. 2009; MacDonald and Monchi 2011). Thus, dopaminergic medication to treat the PD motor symptoms leads to increased dopamine levels in the ventral striatal circuits including frontal cortical activity, which can potentially explain the cognitive deficits observed in PD (Shohamy et al. 2005; George et al. 2013). We adjusted the abstract to read:

      “Dopaminergic medication led to coherence within the medial and orbitofrontal cortex in the delta/theta frequency range. This is in line with known side effects of dopamine treatment such as deteriorated executive functions in Parkinson’s disease.”

      In this article, authors repeatedly state their method allows them to delineate between pathological and physiological connectivity, but they don't explain how dynamical systems and discrete-state stochasticity support that goal.

      To recapitulate, the HMM divides a continuous time series into discrete states. Each state is a time-delay embedded covariance matrix reflecting the underlying connectivity between brain regions as well as the specific temporal dynamics in the data when such state is active. See Packard et al., (1980) for details about how a time-delay embedding characterises a linear dynamical system.
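
      To make the notion of a time-delay embedded covariance matrix concrete, the following is a minimal sketch using NumPy. It assumes data arranged as samples by channels; the function names and the number of lags are hypothetical, and it is not the toolbox implementation used for the analysis.

```python
import numpy as np


def time_delay_embed(x, n_lags):
    """Stack lagged copies of every channel side by side.

    x      : array of shape (n_samples, n_channels)
    n_lags : number of consecutive samples per embedded row (hypothetical value)

    Returns an array of shape (n_samples - n_lags + 1, n_channels * n_lags),
    where row t holds x[t], x[t + 1], ..., x[t + n_lags - 1].
    """
    n_samples, _ = x.shape
    n_rows = n_samples - n_lags + 1
    lagged = [x[lag:lag + n_rows, :] for lag in range(n_lags)]
    return np.hstack(lagged)


def state_covariance(embedded, active):
    """Covariance of the embedded data over the samples where a state is active.

    active : boolean mask with one entry per embedded row.
    """
    return np.cov(embedded[active], rowvar=False)
```

      Because each embedded row contains several consecutive samples from every channel, the resulting covariance matrix carries both cross-channel relationships and lagged (auto- and cross-covariance) structure, which is what allows a single matrix to describe spectral content as well as connectivity.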

      Please note that the HMM was used as a data-driven, descriptive approach without explicitly assuming any a-priori relationship with pathological or physiological states. The relation between biology and the HMM states, thus, purely emerged from the data; i.e. is empirical. What we claim in this work is simply that the features captured by the HMM hold some relation with the physiology even though the estimation of the HMM was completely unsupervised (i.e. blind to the studied conditions). We have added this point also to the limitations of the study on page 19 and the following to the introduction to guide the reader more intuitively (page 4):

      “To allow the system to dynamically evolve, we use time delay embedding. Theoretically, delay embedding can reveal the state space of the underlying dynamical system (Packard et al., 1980). Thus, by delay-embedding PD time series OFF and ON medication we uncover the differential effects of a neurotransmitter such as dopamine on underlying whole brain connectivity.”

      Reviewer #2:

      Sharma et al. investigated the effect of dopaminergic medication on brain networks in patients with Parkinson's disease combining local field potential recordings from the subthalamic nucleus and magnetoencephalography during rest. They aim to characterize both physiological and pathological spectral connectivity.

      They identified three networks, or brain states, that are differentially affected by medication. Under medication, the first state (termed hyperdopaminergic state) is characterized by increased connectivity of frontal areas, supposedly responsible for deteriorated frontal executive function as a side effect of medical treatment. In the second state (communication state), dopaminergic treatment largely disrupts cortico-STN connectivity, leaving only selected pathways communicating. This is in line with current models that propose that alleviation of motor symptoms relates to the disruption of pathological pathways. The local state, characterized by STN-STN oscillatory activities, is less affected by dopaminergic treatment.

      The authors utilize sophisticated methods with the potential to uncover the dynamics of activities within different brain networks, which opens the avenue to investigate how the brain switches between different states, and how these states are characterized in terms of spectral, local, and temporal properties. The conclusions of this paper are mostly well supported by data, but some aspects, mainly about the presentation of the results, remain:

      We would like to thank the reviewer for his succinct and clear understanding of our work.

      1) The presentation of the results is suboptimal and needs improvement to increase readers' comprehension. At some points this section seems rather unstructured, some results are presented multiple times, and some passages already include points rather suitable for the discussion, which adds too much information for the results section.

      We have removed repetitions in the results sections and removed the rather lengthy introductory parts of each subsection. Moreover, we have now moved all parts, which were already an interpretation of our findings to the discussion.

      2) It is intriguing that the hyperdopaminergic state is not only identified under medication but also in the off-state. This is intriguing, especially with the results on the temporal properties of states showing that the time of the hyperdopaminergic state is unaffected by medication. When such a state can be identified even in the absence of levodopa, is it really optimal to call it "hyperdopaminergic"? Do the results not rather suggest that the identified network is active both off and on medication, while during the latter state its activities are modulated in a way that could relate to side effects?

      The reviewer’s interpretations of the results pertaining to the hyper-dopaminergic state are correct. The states had been named post-hoc as explained in the results section. The hyper-dopaminergic state’s name derived from it showing the overdosing effects of dopamine. Of course, these results are only visible on medication. But off medication, this state also exists without exhibiting the effects of excess dopamine. To avoid confusion or misinterpretation of the findings and also following the relevant comment by reviewer 1, we renamed all states to be more descriptive:

      Hyperdopaminergic > Cortico-cortical state

      Communication > Cortico-STN state

      Local > STN-STN state.

      3) Some conclusions need to be improved/more elaborated. For example, the coherence of bilateral STN-STN did not change between the medication off and on states. Yet it is argued that a) "Since synchrony limits information transfer (Cruz et al. 2009; Cagnan, Duff, and Brown 2015; Holt et al. 2019), local oscillations are a potential mechanism to prevent excessive communication with the cortex" (line 436) and b) "Another possibility is that a loss of cortical afferents causes local basal ganglia oscillations to become more pronounced" (line 438). Can these conclusions really be drawn if the local oscillations did not change in the first place?

      We apologize for the unclear description. Our conclusion was based on the following results:

      a) We state that STN-STN connectivity as measured by the magnitude of STN-STN coherence does not change OFF vs ON medication in the Cortico-STN state. This result is obtained using inter-medication analysis.

      b) But ON medication, STN-STN coherence in the Cortico-STN state was significantly different from mean coherence within the ON condition. These results are obtained using intra-medication analysis.

      Based on this, we conclude that in the Cortico-STN state, although OFF vs ON medication the magnitude of STN-STN coherence was unchanged, the STN-STN coherence was significantly different from mean coherence in the ON medication condition. The emergence of synchronous STN-STN activity may limit information exchange between STN and cortex ON medication.

      An alternative explanation for these findings might be a mechanism preventing connectivity between cortex and the STN ON medication. This missing interaction between STN and cortex might cause STN-STN oscillations to increase compared to the mean coherence within the ON state. Unfortunately, we cannot test such causal influences with our analysis.

      We have added the following discussion to the manuscript on page 17 in order to improve the exposition:

      “Bilateral STN–STN coherence in the alpha and beta band did not change in the cortico-STN state ON versus OFF medication (InterMed analysis). However, STN-STN coherence was significantly higher than the mean level ON medication (IntraMed analysis). Since synchrony limits information transfer (Cruz et al. 2009; Cagnan, Duff, and Brown 2015; Holt et al. 2019), the high coherence within the STN ON medication could prevent communication with the cortex. A different explanation would be that a loss of cortical afferents leads to increased local STN coherence. The causal nature of the cortico-basal ganglia interaction is an endeavour for future research.”

      Reviewer #3:

      In PD, pathological neuronal activity along the cortico-basal ganglia network notably consists in the emergence of abnormal synchronized oscillatory activity. Nevertheless, synchronous oscillatory activity is not necessarily pathological and also serves crucial cognitive functions in the brain. Moreover, the effects of dopaminergic medication on oscillatory network connectivity occurring in PD are still poorly understood. To clarify these issues, Sharma and colleagues simultaneously recorded MEG-STN LFP signals in PD patients and characterized the effect of dopamine (ON and OFF dopaminergic medication) on oscillatory whole-brain networks (including the STN) in a time-resolved manner. Here, they identified three physiologically interpretable spectral connectivity patterns and found that cortico-cortical, cortico-STN, and STN-STN networks were differentially modulated by dopaminergic medication.

      Strengths:

      1) Both the methodological and experimental approaches used are thoughtful and rigorous.

      a) The use of an innovative data-driven machine learning approach (by employing a hidden Markov model), rather than hand-crafted analyses, to identify physiologically interpretable spectral connectivity patterns (i.e., distinct networks/states) is undeniably an added value. In doing so, the results are not biased by human expertise and subjectivity, which makes them even more solid.

      b) So far, the recurrent oscillatory patterns of transient network connectivity within and between the cortex and the STN reported in PD have been assessed only for specific cortico-STN spectral connectivity. Conversely, whole-brain MEG studies in PD patients did not account for cortico-STN and STN-STN connectivity. Here, the authors studied, for the first time, the whole-brain connectivity including the STN (whole brain-STN approach) and therefore provide new evidence of the brain connectivity reported in PD, as well as new information regarding the effect of dopaminergic medication on the recurrent oscillatory patterns of transient network connectivity within and between the cortex and the STN reported in PD.

      2) Studying the temporal properties of the recurrent oscillatory patterns of transient network connectivity both ON and OFF medication is extremely important and provides interesting and crucial information in order to delineate pathological versus physiologically relevant spectral brain connectivity in PD.

      We would like to thank the reviewer for their valuable feedback and correct interpretation of our manuscript.

      Weaknesses:

      1) In this study, the authors implied that the ON dopaminergic medication state corresponds to a physiological state. However, as correctly mentioned in the limitations of the study, they did not have (for obvious reasons) a control/healthy group. Moreover, no one can exclude the emergence of compensatory and/or plasticity mechanisms in the brain of the PD patients related to the duration of the disease and/or the history of the chronic dopamine-replacement therapy (DRT). Duration of the disease and DRT history should therefore be considered when characterizing the recurrent oscillatory patterns of transient network connectivity within and between the cortex and the STN reported in PD, as well as when examining the effect of the dopaminergic medication on the functioning of these specific networks.

      We would like to thank the reviewer for pointing this out. We regressed duration of disease (year of measurement – year of onset) on the temporal properties of the HMM states. We found no relationship between any of the temporal properties and disease duration. Similarly, we regressed levodopa equivalent dosage for each subject on the temporal properties and found no relationship. We now discuss this point in the manuscript (page 20):
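
      As an illustration of this kind of check, the sketch below regresses one temporal property on disease duration (year of measurement minus year of onset, as defined above). The response does not state how significance was assessed, so the permutation p-value, the function name, and the default parameters are assumptions chosen only to keep the example self-contained; they should not be read as the procedure actually used.

```python
import numpy as np


def regress_with_permutation(x, y, n_perm=2000, seed=0):
    """Least-squares line of y on x plus a two-sided permutation p-value.

    x : predictor per subject, e.g. disease duration in years
    y : one temporal property of an HMM state per subject
    The permutation test is an illustrative choice, not the authors' method.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
    r_obs = np.corrcoef(x, y)[0, 1]          # observed Pearson correlation

    # Null distribution: shuffle y to break any relationship with x.
    rng = np.random.default_rng(seed)
    r_null = np.array(
        [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)]
    )
    p_value = (np.sum(np.abs(r_null) >= abs(r_obs)) + 1) / (n_perm + 1)
    return slope, intercept, r_obs, p_value
```

      The same function could be called with levodopa equivalent dosage as the predictor; a large p-value for every temporal property would correspond to the "no relationship" outcome reported above.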

      “A further potential influencing factor might be the disease duration and the amount of dopamine patients are receiving. Both factors were not significantly related to the temporal properties of the states.”

      2) Here, the authors recorded LFP activity in the STN. LFP represents sub-threshold (e.g., synaptic input) activity at best (Buzsaki et al., 2012; Logothetis, 2003). Recent studies demonstrated that mono-polar, but also bi-polar, BG LFPs are largely contaminated by volume conductance of cortical electroencephalogram (EEG) activity even when re-referenced (Lalla et al., 2017; Marmor et al., 2017). Therefore, it is likely that STN LFPs do not accurately reflect local cellular activity. In this study, the authors examined and measured coherence between cortical areas and STN. However, they cannot guarantee that STN signals were not contaminated by volume conducted signals from the cortex.

      We appreciate this concern and thank the reviewer for bringing it up. Marmor et al. (2017) investigated this in humans, and their study is therefore the one most closely related to our research. They find that re-referenced STN recordings are not contaminated by cortical signals. Furthermore, the data in Lalla et al. (2017) are based on recordings in rats, making a direct transfer to human STN recordings problematic due to the different brain sizes. Since we re-referenced our LFP signals as recommended in the Marmor paper, we think that contamination due to cortical signals is relatively minor; see Litvak et al. (2011), Hirschmann et al. (2013), and Neumann et al. (2016) for additional references supporting this. That being said, we now discuss this potential issue in the paper on page 20.

      “Lastly, we recorded LFPs from within the STN –an established recording procedure during the implantation of DBS electrodes in various neurological and psychiatric diseases. Although for Parkinson patients results on beta and tremor activity within the STN have been reproduced by different groups (Reck et al. 2010, Litvak et al. 2011, Florin et al. 2013, Hirschmann et al. 2013, Neumann et al. 2016), it is still not fully clear whether these LFP signals are contaminated by volume-conducted cortical activity. However, while volume conduction seems to be a larger problem in rodents even after re-referencing the LFP signal (Lalla et al. 2017), the same was not found in humans (Marmor et al. 2017).”

      3) The methods and data processing are rigorous but also very sophisticated, which makes the interpretation of the results in terms of oscillatory activity and neural synchronization difficult.

      To aid intuition on how to interpret the result in light of the methods used, one can compare the analysis pipeline to a windowing approach. In a more standard approach, windows of different time length can be defined for different epochs within the time series and for each window coherence and connectivity can be determined. The difference in our approach is that we used an unsupervised learning algorithm to select windows of varying length based on recurring patterns of whole brain network activity. Within those defined windows we then determine the oscillatory properties via coherence and power – which is the same as one would do in a classical analysis. We have added an explanation of the concept of “oscillatory activity” within our framework to the introduction (page 2 footnote):

      “For the purpose of our paper, we refer to oscillatory activity or oscillations as recurrent, but transient frequency–specific patterns of network activity, even though the underlying patterns can be composed of either sustained rhythmic activity, neural bursting, or both (Quinn et al. 2019).”

      Moreover, we provide a more intuitive explanation of the analysis within the first section of the results (page 4):

      “Using an HMM, we identified recurrent patterns of transient network connectivity between the cortex and the STN, which we henceforth refer to as an ‘HMM state’. In comparison to classic sliding-window analysis, an HMM solution can be thought of as a data-driven estimation of time windows of variable length (within which a particular HMM state was active): once we know the time windows when a particular state is active, we compute coherence between different pairs of regions for each of these recurrent states.”
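
      The windowing analogy can be sketched as follows: given a decoded state sequence with one state label per sample, the variable-length windows of a state are simply its contiguous runs. The function name and the input format are hypothetical, and the sketch stops at extracting the windows; in practice, coherence and power would then be estimated from the data falling inside them.

```python
import numpy as np


def state_windows(state_path, state):
    """Return (start, stop) sample indices of each visit to `state`.

    state_path : sequence with one state label per sample (hypothetical format)
    stop is exclusive, so stop - start is the visit length in samples.
    """
    active = np.asarray(state_path) == state
    # Pad with False so that visits touching either end are still detected.
    padded = np.concatenate(([False], active, [False]))
    # Indices where activity switches on or off; they alternate start, stop.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[::2].tolist(), changes[1::2].tolist()))
```

      Unlike a sliding-window analysis, the window boundaries here are not fixed in advance: they follow from when the model infers a state to be active, so window lengths differ from visit to visit.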

      4) Previous studies have shown that abnormal oscillations within the STN of PD patients are limited to its dorsolateral/motor region, thus dividing the STN into a dorsolateral oscillatory/motor region and ventromedial non-oscillatory/non-motor region (Kuhn et al. 2005; Moran et al. 2008; Zaidel et al. 2009, 2010; Seifreid et al. 2012; Lourens et al. 2013, Deffains et al., 2014). However, the authors do not provide clear information about the location of the LFP recordings within the STN.

      We selected the electrode contacts based on intraoperative microelectrode recordings (for details, see page 23). The first directional recording height after the entry into the STN was selected to obtain the three directional LFP recordings from the respective hemisphere. This practice has been proven to improve target location (Kochanski et al., 2019; Krauss et al., 2021). The common target area for DBS surgery is the dorsolateral STN. To confirm that the electrodes were actually located within this part of the STN, we have now reconstructed the DBS location with Lead-DBS (Horn et al. 2019). All electrodes – except for one – were located within the dorsolateral STN (see figure 7 of the manuscript). To exclude that our results were driven by this outlier, we reanalysed our data without this patient. No change in the overall connectivity pattern was observed (see figure R3 of the response letter).

      Figure R2: Lead-DBS reconstruction of the location of electrodes in the STN for different subjects. The red electrodes have not been placed properly in the STN. The contacts marked in red represent the directional contacts from which the data was used for analysis.

      Figure R3: HMM states obtained after running the analysis without the subject with the electrode outside the STN.

      References:

      Buzsáki G, Anastassiou CA, Koch C. The origin of extracellular fields and currents-EEG, ECoG, LFP and spikes. Nat Rev Neurosci 2012; 13: 407–20.

      Cagnan H, Duff EP, Brown P. The relative phases of basal ganglia activities dynamically shape effective connectivity in Parkinson’s disease. Brain 2015; 138: 1667–78.

      Cools R. Enhanced or impaired cognitive function in Parkinson’s disease as a function of dopaminergic medication and task demands. Cereb Cortex 2001; 11: 1136–43.

      Cruz AV, Mallet N, Magill PJ, Brown P, Averbeck BB. Effects of dopamine depletion on network entropy in the external globus pallidus. J Neurophysiol 2009; 102: 1092–102.

      Florin E, Erasmi R, Reck C, Maarouf M, Schnitzler A, Fink GR, et al. Does increased gamma activity in patients suffering from Parkinson’s disease counteract the movement inhibiting beta activity? Neuroscience 2013; 237: 42–50.

      George JS, Strunk J, Mak-Mccully R, Houser M, Poizner H, Aron AR. Dopaminergic therapy in Parkinson’s disease decreases cortical beta band coherence in the resting state and increases cortical beta band power during executive control. NeuroImage Clin 2013; 3: 261–70.

      Hirschmann J, Özkurt TE, Butz M, Homburger M, Elben S, Hartmann CJ, et al. Differential modulation of STN-cortical and cortico-muscular coherence by movement and levodopa in Parkinson’s disease. Neuroimage 2013; 68: 203–13.

      Holt AB, Kormann E, Gulberti A, Pötter-Nerger M, McNamara CG, Cagnan H, et al. Phase-dependent suppression of beta oscillations in Parkinson’s disease patients. J Neurosci 2019; 39: 1119–34.

      Horn A, Li N, Dembek TA, Kappel A, Boulay C, Ewert S, et al. Lead-DBS v2: Towards a comprehensive pipeline for deep brain stimulation imaging. Neuroimage 2019; 184: 293–316.

      Kelly C, De Zubicaray G, Di Martino A, Copland DA, Reiss PT, Klein DF, et al. L-dopa modulates functional connectivity in striatal cognitive and motor networks: A double-blind placebo-controlled study. J Neurosci 2009; 29: 7364–78.

      Kochanski RB, Bus S, Brahimaj B, Borghei A, Kraimer KL, Keppetipola KM, et al. The impact of microelectrode recording on lead location in deep brain stimulation for the treatment of movement disorders. World Neurosurg 2019; 132: e487–95.

      Krauss P, Oertel MF, Baumann-Vogel H, Imbach L, Baumann CR, Sarnthein J, et al. Intraoperative neurophysiologic assessment in deep brain stimulation surgery and its impact on lead placement. J Neurol Surgery, Part A Cent Eur Neurosurg 2021; 82: 18–26.

      Lalla L, Rueda Orozco PE, Jurado-Parras MT, Brovelli A, Robbe D. Local or not local: Investigating the nature of striatal theta oscillations in behaving rats. eNeuro 2017; 4: 128–45.

      Litvak V, Jha A, Eusebio A, Oostenveld R, Foltynie T, Limousin P, et al. Resting oscillatory cortico-subthalamic connectivity in patients with Parkinson’s disease. Brain 2011; 134: 359–74.

      MacDonald PA, MacDonald AA, Seergobin KN, Tamjeedi R, Ganjavi H, Provost JS, et al. The effect of dopamine therapy on ventral and dorsal striatum-mediated cognition in Parkinson’s disease: Support from functional MRI. Brain 2011; 134: 1447–63.

      MacDonald PA, Monchi O. Differential effects of dopaminergic therapies on dorsal and ventral striatum in Parkinson’s disease: Implications for cognitive function. Parkinsons Dis 2011; 2011: 1–18.

      Marmor O, Valsky D, Joshua M, Bick AS, Arkadir D, Tamir I, et al. Local vs. volume conductance activity of field potentials in the human subthalamic nucleus. J Neurophysiol 2017; 117: 2140–51.

      Neumann WJ, Degen K, Schneider GH, Brücke C, Huebl J, Brown P, et al. Subthalamic synchronized oscillatory activity correlates with motor impairment in patients with Parkinson’s disease. Mov Disord 2016; 31: 1748–51.

      Packard NH, Crutchfield JP, Farmer JD, Shaw RS. Geometry from a time series. Phys Rev Lett 1980; 45: 712–6.

      Quinn AJ, van Ede F, Brookes MJ, Heideman SG, Nowak M, Seedat ZA, et al. Unpacking Transient Event Dynamics in Electrophysiological Power Spectra. Brain Topogr 2019; 32: 1020–34.

      Reck C, Himmel M, Florin E, Maarouf M, Sturm V, Wojtecki L, et al. Coherence analysis of local field potentials in the subthalamic nucleus: Differences in parkinsonian rest and postural tremor. Eur J Neurosci 2010; 32: 1202–14.

      Shohamy D, Myers CE, Grossman S, Sage J, Gluck MA. The role of dopamine in cognitive sequence learning: Evidence from Parkinson’s disease. Behav Brain Res 2005; 156: 191–9.

      Spiegel J, Hellwig D, Samnick S, Jost W, Möllers MO, Fassbender K, et al. Striatal FP-CIT uptake differs in the subtypes of early Parkinson’s disease. J Neural Transm 2007; 114: 331–5.

      Vaillancourt DE, Schonfeld D, Kwak Y, Bohnen NI, Seidler R. Dopamine overdose hypothesis: Evidence and clinical implications. Mov Disord 2013; 28: 1920–9.

    1. Author Response

      Reviewer #1 (Public Review)

      [...] One potential issue is that the high myelination signal is associated with the compartment in V2 (pale stripes) which was not functionally defined itself but by the absence of specific functional activations. No difference was reported between those stripes that were defined functionally. Other explanations for the differential pattern of qMRI signals, e.g. ROI distribution for presumed pale stripes is not evenly distributed (more foveal), ROIs with low activations due to some other factor show higher myelin-related signals, cannot be excluded based on the analysis presented.

      Indeed, it would have been advantageous to directly functionally delineate pale stripes in V2. Since we were not able to achieve this by fMRI, we needed an indirect method to infer pale stripe contributions in the analysis. We also added a statement in the discussion section to emphasize this more (p. 9, lines 286–288).

      Furthermore, different myelination between thin and thick stripes was not tested, since we did not have a concrete hypothesis on this. Despite the conflicting findings of stronger myelination in dark or pale CO stripes in the literature, no histological study stated myelination differences between dark CO thin and thick stripes. Therefore, our primary interest and hypothesis lay in comparing the different myelination of thin/thick and pale stripes using MRI.

      Thank you very much for this comment about potential other sources of differential qMRI parameter patterns. Indeed, based on the original analysis we could not exclude that the absence of functional activation around the foveal representation may have biased our analysis. We therefore added a supporting analysis, in which we excluded the region around the foveal representation from the analysis. The excluded cortical region was kept consistent between participants by excluding the same eccentricity range in all maps. We added more details in the results section of the revised manuscript (p. 8, lines 189–202). In Figure 5-Supplement 1 and Figure 5-Supplement 3, results from this supporting analysis are shown which reproduced the primary findings from the main analysis, particularly the relatively higher myelination of pale stripes.

      ROI definitions solely based on fMRI activation amplitude have additional limitations. However, we find it unlikely that a small fMRI effect size and low contrast-to-noise ratio (i.e. a stochastic cause of low statistical parameter values/“activation”) have impacted the results, since Figure 3 shows that we could achieve a high degree of reproducibility for each participant.

      We would note that the fact that we found consistent differences across MPM and MP2RAGE sessions makes some potential artifacts driving the differences unlikely. We also find it unlikely that systematic cerebral blood volume differences between stripes would have driven the results. A higher local blood volume would lead to increased BOLD responses but also to a higher R1 value due to the deoxy-hemoglobin induced relaxation, which is opposite to the observation of higher activity in the thick/thin stripes but lower R1 values.

      Further studies using other functional metrics (e.g. VASO, ASL etc.) may help us to even more clearly demonstrate specificity but were out of the scope of this already rather extensive study. Although we have added extensive further analyses in the revised manuscript such as controlling for foveal effects or registration performance, we did not see a possibility to fully exclude a systematic bias that might potentially be caused by unknown factors.

      Another theoretical and practical issue is the question of "ground truth" for the non-invasive qMRI measures, as the authors - as their starting point - roundly dismiss direct histological tissue studies as conflicting, rather than take a critical look at the merit of the conflicting study results and provide a best hypothesis. If so, they need to explain better how they calibrate their non-invasive MR measurements of myelin.

      We agree and have now further elaborated on the limits of specificity of the R1 and R2* signal as cortical myelin marker (p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260). However, we still think that it is important for the reader to appreciate the conflicting results in histological studies using staining methods for myelin, which adds to the study’s background.

      We did not intend to give the impression that MRI provides the missing ground-truth to adjudicate histological controversies, but that it provides an alternative and additional view on the open questions. We changed the introduction to better reflect the aspect that the study offers a unique view by providing myelination proxies and functional measures in the same individual, which allows for direct comparison and investigation of structure-function relationships (see p. 2, lines 68–70; p. 3, lines 93–95), which is not accessible to any other approach. Nevertheless, we would like to note that R1 has been well established as a myelin marker under particular conditions (Kirilina et al., 2020; Mancini et al., 2020; Lazari and Lipp, 2021). It has also been widely used for cortical myelin mapping across a variety of populations, systems and field strengths. We added this statement to the introduction (see p. 2, lines 82-85). We note that we excluded volunteers with pathologies or neurological disorders from the study and their mean age was about 28 years. Thus, we had conditions comparable to previous (validation) studies.

      Because of the contradictory findings of histological studies, we could not further finesse the hypothesis beyond our previous a priori hypothesis that we expected differences in the myelin sensitive MRI metrics between the thin/thick versus pale stripes. To improve the contextual understanding, we added a paragraph in the discussion section covering in more depth how the MRI results relate to known histological findings (see pp. 8–9, lines 216–240).

      While this paper makes an important contribution to the question of the association of specific myelination patterns defining the columnar architecture in V2, it is not entirely clear whether the authors can fully resolve it with the data presented.

      Indeed, we agree that non-invasive aggregate measures, such as the R1 metrics, offer limited specificity which precludes a fully conclusive inference about cortical myelination. We have further emphasized this on several occasions in the text (see p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260). Since the correspondence of cortical myelin levels and R1 (and other metrics) is an active area of research, we expect that the understanding, sensitivity and specificity of R1 to cortical myelination will further improve. We note that the use of qMRI is a substantial advance over weighted MRI typically used, which suffers from lack of specificity due to instrumental idiosyncrasies and varying measurement conditions.

      Reviewer #2 (Public Review)

      [...] Unfortunately, this particular study seems to fall into an unhappy middle ground in terms of the conclusions that can be drawn: the relaxometry measures lack the specificity to be considered "ground truth", while the authors claim that the literature lacks consensus regarding the structures that are being studied. The authors propose that their results resolve whether or not stripes differ in their patterns of myelination, but R1 lacks the specificity to do this. While myelin is a primary driver of relaxation times in cortex, relaxometry cannot be considered to be specific to myelin. It is possible that the small observed changes in R1 are driven by myelin, but they could also reflect other tissue constituents, particularly given the small observed effect sizes. If the literature was clear on the pattern of myelination across stripes, this study could confirm that R1 measurements are sensitive to and consistent with this pattern. But the authors present the work as resolving the question of how myelination differs between stripes, which over-reaches what is possible with this method. As it stands, the measured differences in R1 between functionally-defined cortical regions are interesting, but require further validation (e.g., using invasive myelin staining).

      We agree that we inadvertently overstated the specificity of R1 on several occasions in the text. We therefore toned down the statements concerning the correspondence between R1 and myelin throughout the manuscript (e.g., see p. 2, lines 68–88; p. 6, line 163; p. 8, line 216; p. 9, lines 257–260).

      We also removed the phrase that gave the impression that MRI can conclusively resolve the conflicting results found in histological studies. In the Introduction, we changed the corresponding paragraph to emphasize the complementary view that MRI offers: the possibility to investigate structure-function relationships in the living human brain, which is not possible with invasive myelin staining (see p. 2, lines 68–70; p. 3, lines 93–95).

      We acknowledge that, perhaps aside from electron microscopy, all common markers have shortcomings that limit their specificity. For example, classic histology is not quantitative and has produced conflicting results. It also faces the fundamental issue that the composition of myelin varies significantly across the brain and within brain areas (e.g., its lipid composition (González de San Román et al., 2018)). Thus, we regard the different invasive/non-invasive measures as complementary. R1 adds to this arsenal of measures and can be acquired non-invasively. It has been shown to be a reliable myelin marker under certain circumstances. It follows the known myeloarchitecture patterns of the human brain, which was also checked for the data of the present study (see Figure 4 and Appendix 2). It is responsive to traumatic changes (Freund et al., 2019), development (Whitaker et al., 2016; Carey et al., 2018; Natu et al., 2019) and plasticity (Lazari et al., 2022). Since we studied healthy volunteers with no known pathologies who were sampled randomly from the population, we believe that the previous results generally apply and suggest sufficient specificity of the R1 marker. Of course, we cannot fully exclude bias due to unknown factors that have not yet been investigated or discovered by validation studies. However, in this case we expect that the systematic differences between stripe types would remain an important result, most likely pointing to another interesting biological difference between stripes.

      While more research is needed to clarify the precise relationship between R1 and cortical myelin, we think that the meaningful determination of quantitative MR parameters within one cortical area is still interesting for the neuroscientific community.

      Moreover, the results make clear that R1 differences are not sufficiently strong to provide an independent measure of this structure (e.g., for segmentation of stripe). As such, one would still require fMRI to localise stripes, making it unclear what role R1 measures would play in future studies.

      Indeed, the small effect sizes observed in the present study still require a functional localization with fMRI. We expected small effect sizes using R1 and R2* due to the known small inter-areal or intra-cortical differences of MRI myelin markers. Therefore, this study was designed as a proof of concept, investigating whether intra-areal R1 differences at the spatial scale of columnar structures can be detected using non-invasive MRI. Our study shows that these differences can be seen, but currently not at the single-voxel level. We anticipate that with further improvements in sequence development and scanner hardware, high-resolution R1 estimates with sufficient SNR can be acquired, making fMRI redundant for this kind of investigation. Please see the reply to the next comment concerning the impact of using R1 in future studies.

      The Introduction concludes with the statement that "Whereas recent studies have explored cortical myelination ... using non-quantitative, weighted MR images... we showed for the first time myelination differences using MRI on a quantitative basis". As written, this sentence implies that others have demonstrated that simpler non-quantitative imaging can achieve the same aims as qMRI. Simply showing that a given method is able to achieve an aim would not be sufficient: the authors should demonstrate that this constitutes an important advance.

      Thank you for this comment. It goes to the heart of the concerns raised about the specificity and sensitivity of MRI-based myelin metrics. We elaborate here on the main advantage of using qMRI in our current study and why it is more specific than weighted MR imaging. However, we emphasize that a thorough comparison between qMRI and weighted MRI is highly complex and beyond the scope of our paper; we refer to our recent review paper on qMRI for further details (Weiskopf et al., 2021). The signal in weighted MRI, even when optimally tuned to the tissue of interest, additionally depends on inhomogeneities in both the RF transmit and receive (bias) fields. Other methods, such as using a ratio image (T1w/T2w), can cancel out the receive field bias entirely (provided there is no subject movement between scans) but not the transmit field bias. This hampers the direct analysis and interpretation of signal differences between distant regions of the brain. For high-resolution imaging applications, the use of high magnetic fields such as 7 T is beneficial or even mandatory due to signal-to-noise ratio (SNR) penalties. With increasing field strength, these inhomogeneities also affect small regions such as V2. For these cases, qMRI is advantageous since it provides metrics that are free from these technical biases, significantly improving the specificity. As high-field MRI has the potential to non-invasively study the structure and function of the human brain at the spatial scale of cortical layers and cortical columns, we believe that the results of our current study, which successfully demonstrate the applicability of qMRI to robustly detect small differences at the level of columnar systems, are relevant for future studies in the field of neuroscience.

      We emphasized these considerations in the revised manuscript (see p. 9, lines 273–285).

      The study includes a very small number of participants (n=4). The advantage of non-invasive in-vivo measurements, despite the fact that they are indirect measures, should be that one can study a reasonable number of subjects. So this low n seems to undermine that point. I rarely suggest additional data collection, but I do feel that a few more subjects would shore up the study's impact.

      The present study was conducted in line with a deep phenotyping approach. That is, we focused on acquiring highly reliable datasets on individuals. We did not intend to capture the population variance, which is often the goal of group studies, since low-level and basic features such as stripes in V2 are expected to be present in all healthy individuals. Thus, we prioritized test-retest measurements for fMRI sessions and an alternative MP2RAGE acquisition over a larger number of individuals. This resulted in 6–7 scanning sessions on different days for each individual, summing up to 26 long scanning sessions in total. We also note that our sample size is not smaller than in other studies with a similar research question. For example, another fMRI study investigating V2 stripes in humans used the same sample size of n=4 (Dumoulin et al., 2017).

      The paper overstates what can be concluded in a number of places. For example, the paper suggests that R1 and R2 are highly-specific to myelin in a number of places. For example, on p7 the text reads: "We tested whether different stripe types are differentially myelinated by comparing R1 and R2..." Relaxation times lack the specificity to definitively attribute these changes purely to myelin. Similarly, on p11: "Our study showed that pale stripes which exhibit lower oxidative metabolic activity according to staining with CO are stronger myelinated than surrounding gray matter in V2." This implies that the study directly links CO staining to myelination. In addition to using non-specific estimates of myelination, the study does not actually measure CO.

      We agree that we did not clearly point out the limitations of R1 myelin mapping. Therefore, we toned down the statements about the connection between cortical myelin and R1. The mentioned statements in the reviewer’s comment were changed accordingly (see p. 6, line 163; p. 11, lines 353–354). We also included a small paragraph to clarify the used terminology (color-selective thin stripes, disparity-selective thick stripes) in the manuscript (see p. 4, lines 110–114) to avoid the inadvertent conflation of CO staining and actually measured brain activity.

      I'm confused by the analysis in Figure 5. I can appreciate why the authors are keen to present a "tripartite" analysis (thick, thin, and pale stripes). But I find the gray curves confusing. As I understand it, the gray curves as generated include both the stripe of interest (red or blue plots) and the pale stripes. Why not just generate a three-way classification? Generating these plots in effect has already required hard classification of thin and thick stripes, so it is odd to create the gray plots, which mix two types of stripes. Alternatively, could you explicitly model the partial volume for a given cortical location (e.g., under the assumption that partial volume of thick and thin stripes is indicated by the z-score) for the corresponding functional contrast? One could then estimate the relaxation times as a simple weighted sum of stripe-wise R1 or R2.

      Figure on weighted average of stripe-wise R1 and R2. (a) shows the weighted sum of R1 (de-meaned and de-curved) over all V2 voxels. z-scores from color-selective thin stripe experiments and disparity-selective thick stripe experiments were used as weights in the left and middle group of bars, respectively. An intermediate threshold of zmax=1.96 was used, i.e., final weights were defined as weights = (z-1.96). Weights with z<0 were set to 0. For pale stripes (right group of bars), we used the maximum z-score value from thin and thick stripe measurements. We then set all weights with z≥1.96 to 0 and used the inverse as final weights, i.e., weights = -1 * (max(z)-1.96). (b) shows the same analysis for R2. Error bars indicate 1 standard error of the mean.

      (1) Yes, indeed. We agree that modeling the partial volume of each compartment (thin, thick and pale stripes) in each V2 voxel would be the most elegant approach. However, we note that z-scores from thin and thick stripe experiments may not reflect the voxel-wise partial volume effect, since they are a purely statistical measure and not a partial volume model. Having said this, we think that this general approach can give some additional insights, and we provide results for a similar analysis here. We calculated the weighted sum of R1 and R2 values over all V2 voxels for each stripe compartment (thin, thick and pale stripes) independently (see above figure). For R1, we see the same pattern between stripe types as in the manuscript (Figure 5). Additionally, we show the differences here for each subject, which further demonstrates the reproducibility across subjects in our study. For R2, no clear pattern across subjects emerged, confirming the results in our manuscript. Since this analysis did not add relevant new information, we refrained from adding this figure to the manuscript in order not to overload it.
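For illustration, the weighting scheme described in the figure legend above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the analysis code used for the study: the function names are hypothetical, we assume that negative weights are clipped to zero, and we assume that the weighted sum is normalized by the sum of the weights (i.e., a weighted average).

```python
import numpy as np

Z_THRESHOLD = 1.96  # intermediate threshold quoted in the figure legend


def stripe_weights(z):
    """Weights for thin or thick stripes: (z - 1.96), with negative values set to 0.

    Assumption: "weights with z<0 were set to 0" is read as clipping
    negative weights after subtracting the threshold.
    """
    z = np.asarray(z, dtype=float)
    return np.clip(z - Z_THRESHOLD, 0.0, None)


def pale_weights(z_thin, z_thick):
    """Weights for pale stripes: -1 * (max(z) - 1.96), set to 0 where max(z) >= 1.96."""
    z_max = np.maximum(np.asarray(z_thin, dtype=float),
                       np.asarray(z_thick, dtype=float))
    return np.clip(Z_THRESHOLD - z_max, 0.0, None)


def weighted_mean(values, weights):
    """Weighted average of per-voxel values (e.g., de-meaned R1).

    Assumption: normalization by the sum of weights; returns NaN if
    no voxel receives a positive weight.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        return float("nan")
    return float(np.sum(weights * values) / total)
```

With per-voxel arrays `r1`, `z_thin` and `z_thick` for V2 (hypothetical names), the three bars for one subject would correspond to `weighted_mean(r1, stripe_weights(z_thin))`, `weighted_mean(r1, stripe_weights(z_thick))` and `weighted_mean(r1, pale_weights(z_thin, z_thick))`.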

      (2) In our current study, we were not primarily interested in investigating differences between thin and thick stripes. While histological analyses found differences (though not consistent) between CO dark stripes (more myelinated; Tootell et al., 1983) and CO pale stripes (more myelinated; Krubitzer and Kaas, 1989), no study reported myelin differences between CO dark stripes. This does not fully exclude the possibility of myelination differences but suggests that if myelination differences between CO dark stripes existed, they would presumably be smaller than differences between CO dark and CO pale stripes. Thus, they would be even more difficult to demonstrate than the hypothesis of this manuscript.

      Therefore, we decided to directly test two compartments against each other instead of modeling all three compartments within a single model. In our analysis, we thereby loosely followed the analysis methods described in Li et al. (2019), who compared myelin differences between thin/thick and pale stripes in macaques. We note that this demonstrates further consistency, since it is not trivial that both thick and thin stripes show lower R1 values than the pale stripes. For example, there could have been no differences, or differences in the opposite direction.

      (3) Just for clarification, the plots in Figure 5 show the comparison of R1 (or R2*) between two compartments in V2. The red (blue) curve includes the thin (thick) stripe of interest. The gray curve includes everything in V2 minus contributions from thick (thin) stripes. If we take the thin stripe comparison as an example (Figure 5a), then red contains the thin stripes of interest while gray contains everything minus the thick stripes. Therefore, assuming a tripartite stripe arrangement, the gray curve contains both thin and pale stripe contributions.

      References

      Carey D, Caprini F, Allen M, Lutti A, Weiskopf N, Rees G, Callaghan MF, Dick F. Quantitative MRI provides markers of intra-, inter-regional, and age-related differences in young adult cortical microstructure. Neuroimage 2018; 182:429–440.

      Dumoulin SO, Harvey BM, Fracasso A, Zuiderbaan W, Luijten PR, Wandell BA, Petridou N. In vivo evidence of functional and anatomical stripe-based subdivisions in human V2 and V3. Sci Rep 2017; 7:733.

      Freund P, Seif M, Weiskopf N, Friston K, Fehlings MG, Thompson AJ, Curt A. MRI in traumatic spinal cord injury: from clinical assessment to neuroimaging biomarkers. Lancet Neurol 2019; 18:1123–1135.

      González de San Román E, Bidmon H-J, Malisic M, Susnea I, Küppers A, Hübbers R, Wree A, Nischwitz V, Amunts K, Huesgen PF. Molecular composition of the human primary visual cortex profiled by multimodal mass spectrometry imaging. Brain Struct Funct 2018; 223:2767–2783.

      Kirilina E, Helbling S, Morawski M, Pine K, Reimann K, Jankuhn S, Dinse J, Deistung A, Reichenbach JR, Trampel R, Geyer S, Müller L, Jakubowski N, Arendt T, Bazin P-L, Weiskopf N. Superficial white matter imaging: Contrast mechanisms and whole-brain in vivo mapping. Sci Adv 2020; 6:eaaz9281.

      Krubitzer LA, Kaas JH. Cortical integration of parallel pathways in the visual system of primates. Brain Res 1989; 478:161–165.

      Lazari A, Lipp I. Can MRI measure myelin? Systematic review, qualitative assessment, and meta-analysis of studies validating microstructural imaging with myelin histology. Neuroimage 2021; 230:117744.

      Lazari A, Salvan P, Cottaar M, Papp D, Rushworth MFS, Johansen-Berg H. Hebbian activity-dependent plasticity in white matter. Cell Rep 2022; 39:110951.

      Li X, Zhu Q, Janssens T, Arsenault JT, Vanduffel W. In Vivo Identification of Thick, Thin, and Pale Stripes of Macaque Area V2 Using Submillimeter Resolution (f)MRI at 3 T. Cereb Cortex 2019; 29:544–560.

      Mancini M, Karakuzu A, Cohen-Adad J, Cercignani M, Nichols TE, Stikov N. An interactive meta-analysis of MRI biomarkers of myelin. Elife 2020; 9:e61523.

      Natu VS, Gomez J, Barnett M, Jeska B, Kirilina E, Jaeger C, Zhen Z, Cox S, Weiner KS, Weiskopf N, Grill-Spector K. Apparent thinning of human visual cortex during childhood is associated with myelination. PNAS 2019; 116:20750–20759.

      Tootell RBH, Silverman MS, De Valois RL, Jacobs GH. Functional Organization of the Second Cortical Visual Area in Primates. Science 1983; 220:737–739.

      Weiskopf N, Edwards LJ, Helms G, Mohammadi S, Kirilina E. Quantitative magnetic resonance imaging of brain anatomy and in vivo histology. Nat Rev Phys 2021; 3:570–588.

      Whitaker KJ, Vértes PE, Romero-Garcia R, Váša F, Moutoussis M, Prabhu G, Weiskopf N, Callaghan MF, Wagstyl K, Rittman T, Tait R, Ooi C, Suckling J, Inkster B, Fonagy P, Dolan RJ, Jones PB, Goodyer IM, NSPN Consortium, Bullmore ET. Adolescence is associated with genomically patterned consolidation of the hubs of the human brain connectome. PNAS 2016; 113:9105–9110.

    1. Author Response

      Reviewer #1 (Public Review):

      Huang et al. sought to study the cellular origin of Tuft cells and the molecular mechanisms that govern their specification in severe lung injury. First the authors show ectopic emergence of Tuft cells in airways and distal parenchyma following different injuries. The authors also used lineage tracing models and uncovered that p63-expressing cells and to some extent Scgb1a1-lineaged labeled cells contribute to tuft cells after injury. Further, the authors modulated multiple pathways and claim that Notch inhibition blocks tuft cells whereas Wnt inhibition enhances Tuft cell development in basal cell cultures. Finally, the authors used Trpm5 and Pou2f3 knock-out models to claim that tuft cells are indispensable for alveolar regeneration.

      In summary, the findings described in this manuscript are somewhat preliminary. The claim that the cellular origin of Tuft cells in influenza infection was not determined is incorrect. Current data from pathway modulation is preliminary and this requires genetic modulation to support their claims.

      We thank the reviewer for the comments and we have performed extensive experiments to address the reviewer’s comments. In the revised manuscript we provide additional data including genetic modulation findings to support our model.

      Major comments:

      1) The abstract sounds incomplete and does not cover all key aspects of this manuscript. Currently, it is mainly focusing on the cellular origin of Tuft cells and the role of Wnt and notch signaling. However, it completely omits the findings from Trpm5 and Pou2f3 knock-out mice. In fact, the title of the manuscript highlights the indispensable nature of tuft cells in alveolar regeneration.

      We have modified the abstract and title accordingly.

      2) In lines 93-94, the authors state that "It is also unknown what cells generate these tuft cells.....". This statement is incorrect. Rane et al., 2019 used the same p63-creER mouse line and demonstrated that all tuft cells that ectopically emerge following H1N1 infection originate from p63+ lineage labeled basal cells. Therefore, this claim is not new.

      We thank the reviewer for the comment. Although Rane et al. reported that p63-expressing lineage-negative epithelial stem/progenitor cells (LNEPs) could contribute to the ectopic tuft cells after PR8 virus infection, it is still not clear whether the p63+ cells give rise to tuft cells directly or through EBCs. Thus, unlike Rane et al. (2019), who performed TMX injection before viral infection, we performed TMX injection after PR8 infection to show that the ectopic tuft cells are derived from EBCs, as shown in revised Figure 2.

      3) Lines 152-153 state that "21.0% +/- 2.0 % tuft cells within EBCs are labeled with tdT when examined at 30 dpi...". It is not clear what the authors meant here ("within EBC's")? And also, the same sentence states that "......suggesting that club cell-derived EBCs generate a portion of tuft cells....". In this experiment, the authors used club cell lineage tracing mouse lines. So, how do the authors know that the club cell lineage-derived tuft cells came through intermediate EBC population? Current data do not show evidence for this claim. Is it possible that club cells can directly generate tuft cells?

      We apologize for the confusion and have revised the text accordingly. Here, “within EBCs” means within the “pods” area where p63+ basal cells are ectopically present. The sentence is revised as “21.0% +/- 2.0 % tuft cells that are ectopically present in the parenchyma are labeled by tdT. Notably, these lineage labeled tuft cells were co-localized with EBCs.” We do not know whether the club cell lineage-derived tuft cells transit through intermediate EBCs, which is why we used “suggesting”. It is also possible that club cells can directly generate tuft cells. To avoid confusion, we deleted the sentence.

      4) Based on the data from Fig-3A, the authors claim that treatment with C59 significantly enhances tuft cell development in ALI cultures. Porcupine is known to facilitate Wnt secretion. So, which cells are producing Wnt in these cultures? It is important to determine which cells are producing Wnt and also which Wnt? Further, based on DBZ treatments, it appears that active Notch signaling is necessary to induce Tuft cell fate in basal cells. Where are Notch ligands expressed in these tissues? Is Notch active only in a small subset of basal cells (and hence generate rare tuft cells)? This is one of the key findings in this manuscript. Therefore, it is important to determine the expression pattern of Wnt and Notch pathway components.

      We thank the reviewer for these interesting questions and agree on the importance of identifying the specific ligands and receptors for the relevant Wnt and Notch signaling during tuft cell derivation. That being said, we think the topic is beyond the scope of this study, which is focused on the role of tuft cells in alveolar regeneration. The point is well taken and we will investigate the topic in a future study.

      5) How do the authors explain different phenotypes observed in Trpm5 knockout and Pou2f3 mutants? Is it possible that Trpm5 knockout mice have a subset of tuft cells and that they might be something to do with the phenotypic discrepancy between two mutant models?

      Again, we thank the reviewer for the interesting question. As discussed in the Discussion section, Trpm5 is also reported to be expressed in B lymphocytes (Sakaguchi et al., 2020). It is possible that loss of Trpm5 modulates the inflammatory responses following viral infection, which may contribute to improved alveolar regeneration. However, it is also possible that Trpm5-/- mice retain a subset of tuft cells that facilitate lung regeneration, as suggested by the reviewer.

      6) One of the key findings in this manuscript is that Wnt and Notch signaling play a role in Tuft cell specification. All current experiments are based on pharmacological modulation. These need to be substantiated using genetic gain loss of function models.

      We have performed the genetic studies.

      Reviewer #2 (Public Review):

      In this manuscript, the authors describe the ectopic differentiation of tuft cells that were derived from lineage-tagged p63+ cells post influenza virus infection. These tuft cells do not appear to proliferate or give rise to other lineages. They then claim that Wnt inhibitors increase the number of tuft cells while inhibiting Notch signaling decreases the number of tuft cells within Krt5+ pods after infection in vitro and in vivo. The authors further show that genetic deletion of Trpm5 in p63+ cells post-infection results in an increase in AT2 and AT1 cells in p63 lineage-tagged cells compared to control. Lastly, they demonstrate that depletion of tuft cells caused by genetic deletion of Pou2f3 in p63+ cells has no effect on the expansion or resolution of Krt5+ pods after infection, implying that tuft cells play no functional role in this process.

      Overall, in vivo and in vitro phenotypes of tuft cells and alveolar cells are clear, but the lack of detailed cellular characterization and molecular mechanisms underlying the cellular events limits the value of this study.

      We thank the reviewer for the comments and for acknowledging that our findings are clear. In the revised manuscript we provide more detailed characterization and genetic evidence to elucidate the role of tuft cells in lung regeneration.

      1) Origin of tuft cells: Although the authors showed the emergence of ectopic tuft cells derived from labelled p63+ cells after infection, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitors, as previously reported, can also contribute to tuft cell expansion (Rane et al. 2019; by labelling p63+ cells prior to infection, they showed that the majority of ectopic tuft cells are derived from p63+ cells after viral infection). It would be more informative if the authors show the differentiation of tuft cells derived from p63+Krt5+ cells by tracing Krt5+ cells after infection, which will tell us whether ectopic tuft cells are differentiated from ectopic basal cells within Krt5+ pods induced by virus infection.

      We thank the reviewer for the helpful suggestion. We have performed the experiment accordingly.

      2) Mechanisms of tuft cell differentiation: The authors tried to determine which signaling pathways regulate the differentiation of tuft cells from p63+ cells following infection. Although Wnt/Notch inhibitors affected the number of tuft cells derived from p63+ labelled cells, it remains unclear whether these signals directly modulate differentiation fate. The authors claimed that Wnt inhibition promotes tuft cell differentiation from ectopic basal cells. However, in Fig 3B, Wnt inhibition appears to trigger the expansion of p63+Krt5+ pod cells, resulting in increased tuft cell differentiation rather than directly enhancing tuft cell differentiation. Further, in Fig 3D, Notch inhibition appears to reduce p63+Krt5+ pod cells, resulting in decreased tuft cell differentiation. Importantly, a previous study has reported that Notch signalling is critical for Krt5+ pod expansion following influenza infection (Vaughan et al. 2015; Xi et al. 2017). Notch inhibition reduced Krt5+ pod expansion and induced their differentiation into Sftpc+ AT2 cells. In order to address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs, the authors should provide a more detailed characterization of cellular composition (Krt5+ basal cells, club cells, ciliated cells, AT2 and AT1 cells, etc.) and activity (proliferation) within the pods with/without inhibitors/activators.

      Again we thank the reviewer for the insightful suggestions. We agree that it will be interesting to further address the direct effect of Wnt/Notch signaling in the differentiation process of tuft cells from EBCs. In this revised manuscript we added new findings of EBC differentiation into tuft cells in mice with genetic deletion of Rbpjk.

      3) Impact of Trpm5 deletion in p63+ cells: It is interesting that Trpm5 deletion promotes the expansion of AT2 and AT1 cells derived from labelled p63+ cells following infection. It would be informative to check whether Trpm5 regulates Hif1a and/or Notch activity which has been reported to induce AT2 differentiation from ectopic basal cells (Xi et al. 2017). Although the authors stated that there was no discernible reduction in the size of Krt5+ pods in mutant mice, it would be interesting to investigate the relationship between AT2/AT1 cell retaining pods and the severity of injury (e.g. large Krt5+ pods retain more/less AT2/AT1 cells compared to small pods. What about other cell types, such as club and goblet cells, in Trpm5 mutant pods? Again, it cannot be ruled out that pre-existing p63+Krt5- intrapulmonary progenitor cells can directly convert into AT2/AT1 cells upon Trpm5 deletion rather than p63+Krt5+ cells induced by infection.

      We thank the reviewer for the comments and suggestions. Our new data using the KRT5-CreER mouse line confirmed that pod cells (Krt5+) do not contribute to AT2/AT1 cells, consistent with previous studies (Kanegai et al., 2016; Vaughan et al., 2015). Our data also show that p63-CreER lineage labeled AT2/AT1 cells are separated from the pod cell area, suggesting that pod cells and these AT2/AT1 cells are generated from different cells of origin. We also checked Notch activity in pod cells in Trpm5-/- mice; some pod cell-derived cells are Hes1 positive, whereas others are Hes1 negative (RLFigure 1). As indicated in the Discussion, we think that the AT2/AT1 cells are possibly derived from pre-existing AT2 cells that transiently express p63 after PR8 infection. It will be interesting to test whether Trpm5 regulates Hif1a in this population (p63+,Krt5-), which we plan to do next.

      RLFigure 1. Representative area staining in Trpm5-/- mice at 30 dpi. Area 1: Notch signaling is active (Hes1+, arrows) in pod cells following viral infection. Area 2: pod cells exhibit reduced Notch activities. Note few Hes1+ cells in pods (arrows). Scale bar: 50 µm.

      4) Ectopic tuft cells in COVID-19 lungs: The previous study by the authors' group revealed the presence of ectopic tuft cells in COVID-19 patient samples (Melms et al. 2021). There appears to be no additional information in this manuscript.

      In Melms et al. (2021), we showed tuft cell expansion in COVID-19 lungs but not the potential origin of tuft cells. In this manuscript we show some cells co-expressing POU2F3 and KRT5, suggesting a pod-to-tuft cell differentiation.

      5) Quantification information and method: Overall, the quantification method should be clarified throughout the manuscript. Further, in the method section, the authors stated that the production of various airway epithelial cell types was counted and quantified on at least 5 "random" fields of view. However, virus infection causes spatially heterogeneous injury, resulting in a difficult to measure "blind test". The authors should address how they dealt with this issue.

      We clarified the quantification method as suggested. For the in vitro cell culture assays on the signaling pathways, we took pictures from at least five random fields of view for quantification. For lung sections, we tile-scanned sections including at least three lung lobes and performed quantification on the whole scans.

      Reviewer #3 (Public Review):

      In this manuscript Huang et al. study how the lung regenerates after severe injury due to viral infection. They focus on how tuft cells may affect regeneration of the lung by ectopic basal cells and come to the conclusion that they are not required. The manuscript is intriguing but also very puzzling. The authors claim they are specifically targeting ectopic basal progenitor cells and show that they can regenerate the alveolar epithelium in the lung following severe injury. However, it is not clear that the p63-CreERT2 line the authors are using only labels ectopic basal cells. The question is what is a basal cell? Is an ectopic basal progenitor cell only defined by Trp63 expression?

      The accompanying manuscript by Barr et al. uses a Krt5-CreERT2 line to target ectopic basal cells and using that tool the authors do not see a significant contribution of ectopic basal cells towards alveolar epithelial regeneration. As such the claim that ectopic basal cell progenitors drive alveolar epithelial regeneration is not well-founded.

      We thank the reviewer for the positive comments and for agreeing that our findings are interesting.

      The title itself is also not very informative and is a bit misleading. That being said I think the manuscript is still very interesting and can likely easily be improved through a better validation of which cells the p63-CreERT2 tool is targeting.

      We have revised the title accordingly and performed extensive experiments to address the reviewer’s concerns.

      I, therefore, suggest the following experiments.

      1) Please analyze which cells p63-CreERT2 labels immediately after PR8 and tamoxifen treatment. Are all the tdTomato labeled cells also Krt5 and p63 positive or are some alveolar epithelial cells or other airway cell types also labeled?

      We thank the reviewer for the question. To answer the reviewer’s question, we performed PR8 infection (250 pfu) on three Trp63-CreERT2;R26tdT mice and TMX treatment at days 5 and 7 post viral infection. We didn't perform TMX injection immediately as the mice were sick at a few days post infection. The lung samples were collected at 14 dpi. We observed that tdT+ cells are present in the airways (rebuttal letter RLFigure 2A, B), and it appears that the lineage labeled cells (tdT+) include club cells (CC10+) that are underlined by tdT+Krt5+ basal cells (RLFigure 2C). We think that these labeled basal cells give rise to club cells. However, we also noticed that rare club cells and ciliated cells (FoxJ1+) are labeled by tdT in the areas absent of surrounding tdT+ basal cells (RLFigure 2D). Moreover, a minor population of tdT+ SPC+ cells are present in the terminal airways that were disrupted by viral infection (RLFigure 2E and D). We did not see any pods formed in this experiment and we did not observe any tdT+ cells in the intact alveoli (uninjured area).

      RLFigure 2. Trp63-CreERT2 lineage-labeled cells are present in the airways but not the alveoli when tamoxifen is administered at days 5 and 7 after PR8 H1N1 viral infection. Trp63-CreERT2;R26-tdT mice were infected with PR8 at 250 pfu and TMX was delivered at a dose of 0.25 mg/g body weight by oral gavage. Lung samples were collected and analyzed at 14 dpi. Antibodies used for staining are as indicated. Scale bar: 100 µm.

      2) Please also show if p63-CreERT2 labels any cells in the adult lung parenchyma in the absence of injury after tamoxifen treatment.

      Dr. Wellington Cardoso’s group demonstrated that Trp63-CreERT2 only labels very few cells in the airways but not the lung parenchyma in the absence of injury after tamoxifen treatment (Yang et al., 2018). Dr. Ying Yang has revisited the data and she did not observe any labeling in the lung parenchyma (n = 2).

      3) Please analyze if p63-CreERT2 labels any cells with tdTomato in the absence of injury or after PR8 infection but without tamoxifen treatment.

      We performed the experiment and didn't observe any labeled cells in the lung parenchyma without Tamoxifen treatment (n = 4).

      4) Please analyze when after PR8 infection do the first p63-CreERT2 labeled tdTomato positive alveolar epithelial cells appear.

      We administered tamoxifen at days 5 and 7 after PR8 infection and harvested lung tissues at day 14. As shown in Figure 1, we observed a few tdT+ SPC+ cells in the terminal airways that are disrupted by viral infection. Notably, we did not observe any lineage-labeled cells in the intact alveoli (uninjured) in this experiment.

      5) A clonal analysis of p63-CreERT2 labeled cells using a confetti reporter might also help interpret the origin of p63-CreERT2 labeled cells.

      We thank the reviewer for the suggestion. Our new data demonstrate that a rare population of SPC+tdT+ cells is present in the disrupted terminal airways of Trp63-CreERT2;R26tdT mice. Our data in the original manuscript and the new data suggest that the initial SPC+tdT+ cells are rare, because we have to administer multiple doses of tamoxifen to label them. Given the lower labeling efficiency of confetti mice compared with R26tdT mice, it is possible that we would not be able to label these SPC+ cells. Moreover, our original manuscript clearly shows individual clones of SPC+tdT+ cells in the regenerated lung, and they do not seem to be composed of multiple clones. Therefore, we think that the use of confetti mice may not add new information.

      6) Lastly, could the authors compare the single-cell RNAseq transcription profile of p63-CreERT2 labeled cells immediately after PR8 and tamoxifen treatment and also at 60 dpi? A pseudotime analysis and trajectory inference analysis could help elucidate the identity of p63-CreERT2 labeled cells that are actually not ectopic basal progenitor cells.

      We appreciate the reviewer’s suggestion and agree that single-cell RNA sequencing with pseudotime analysis can provide further information regarding the origin of the lineage-labeled alveolar cells of Trp63-CreERT2;R26tdT mice. That said, our new data clearly show that KRT5-CreER lineage-labeled cells do not give rise to AT1/2 cells, as previously described (Kanegai et al., 2016; Vaughan et al., 2015), suggesting that the ectopic basal progenitor cells do not generate alveolar cells. By contrast, Trp63-CreERT2 lineage-labeled cells do give rise to AECs, suggesting that the p63+ cell population capable of generating AECs is different from Krt5+ ectopic basal progenitor cells. Our single-cell core has an extremely long waiting list due to the pandemic, and we hope that our new findings are sufficient to address the reviewer’s concern without the need for single-cell analysis.

    1. Author Response

      Reviewer #1 (Public Review):

      1-1. I do have some concerns that the differences in network clustering reported in Fig 6 may be due to noise and I think the comparisons against the HCP parcellation could be more robust. Specifically, with regard to the network clustering in Fig 6. The authors use a clustering algorithm (which is not explained) to cluster the parcels into different functional networks. They achieve this by estimating the mean time series for each parcel in each individual, which they then correlate between the n regions, to generate an nxn connectivity matrix. This they then binarise, before averaging across individuals within an age group. It strikes me that binarising before averaging will artificially reduce connections for which only a subset of individuals are set to zero. Therefore averaging should really occur before binarising. Then I think the stability of these clusters should be explored by creating random repeat and generation groups (as done for the original parcels) or just by bootstrapping the process. I would be interested to see whether after all this the observation that the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months - remains.

      We thank the reviewer for this insightful comment on our clustering process. For the step of “binarizing before averaging”, we followed the method proposed by Yeo et al (1). In this method, all correlation matrices are binarized according to the individual-specific thresholds. Specifically, each individual-specific threshold is determined according to the percentile, and only 10% of connections are kept and set to 1, while all other connections are set to 0. Yeo et al. (1) explained their motivation for doing so as “the binarization of the correlation matrix leads to significantly better clustering results, although the algorithm appears robust to the particular choice of the threshold”. We consider that the possible reason is that the binarization of connectivity in each individual offers a certain level of normalization so that each subject can contribute the same number of connections. If averaging occurs before binarizing, the actual connectivity contributed by different subjects would be different, which leads to bias. Meanwhile, we tested the stability of ‘binarizing first’ and ‘averaging first’, and the result is shown in Fig. R1 below. This figure suggests a similar conclusion as (1), where binarizing first before averaging leads to better clustering stability. We added the motivation of binarizing before averaging in the revised manuscript between line 577 and line 581.

      Fig. R1. The comparison of clustering stability of different methods. The red line refers to the clustering stability when binarizing the correlation matrices first and then averaging the matrices across individuals, while the blue line refers to the clustering stability when averaging the correlation matrices across individuals first and then binarizing the average matrix.
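      The two orderings compared in Fig. R1 can be illustrated with a minimal sketch. It assumes each subject's connectivity is a square list-of-lists correlation matrix and that the threshold is taken over all off-diagonal entries; the function names are hypothetical and this is not the code used in the manuscript.

```python
# Sketch only: per-subject binarization (keep the strongest fraction of
# connections) followed by group averaging, versus averaging first and
# binarizing the group matrix once. All names are hypothetical.

def binarize_top_fraction(corr, keep=0.10):
    """Set the strongest `keep` fraction of off-diagonal entries to 1, the rest to 0."""
    n = len(corr)
    values = sorted(
        (corr[i][j] for i in range(n) for j in range(n) if i != j),
        reverse=True,
    )
    n_keep = max(1, round(keep * len(values)))
    threshold = values[n_keep - 1]  # subject-specific percentile threshold
    # Ties at the threshold are all kept, so slightly more than `keep` may survive.
    return [
        [1 if i != j and corr[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]

def average_matrices(mats):
    """Element-wise mean of equally sized square matrices."""
    n = len(mats[0])
    return [
        [sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
        for i in range(n)
    ]

def binarize_then_average(subject_corrs, keep=0.10):
    """Every subject contributes the same number of connections."""
    return average_matrices([binarize_top_fraction(c, keep) for c in subject_corrs])

def average_then_binarize(subject_corrs, keep=0.10):
    """Subjects with stronger correlations dominate the group threshold."""
    return binarize_top_fraction(average_matrices(subject_corrs), keep)
```

      With binarizing first, the group matrix holds the fraction of subjects in which each connection survived thresholding, which is the normalization effect described above.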

      For the final clustering results, we repeated our clustering method on 100 bootstrap samples and took the majority vote for each parcel as the final result. The comparison of the original and bootstrapped results is shown in Fig. R2. Overall, we observe good repeatability between these two results. However, some parcels show different patterns between the two results, especially parcels located near network boundaries or the medial wall. In particular, the observation that “the posterior frontoparietal expands to include the parahippocampal gyrus from 3-6 months and then disappears at 9 months” was not reproduced in the bootstrapped results. These results suggest that the clustering method is quite robust and the discovered patterns are relatively stable, and that the differences between our original results and the bootstrapped results might be caused by noise or inter-subject variability.

      Fig. R2. Top panel: the network clustering results using all data in the original manuscript. Bottom panel: the network clustering results using majority voting through 100 times of bootstrapping. Black circles and red arrows point to the parahippocampal gyrus, which was included in the posterior frontoparietal network, and is not well repeated in the bootstrapped results. (M: months)
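      The bootstrap-and-vote procedure behind the bottom panel can be sketched as follows. This is an illustration under stated assumptions: `cluster_fn` is a hypothetical callable that returns one network label per parcel, and labels are assumed to be already matched across runs (in practice, cluster labels from independent runs usually need to be aligned first).

```python
import random
from collections import Counter

def bootstrap_majority_vote(subjects, cluster_fn, n_boot=100, seed=0):
    """Resample subjects with replacement, cluster each resample, and keep
    the most frequent label for every parcel.

    `cluster_fn(sample)` is a hypothetical stand-in for the clustering step;
    it must return a list with one label per parcel.
    """
    rng = random.Random(seed)
    runs = []
    for _ in range(n_boot):
        # Bootstrap sample: same size as the cohort, drawn with replacement.
        sample = [rng.choice(subjects) for _ in subjects]
        runs.append(cluster_fn(sample))
    n_parcels = len(runs[0])
    # Majority vote per parcel across all bootstrap runs.
    return [
        Counter(run[p] for run in runs).most_common(1)[0][0]
        for p in range(n_parcels)
    ]
```

      Parcels whose winning label has only a slim majority are the ones most likely to differ from a single run on all data, which matches the boundary and medial-wall parcels noted above.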

      1-2. Then with regard to the comparison against the HCP parcellation, this is only qualitative. The authors should see whether the comparison is quantitatively better relative to the null clusterings that they produce.

      Thank you for this great suggestion! As suggested, we added this quantitative comparison using the Hausdorff distance. As in the comparison of parcel variance and homogeneity, the 1,000 null parcellations were created by randomly rotating our parcellation by small angles on the spherical surface 1,000 times. We compared our parcellation and the null parcellations by evaluating their Hausdorff distances to specific areas of the HCP parcellation on the spherical space, including Brodmann's areas 2, 3b, 4+3a, 44+45, V1, and MT+MST. The results are shown in Figure 4. Our parcellation generally shows significantly lower Hausdorff distances to the HCP parcellation, suggesting that it generates parcel borders that are closer to those of the HCP parcellation than the null parcellations do.

      However, we noticed that a small number of null parcellations show smaller Hausdorff distances than our parcellation. A possible reason is that our surface registration to the HCP template is based purely on cortical folding, without using functional gradient density maps, which are not available in the HCP template. As a result, high-quality functional alignment between our infant data and the HCP space is not ensured, which inevitably increases the Hausdorff distance between our parcellation and the HCP parcellation.
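      For readers unfamiliar with the metric, the comparison against null parcellations can be sketched as below. The sketch assumes each border is a list of coordinate tuples and uses straight-line (Euclidean) distance; on a spherical surface a geodesic distance may be more appropriate. The function names are hypothetical and this is not the code used for Figure 4.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def fraction_of_nulls_at_least_as_close(observed, null_distances):
    """Share of null parcellations whose distance to the reference border
    is less than or equal to the observed distance."""
    return sum(d <= observed for d in null_distances) / len(null_distances)
```

      A small fraction means that few rotated parcellations match the reference border as well as the observed parcellation does.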

      1-3. … not all individuals appear (from Fig 8) to be acquired exactly at the desired timepoints, so maybe the authors might comment on why they decided not to apply any kernel weighted or smoothing to their averaging? Pg. 8 'and parcel numbers show slight changes that follow a multi-peak fluctuation, with inflection ages of 9 and 18 months' explain - the parcels per age group vary - with age with peaks at 9 and 18 - could this be due to differences in the subject numbers, or the subjects that were scanned at that point?

      We do agree with the reviewer that subjects are not scanned at similar time points. This is designed in the data acquisition protocol to seamlessly cover the early postnatal stage so that we will have a quasi-continuous observation of the dynamic early brain development.

      We didn’t apply a kernel-weighted average or smoothing when generating the parcellation, as we would like each scan to contribute equally, so that each parcellation map is representative of the whole cohort in the covered age range, instead of only part of it. Meanwhile, our final ‘age-common parcellation’ could be representative of all subjects from birth to 2 years of age. However, we agree that for a parcellation map designed only for a specific age, e.g., 1-year-olds, a kernel-weighted average or even a more restricted age range could be a more appropriate solution.

      For the parcel number that likely shows fluctuations with subject numbers, we added an experiment, where we randomly selected 100 scans by considering the minimum scan number in each age group using bootstrapping and repeated this process 100 times. The average parcel number of each age is reported in the following Table R1. We didn’t observe strong changes in parcel numbers when reducing scan numbers, which further demonstrates that our parcel numbers do not show a strong relation to subject numbers. However, the parcel number does not increase greatly from 18M to 24M in the bootstrapping results, so we modified the statement in the manuscript about the parcel number to ‘… all parcel numbers fall between 461 to 493 per hemisphere, where the parcel number attains a maximum at around 9 months and then reduces slightly and remains relatively stable afterward. …’, which can be found between line 121 and line 122.
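      The equal-sample-size check described above can be sketched as follows. The sketch assumes that every age group is resampled (with replacement) down to the size of the smallest group; `count_parcels` is a hypothetical stand-in for the full parcellation step, and the exact sampling scheme used for Table R1 may differ.

```python
import random
from statistics import mean

def mean_parcel_count_equal_n(scans_by_age, count_parcels, n_repeats=100, seed=0):
    """Average parcel count per age group when every group uses the same
    number of scans.

    scans_by_age:  dict mapping an age label to a list of scans.
    count_parcels: hypothetical callable mapping a list of scans to a
                   parcel count (stands in for the parcellation pipeline).
    """
    rng = random.Random(seed)
    # Use the smallest group size so that all ages are compared on equal footing.
    n_draw = min(len(scans) for scans in scans_by_age.values())
    result = {}
    for age, scans in scans_by_age.items():
        counts = [
            count_parcels(rng.choices(scans, k=n_draw))  # draw with replacement
            for _ in range(n_repeats)
        ]
        result[age] = mean(counts)
    return result
```

      If the averaged counts track the full-data counts, differences in group size are unlikely to explain the variation in parcel number across ages.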

      1-4. I also have some residual concerns over the number of parcels reported, specifically as to whether all of this represents fine-grained functional organisation, or whether some of it represents noise. The number of parcels reported is very high. While Glasser et al 2016 reports 360 as a lower bound, it seems unlikely that the number of parcels estimated by that method would greatly exceed 400. This would align with the previous work of Van Essen et al (which the authors cite as 53) which suggests a high bound of 400 regions. While accepting Eickhoff's argument that a more modular view of parcellation might be appropriate, these are infants with underdeveloped brain function.

      We thank the reviewer for this insightful comment. We agree that some of the parcels might be affected by noise, as noise exists in each step, such as data acquisition, image processing, surface reconstruction, and registration, especially considering that functional MRI is noisier than structural MRI. Though our experiments show that our parcellation is fine-grained and suitable for the study of infant brain functional development, it is hard to validate directly and quantitatively, as there is no ground truth available.

      Despite this, we are still motivated to create fine-grained parcellations: with the growth of larger and higher-resolution imaging datasets and advanced computational methods, parcellations with more fine-grained regions are desired for downstream analyses, especially considering the hierarchical nature of brain organization (2). The main reason that our method generates much finer parcellation maps is that both our registration and parcellation processes are based on the functional gradient density, which characterizes a fine-grained feature map based on fMRI. This leads to both better inter-subject alignment of functional boundaries and finer region partitions. This strategy differs from that of Glasser et al. (3), which jointly considers multimodal information for defining parcel boundaries; thus, parcels revealed purely by functional MRI might be ignored in the HCP parcellation. We hope our parcellation framework can be a useful reference for this research direction. We added this discussion in the revised manuscript between line 268 and line 271.

      For the parcel number, even without performing surface registration based on fine-grained functional features, recent adult fMRI-based parcellations have greatly increased parcel numbers, such as up to 1,000 parcels in Schaefer et al. (4), 518 parcels in Peng et al. (5), and 1,600 parcels in Zhao et al. (6). For infants, we agree that functional connectivity might not be as strong as in adults. However, there are opinions (7-9) that the basic units of functional organization are likely to be present in infant brains, and that brain functional development gradually shapes the brain networks. Therefore, the functional parcel units in infants could possibly be on a comparable scale to those in adults. Even so, we agree that more research needs to be performed on larger datasets for better evaluations. We added this discussion in the revised manuscript between line 275 and line 280.

      1-5. Further comparisons across different subjects based on small parcels increases the chances of downstream analyses incorporating image registration noise, since as Glasser et al 2016 noted, there are many examples of topographic variation, which diffeomorphic registration cannot match. Therefore averaging across individuals would likely lose this granularity. I'm not sure how to test this beyond showing that the networks work well for downstream analyses but I think these issues should be discussed.

      We agree with the reviewer that averaging across individuals inevitably brings some registration errors to the parcellation, especially for regions with high topographic variation across subjects, which would lead to loss of granularity in these regions. We believe this is an important issue that exists in most methods on group-level parcellations, and the eventual solution might be individualized parcellation, which will be our future work. We added this discussion in the revised manuscript between line 288 and line 292.

      We also agree with the reviewer that downstream analyses are important evaluations for parcellations. We provided a beta version of our parcellation with 602 parcels (10) to our colleagues, and they tested our parcellation in the task of infant individual recognition across ages using functional connectivity, to explore infant functional connectome fingerprinting (10). We compared the performance of different parcellations with 602 ROIs (our beta version), 360 ROIs (HCP MMP parcellation (3)), and 68 ROIs (FreeSurfer parcellation (11)). The results (Fig. R3) show that our parcellation with a higher parcellation number yields better accuracy compared to other parcellations. We added a description of this downstream application in the discussion between line 284 and line 287.

      Fig. R3. The comparison of different parcellations for infant individual recognition across age based on functional connectivity (figure source: Hu et al. (10)). The parcellation with 602 ROIs is the beta version of our parcellation, 360 ROIs stands for HCP MMP parcellation (3) and 68 ROIs stands for the FreeSurfer parcellation (11). This downstream task shows that a higher parcellation number does lead to better accuracy in the application.

      1-6. Finally, I feel the methods lack clarity in some areas and that many key references are missing. In general I don't think that key methods should be described only through references to other papers. And there are many references, particular to FSL papers, that are missing.

      We thank the reviewer for this great suggestion. We added related references for FLIRT, FSL, MCFLIRT, and TOPUP. For the alignment to the HCP 32k_LR space, we first aligned all subjects to the fsaverage space using spherical demons, and then used part of the HCP pipeline (12) to map the surface from the fsaverage space to the HCP 164k_LR space, and downsampled to the 32k_LR space. We modified this citation to reference the HCP pipeline by Glasser et al. (12) instead and detailed this registration process in the revised manuscript between line 434 and line 440, as below:

      “… The population-mean surface maps were mapped to the HCP 164k ‘fs_LR’ space using the deformation field that deforms the ‘fsaverage’ space to the ‘fs_LR’ space released by Van Essen et al. (13), which was obtained by landmark-based registration. By concatenating the three deformation fields of steps 1, 3, and 4, we directly warped all cortical surfaces from individual scan spaces to the HCP 164k_LR space and then resampled them to 32k_LR using the HCP pipeline (12), thus establishing vertex-to-vertex correspondences across individuals and ages …”

      Reviewer #2 (Public Review):

      2-1. Diminishing enthusiasm is the lack of focus in the result section, the frequent use of jargon, and figures that are often difficult to interpret. If those issues are addressed, the proposed atlas could have a high impact in the field especially as it is aligned with the template of the Human Connectome Project.

      We’d like to thank Reviewer #2 for the appreciation of our atlas. According to the reviewer’s suggestion, we went through the manuscript again by focusing on correcting the use of jargon, clarity in the result section, as well as figures and figure captions. We hope our corrections can help explain our work to a broader community. Our revisions are accordingly detailed in the following. Meanwhile, our parcellation maps have been aligned with the templates in HCP and FreeSurfer and made available via NITRC at: https://www.nitrc.org/projects/infantsurfatlas/.

      References

      1. B. Thomas Yeo, F. M. Krienen, J. Sepulcre, M. R. Sabuncu, D. Lashkari, M. Hollinshead, J. L. Roffman, J. W. Smoller, L. Zöllei, J. R. Polimeni, The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology 106, 1125-1165 (2011).

      2. S. B. Eickhoff, R. T. Constable, B. T. Yeo, Topographic organization of the cerebral cortex and brain cartography. NeuroImage 170, 332-347 (2018).

      3. M. F. Glasser, T. S. Coalson, E. C. Robinson, C. D. Hacker, J. Harwell, E. Yacoub, K. Ugurbil, J. Andersson, C. F. Beckmann, M. Jenkinson, S. M. Smith, D. C. Van Essen, A multi-modal parcellation of human cerebral cortex. Nature 536, 171-178 (2016).

      4. A. Schaefer, R. Kong, E. M. Gordon, T. O. Laumann, X.-N. Zuo, A. J. Holmes, S. B. Eickhoff, B. T. Yeo, Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex 28, 3095-3114 (2018).

      5. L. Peng, Z. Luo, L.-L. Zeng, C. Hou, H. Shen, Z. Zhou, D. Hu, Parcellating the human brain using resting-state dynamic functional connectivity. Cerebral Cortex, (2022).

      6. J. Zhao, C. Tang, J. Nie, Functional parcellation of individual cerebral cortex based on functional mri. Neuroinformatics 18, 295-306 (2020).

      7. W. Gao, S. Alcauter, J. K. Smith, J. H. Gilmore, W. Lin, Development of human brain cortical network architecture during infancy. Brain Structure and Function 220, 1173-1186 (2015).

      8. W. Gao, H. Zhu, K. S. Giovanello, J. K. Smith, D. Shen, J. H. Gilmore, W. Lin, Evidence on the emergence of the brain's default network from 2-week-old to 2-year-old healthy pediatric subjects. Proceedings of the National Academy of Sciences 106, 6790-6795 (2009).

      9. K. Keunen, S. J. Counsell, M. J. Benders, The emergence of functional architecture during early brain development. NeuroImage 160, 2-14 (2017).

      10. D. Hu, F. Wang, H. Zhang, Z. Wu, Z. Zhou, G. Li, L. Wang, W. Lin, G. Li, UNC/UMN Baby Connectome Project Consortium, Existence of Functional Connectome Fingerprint during Infancy and Its Stability over Months. Journal of Neuroscience 42, 377-389 (2022).

      11. R. S. Desikan, F. Ségonne, B. Fischl, B. T. Quinn, B. C. Dickerson, D. Blacker, R. L. Buckner, A. M. Dale, R. P. Maguire, B. T. Hyman, An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31, 968-980 (2006).

      12. M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80, 105-124 (2013).

    1. Author Response:

      Reviewer #1 (Public Review):

      The key question that the authors were addressing was how ethnicity differentially affects the microbiota of subjects living in a particular area (in this case East Asians and Caucasians living in San Francisco that have been enrolled in an 'Inflammation, Diabetes, Ethnicity and Obesity' cohort - although inflammatory disease was apparently excluded in these subjects).

      The existence of differences between different populations allows potential discrimination of the underlying factors - such as host genetics, diet, lifestyle, physiological parameters, body habitus or other environmental influences. In this case body habitus has been selected as a stratification factor between the two ethnicities. Immigration potentially allows distinction of environmental and host genetic influences.

      The strength of the study is in the level of robust analysis of the microbiotas by a very experienced group of researchers, distinguishing the microbiota differences, especially in lean subjects, with analysis of associations that may be driving the differences. It is interesting that diet is not one of the apparent associations in this study, yet the relationship of microbiota diversity to body habitus is strong in Caucasian subjects. These associations cannot easily be extrapolated to causation or mechanism - a fact well recognized in the paper - but remain important observations that rationalize in vivo modeling with experimental animals or in vitro analyses of microbial interactions between different taxa simulating the context of differences in the intestinal milieu. The paper includes work showing that differences of the microbiota can be recapitulated after transfer to germ-free mice, at least over the short term: this is important to provide tools to model the reasons for differences in consortial composition.

      A very large amount of work was required to assemble the samples and the clinical phenotypic metadata set, making the data an important and definitive contribution for the subjects studied. Of course, this is one sample of extremely variable human conditions and lifestyles that will help build the overall picture of how differences in our genetics and environment shape our intestinal microbiota.

      We appreciate the reviewer's positive summary of our manuscript and agree with the reviewer’s assessment of the need for both mechanistic follow-on studies and extensions to larger and more diverse cohorts.

      Reviewer #2 (Public Review):

      The study's primary aims are to test for the differences in the microbiome between self-identified East Asian and White subjects from the San Francisco area in the new IDEO cohort. The study builds on a growing literature which describes variations among ethnic groups. The major conclusion of "emphasize the utility of studying diverse ethnic groups" is not novel to the literature.

      It was not our intention to imply that our study is novel in studying two distinct ethnic groups, but rather to emphasize that differences exist between ethnicities with regard to the gut microbiome and to provide a systematic analysis of this including gnotobiotic mouse models along a key health disparity in Asian Americans. We include references of prior examples of this work in our introduction (including several references in our introductory paragraph). We have modified our abstract to clarify this point further:

      “Taken together, our findings add to the growing body of literature describing variation between ethnicities and provide a starting point for defining the mechanisms through which the microbiome may shape disparate health outcomes in East Asians.”

      Overall, the strength of the results is that they confirm patterns from different cohorts/studies and demonstrate that ethnic-related differences are common. The results are subject to sample size concerns that may underpin some of the conflicting or lack of significant results. For instance, there is no overlap in highlighted species-level taxonomy differences between 16S and metagenomic analyses, which precludes a clear interpretation of the meaning of those differences and whether taxa should be highlighted in the abstract; there are low AUC values for the random forest modelling; and there is a lack of significance in correlations between BMI and East Asian subjects in F4a where there may be a correlation. While a minor point, it serves to highlight the sample sizes as the range of the variation in East Asian subjects is not as substantial as the White subjects because there are fewer East Asian data points above a 30 BMI (~N=5) relative to those of White subjects (~N=11).

      We agree that our study was limited by sample size and that future studies increasing sample size would be valuable to assess the intersection of metabolic health in colocalized EA and W subjects. We include this in our discussion:

      “Due to the investment of resources into ensuring a high level of phenotypic information on each cohort member, and due to its restricted geographical catchment area, the IDEO cohort was relatively small at the time of this analysis (n=46 individuals). This study only focused on two of the major ethnicities in the San Francisco Bay Area; as IDEO continues to expand and diversify its membership, we hope to study a sufficient number of participants from other ethnic groups in the future.”

      The microbiome transfers from humans to mice also demonstrate that certain features of interpersonal or ethnic-related differences can be established in mice. This is useful for future studies, but it is not unexpected in and of itself given the robustness of transferring microbiome differences in other human-to-mouse studies. If the phenotype data were more compelling, then the utility of these transfers could be valuable.

      We respectfully disagree with this point. To our knowledge, this is the first study demonstrating that ethnicity-associated differences in the gut microbiota are stable following transplantation, which is certainly not guaranteed given the marked and currently unpredictable variations between donor and recipient microbiotas shown here and in prior studies by us (Nayak et al., 2021; Turnbaugh et al., 2009b) and others (Walter et al., 2020).

      We state this rationale in our results section:

      “Taken together, our results support the hypothesis that there are stable ethnicity-associated signatures within the gut microbiota of lean EA vs. W individuals that are independent of diet. To experimentally test this hypothesis, we transplanted the gut microbiotas of two representative lean W and lean EA individuals into germ-free male C57BL/6J mice…Next, we sought to assess the reproducibility of these findings across multiple donors and in the context of a distinctive dietary pressure. We fed 20 germ-free male mice a high-fat, high-sugar (HFHS) diet for 4 weeks prior to colonization with a gut microbiota from one of 5 W and 5 EA donors....”

      Furthermore, while the phenotypic data may not be as dramatic as the reviewer had hoped, this is to our knowledge the first demonstration that ethnicity-associated differences in the gut microbiota play a causal role in host phenotypes, as highlighted in our discussion:

      “Our results in humans and mouse models support the broad potential for downstream consequences of ethnicity-associated differences in the gut microbiome for metabolic syndrome and potentially other disease areas. However, the causal relationships and how they can be understood in the context of the broader differences in host phenotype between ethnicities require further study.”

      However, in the current state, I am concerned with the experimental design since the LFPP experiments used N=1 donor per ethnicity for establishing the mice colonies and are resultantly confounded by mice pseudo-replication with recipient mice derived from one donor of each ethnicity. This concern is relevant to interpreting results back to interpersonal or interethnic variation. Are phenotypic differences due to individual differences or ethnic differences? It's not clear.

      We presented our data in summary form integrating the results from 3 independent experiments across two figures. To account for pseudoreplication as the reviewer suggests, we have restricted permutational space to account for one donor for multiple recipient mice using the parameters outlined in the adonis software package. Analyzing our results from 3 separate experiments, our results are statistically significant, which we mention in the revised text:

      “In a pooled analysis of all gnotobiotic experiments accounting for one donor for multiple recipient mice, ethnicity and diet were both significantly associated with variations in the gut microbiota (Fig. S9), consistent with the extensive published data demonstrating the rapid and reproducible impact of a HFHS diet on the mouse and human gut microbiota (Bisanz et al., 2019).”

      Figure S9. Combined analysis of recipient mice reveals significant associations with donor ethnicity and recipient diet. A PhILR PCoA is plotted based on 16S-Seq data from all gnotobiotic experiments. Individual mice are colored by (A) donor ethnicity or (B) the recipient’s diet. Both ethnicity and diet were statistically significant contributors to variance (ADONIS p-values and estimated variance displayed using blocks restricted by donor identifiers to account for one donor going to multiple recipient mice). We also observed a trend for interaction between diet and ethnicity in this model (p=0.068, R2=0.047, ADONIS).
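      The pseudo-replication issue can be made concrete with a small sketch. One common way to respect the mouse-within-donor structure is to shuffle ethnicity labels across donors rather than across individual mice. The example below is an illustration under stated assumptions, not the adonis call used for Figure S9: it uses a single numeric measurement per mouse and a difference in group means as the test statistic, and all names are hypothetical.

```python
import random
from statistics import mean

def donor_level_permutation_p(values, donors, donor_group, n_perm=999, seed=0):
    """Permutation p-value for a two-group difference in means, shuffling
    group labels at the donor level.

    values:      one measurement per recipient mouse.
    donors:      donor id for each mouse (same order as `values`).
    donor_group: dict mapping donor id to one of two group labels.
    """
    rng = random.Random(seed)
    donor_ids = sorted(donor_group)
    labels = [donor_group[d] for d in donor_ids]
    reference = sorted(set(labels))[0]  # arbitrary reference group

    def statistic(assignment):
        in_ref = [v for v, d in zip(values, donors) if assignment[d] == reference]
        others = [v for v, d in zip(values, donors) if assignment[d] != reference]
        return abs(mean(in_ref) - mean(others))

    observed = statistic(donor_group)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # all mice from one donor move together
        if statistic(dict(zip(donor_ids, shuffled))) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

      With a single donor per group, every shuffle reproduces the observed difference, so the p-value is 1 regardless of how many recipient mice are used; this is why pooling experiments with more donors matters for the inference.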

      The HFHS experiment also used N=5 donors that somewhat mitigates these concerns, but mixed sexes were used here and there can be sex-specific human microbiome differences.

      Our study was designed to evaluate ethnicity and metabolic health. As we report in our original and updated analysis, we found no significant associations between the gut microbiota and biological sex (Figs. 2E and S4) in the IDEO cohort, perhaps due to the small effect size of sex reported in prior studies by other groups (Arumugam et al., 2011; Ding and Schloss, 2014; Schnorr et al., 2014; Zhang et al., 2021) coupled to the limited size of the current IDEO cohort.

      The Turnbaugh and Koliwad labs use mixed sexes as donors for studies in conventionally raised and gnotobiotic mice due to our active funding from the NIH, which has clear guidelines meant to prevent continued discrimination against studies in females. The following link has additional information for your consideration: https://orwh.od.nih.gov/sex-gender/nih-policy-sex-biological-variable.

      Importantly, our study was not confounded by sex due to the use of similar numbers of male and female donors (2 male and 2 females in the LFPP experiments and 3 female and 2 males for both ethnicities in the HFHS experiment). All of our recipient mice were male, as specified in our methods section and our revised main text:

      “To experimentally test this hypothesis, we transplanted the gut microbiotas of two representative lean W and lean EA individuals into germ-free male C57BL/6J mice…Next, we sought to assess the reproducibility of these findings across multiple donors and in the context of a distinctive dietary pressure. We fed 20 germ-free male mice a high-fat, high-sugar (HFHS) diet for 4 weeks prior to colonization with a gut microbiota from one of 5 W and 5 EA donors....”

      To further investigate any potential sex-specific signal, we have stratified our analysis for the HFHS experiment by the sex of the donors (Reviewer Figure 2). This reveals that the significance between ethnicity in the microbiota transplantation experiments is preserved in mice that received stool from male donors (Reviewer Fig. 2A) but not female donors (Reviewer Fig. 2B). In Reviewer Fig. 1 above, LFPP1 and LFPP2 were conducted using different donors of different biological sex. Splitting our LFPP experiments up revealed the consistent signal for ethnicity in microbial community composition that we report above. The small sample sizes in this stratified analysis make it difficult to conclude that there are reproducible sex-specific differences in the microbiome transplant experiments, but we agree with the reviewer that this question should be more thoroughly explored in future work.

      We have added a brief note to the discussion to emphasize this important point:

      “...differences between the human donor and recipient mouse microbiotas inherent to gnotobiotic transplantation warrant further investigation as do differences in the stability of the gut microbiotas of male versus female donors”

      Reviewer Figure 2. (A,B) Principal coordinate analysis of PhILR Euclidean distances of stool from germ-free recipient mice transplanted with stool microbial communities from (A) male (n=2 EA and n=2 W donors) or (B) female (n=3 EA and n=3 W) donors of either ethnicity and fed a HFHS diet. Significance was assessed by ADONIS. Pairs of germ-free mice receiving the same donor sample are connected by a dashed line (n=2 recipient mice per donor). Experimental designs are shown in Fig. S7.

      Finally, experimental results are not always consistent and sometimes show opposite trends that may be related to the sampling sizes. For instance, fat and lean mass increased and decreased respectively in LFPP, but there were no statistically-similar differences in HFHS. Moreover, the metabolic fat mass outcomes in mice do not match the expected human donor data. For instance, in LFPP1, White subjects had lower fat mass in humans but recipient mice on average gained more fat. It is difficult to reconcile these differences to a biological or sampling scheme reason.

      We wholeheartedly agree with this point and were also surprised that the recipient mouse phenotypes did not match our original hypothesis based upon the observed health disparities between EA and W individuals. These surprising and perhaps counter-intuitive results demand further study and mechanistic dissection. We have tried to capture potential explanations for these findings while highlighting the limitations of our current study in our expanded discussion. With respect to the glucose tolerance data, the lack of a microbiome-driven phenotype might be due to the use of genetically identical mice that are not prone to metabolic illness without significant perturbation. If we had used mice prone to metabolic disease, such as non-obese diabetic (NOD) germ-free recipient mice, in which the microbiome is known to impact the development of diabetes, we might have seen differences in glucose tolerance between ethnicities.

      Our revised discussion, with key points underlined, is copied below for your convenience:

      “Our results in humans and mouse models support the broad potential for downstream consequences of ethnicity-associated differences in the gut microbiome for metabolic syndrome and potentially other disease areas. However, the causal relationships and how they can be understood in the context of the broader differences in host phenotype between ethnicities require further study. While these data are consistent with our general hypothesis that ethnicity-associated differences in the gut microbiome are a source of differences in host metabolic disease risk, we were surprised by both the nature of the microbiome shifts and their directionality. Based upon observations in the IDEO (Alba et al., 2018) and other cohorts (Gu et al., 2006; Zheng et al., 2011), we anticipated that the gut microbiomes of lean EA individuals would promote obesity or other features of metabolic syndrome. In humans, we did find multiple signals that have been previously linked to obesity and its associated metabolic diseases in EA individuals, including increased Firmicutes (Basolo et al., 2020; Bisanz et al., 2019), decreased A. muciniphila (Depommier et al., 2019; Plovier et al., 2017), decreased diversity (Turnbaugh et al., 2009a), and increased acetate (Perry et al., 2016; Turnbaugh et al., 2006). Yet EA subjects also had higher levels of Bacteroidota and Bacteroides, which have been linked to improved metabolic health (Johnson et al., 2017). More importantly, our microbiome transplantations demonstrated that the recipients of the lean EA gut microbiome had less body fat despite consuming the same diet. These seemingly contradictory findings may suggest that the recipient mice lost some of the microbial features of ethnicity relevant to host metabolic disease or alternatively that the microbiome acts in a beneficial manner to counteract other ethnicity-associated factors driving disease.

      EA subjects also had elevated levels of the short-chain fatty acids propionate and isobutyrate. The consequences of elevated intestinal propionate levels are unclear given the seemingly conflicting evidence in the literature that propionate may either exacerbate (Tirosh et al., 2019) or protect from (Lu et al., 2016) aspects of metabolic syndrome. Clinical data suggests that circulating propionate may be more relevant for disease than fecal levels (Müller et al., 2019), emphasizing the importance of considering both the specific microbial metabolites produced, their intestinal absorption, and their distribution throughout the body. Isobutyrate is even less well-characterized, with prior links to dietary intake (Berding and Donovan, 2018) but no association with obesity (Kim et al., 2019). Unlike SCFAs, we did not identify consistent differences in BCAAs, potentially due to differences in both extraction and standardization techniques inherent to GC-MS and NMR analysis (Cai et al., 2016; Lynch and Adams, 2014; Qin et al., 2012).

      There are multiple limitations of this study. Due to the investment of resources into ensuring a high level of phenotypic information on each cohort member coupled to the restricted geographical catchment area, the IDEO cohort was relatively small at the time of this analysis (n=46 individuals). The current study only focused on two of the major ethnicities in the San Francisco Bay Area. As IDEO continues to expand and diversify its membership, we hope to study a sufficient number of participants from other ethnic groups. Stool samples were collected at a single time point and analyzed in a cross-sectional manner. While we used validated tools from the field of nutrition to monitor dietary intake, we cannot fully exclude subtle dietary differences between ethnicities (Johnson et al., 2019), which could be interrogated through controlled feeding studies (Basolo et al., 2020). Our mouse experiments were all performed in wild-type adult males. The use of a microbiome-dependent transgenic mouse model of diabetes (Brown et al., 2016) would be useful to test the effects of inter-ethnic differences in the microbiome on insulin and glucose tolerance. Additional experiments are warranted using the same donor inocula to colonize germ-free mice prior to concomitant feeding of multiple diets, allowing a more explicit test of the hypothesis that diet can disrupt ethnicity-associated microbial signatures. These studies, coupled to controlled experimentation with individual strains or more complex synthetic communities, would help to elucidate the mechanisms responsible for ethnicity-associated changes in host physiology and their relevance to disease.”

      Reviewer #3 (Public Review):

      The authors aimed to characterise how the gut microbiota differs between ethnic groups in terms of bacterial richness and community structure. They also wanted to address how this is associated with ethnic group within a defined geographical location. They started their story by comparing the fecal microbiota of a relatively small cohort consisting of 46 lean and obese East Asian and White participants living in the San Francisco Bay Area. For this purpose they used 16S and shotgun metagenomics. They demonstrated that ethnicity-associated differences in the gut microbiota are stronger in lean individuals, whereas obese individuals did not show a clear difference in the gut microbiota profile between ethnic groups, suggesting either that established obesity or its associated dietary patterns can overwrite long-lasting microbial signatures or, alternatively, that there is a shared ethnicity-independent microbiome type that predisposes individuals to obesity. The authors also showed the metabolic differences between these ethnic groups; the major differences were in the branched-chain amino acids and the short-chain fatty acids. To prove their point, at this stage they also used a different metabolomic methodology. Although some aspects of the work are not very novel, the work does provide additional insights into the effect(s) of ethnicity, current living location and diet on shaping microbiota. Honestly, while reading through the manuscript, I had several questions where I believed that clarification was needed. But somehow, I felt like the authors had been reading my mind every step of the way. At the end of each section, whatever I questioned was addressed in the next paragraph. There are, however, a few points on which I would like to hear the authors' clarification.

      • The authors pursued the story using 16S data. However, they have shotgun metagenomics data, which gives more power and resolution to the microbiota profile. Is there any specific reason why the story was not built with the shotgun metagenomic data? If so, it would be nice to state in the text or legends which dataset each figure was built with.

      As discussed above, 16S rRNA gene and metagenomic sequencing both have strengths and weaknesses. For example, 16S-seq is inexpensive and allows analysis of low abundance species, whereas metagenomics permits analysis of gene and pathway abundances of abundant taxa. As requested, we have now expanded Figure 2 (metagenomics) to better match Figure 1 (16S-seq). The type of technology is defined within each legend and the relevant text within our results.

      • Even though the authors mentioned in the discussion that they did not use the same inocula from a donor across different diets, it would be nice if the authors further commented on whether they would expect the same or slightly different results with each different inoculum.

      As requested, we have modified the text in our discussion to include these comments:

      “Additional experiments are warranted using the same donor inocula to colonize germ-free mice prior to concomitant feeding of multiple diets, allowing a more explicit test of the hypothesis that diet can disrupt ethnicity-associated microbial signatures. These studies, coupled to controlled experimentation with individual strains or more complex synthetic communities, would help to elucidate the mechanisms responsible for ethnicity-associated changes in host physiology and their relevance to disease.”

      Overall, the study is well executed and claims and conclusions seem relatively well justified by the provided evidence. The findings are interesting for a broad audience of biologists.

    1. Author Response:

      Reviewer #1 (Public Review):

      Overall, the authors have done a nice job covering the relevant literature, presenting a story out of complicated data, and performing many thoughtful analyses.

      However, I believe the paper requires quite major revisions.

      We thank the reviewer for their encouraging assessment of our manuscript. We are grateful for their valuable and especially detailed feedback that helped us to substantially improve our manuscript.

      Major issues:

      I do not believe the current results present a clear, comprehensible story about sleep and motor memory consolidation. As presented, sleep predicts an increase in the subsequent learning curve, but there is a negative relationship between learning curve and task proficiency change (which is, as far as I can tell, similar to "memory retention"). This makes it seem as if sleep predicts more forgetting on initial trials within the subsequent block (or worse memory retention) - is this true? Regardless of whether it is statistically true, there appears another story in these data that is being sacrificed to fit a story about sleep. To my eye, the results may first and foremost tell a circadian (rather than sleep) story. Examining the data in Figure 2A and 2B, it appears that every AM learning period has a higher learning curve (slope) than every PM period. While this could, of course, be due to having just slept, the main story gleaned from such a result is not a sleep effect on retention, which has been the emphasis on motor memory consolidation research in the last couple of decades, but on new learning. The fact that this effect appears present in the first session (juggling blocks 1-3 in adolescents and blocks 1-5 in adults) makes this seem the more likely story here, since it has less to do with "preparing one to re-learn" and more to do with just learning and when that learning is optimal. But even if it does not reach statistical significance in the first session alone, it remains a concern and, in my opinion, should be considered a focus in the manuscript unless the authors can devise a reason to definitively rule it out.

      Here is how I recommend the authors proceed on this point: include all sessions from all subjects into a mixed effect model, predicting the slope of the learning curve with time of day and age group as fixed effects and subjects as random effects:

      learning curve slope ~ AM/PM [AM (0) or PM (1)] + age [adolescent (0) or adult (1)] + (1|subject)

      …or something similar with other regressors of interest. If this is significant for AM/PM status, they should re-try the analysis using only the first session. If this is significant, then a sleep-centric story cannot be defended here at all, in my opinion. If it is not (which could simply result from low power, but the authors could decide this), the authors should decide if they think they can rule out circadian effects and proceed accordingly. I should note that, while to many, a sleep story would be more interesting or compelling, that is not my opinion, and I would not solely opt to reject this paper if it centered a time-of-day story instead.
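
      Written out, the random-intercept model suggested above would take roughly the following form. The symbols are introduced here for illustration and do not appear in the manuscript: $i$ indexes subjects, $j$ indexes sessions, $\mathrm{PM}_{ij}$ is 0 for a morning session and 1 for an evening session, $\mathrm{Adult}_i$ is 0 for adolescents and 1 for adults, and $u_i$ is the subject-specific random intercept corresponding to the (1|subject) term.

```latex
\begin{equation}
\text{slope}_{ij} = \beta_0 + \beta_1\,\mathrm{PM}_{ij} + \beta_2\,\mathrm{Adult}_{i} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{equation}
```

      Under this formulation, the time-of-day question reduces to whether $\beta_1$ differs from zero, once across all sessions and once restricted to the first session.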

      The authors need to work out precisely what is happening in the behavior here, and let the physiology follow that story. They should allow themselves to consider very major revisions (and drop the physiology) if that is most consistent with the data. As presented, I am very unclear of what to take away from the study.

      We thank the reviewer for the opportunity to further elaborate on our behavioral results. We agree that the interpretation of the behavior in the complex gross-motor task is not straightforward, which might be partly due to less controllability compared to, for example, finger-tapping tasks. The reviewer is correct that, initially, sleep seems to predict more forgetting on initial trials within the subsequent block, given the dip in task proficiency and a resulting increase in steepness of the learning curve after the sleep retention interval. Notably, this dip in performance after sleep has also been reported for finger-tapping tasks (cf. Eichenlaub et al, 2020). The performance dip is also present in the wake-first group (Figure 2) after the first interval. This observation suggests that picking up the task again after a period of time comes at a cost. Interestingly, this performance dip is no longer present after the second retention interval, indicating that the better the task proficiency, the easier it is to pick up juggling again. In other words, juggling has been better consolidated after additional training. Critically, our results show that participants with higher SO-spindle coupling strength have a smaller dip in performance after the retention interval, thus indicating a learning advantage.

      Figure 2

      (A) Number of successful three-ball cascades (mean ± standard error of the mean [SEM]) of adolescents (circles) for the sleep-first (blue) and wake-first group (green) per juggling block. Grand average learning curve (black lines) as computed in (C) are superimposed. Dashed lines indicate the timing of the respective retention intervals that separate the three performance tests. Note that adolescents improve their juggling performance across the blocks. (B) Same conventions as in (A) but for adults (diamonds). Similar to adolescents, adults improve their juggling performance across the blocks regardless of group.

      We discuss the sleep effect on juggling in the discussion section (page 22 – 23, lines 502 – 514):

      "How relevant is sleep for real-life gross-motor memory consolidation? We found that sleep impacts the learning curve but did not affect task proficiency in comparison to a wake retention interval (Figure 2DE). Two accounts might explain the absence of a sleep effect on task proficiency. (1) Sleep rather stabilizes than improves gross-motor memory, which is in line with previous gross-motor adaption studies (Bothe et al, 2019; Bothe et al, 2020). (2) Pre-sleep performance is critical for sleep to improve motor skills (Wilhelm et al, 2012). Participants commonly reach asymptotic pre-sleep performance levels in finger tapping tasks, which is most frequently used to probe sleep effects on motor memory. Here we found that using a complex juggling task, participants do not reach asymptotic ceiling performance levels in such a short time. Indeed, the learning progression for the sleep-first and wake-first groups followed a similar trend (Figure 2AB), suggesting that more training and not in particular sleep drove performance gains."

      If indeed the authors keep the sleep aspect of this story, here are some comments regarding the physiology. The authors present several nice analyses in Figure 3. However, given the lack of behavioral difference between adolescents and adults (Fig 2D), they combine the groups when investigating behavior-physiology relationships. In some ways, then, Figure 3 has extraneous details to the point of motor learning and retention, and I believe the paper would benefit from more focus. If the authors keep their sleep story, I believe Figure 3 and 4 should be combined and some current figure panels in Figure 3 should be removed or moved to the supplementary information.

      We thank the reviewers for their suggestion and we agree that the figures of our manuscript would benefit from more focus. Therefore, we combined Figure 3 and 4 from the original manuscript into a revised Figure 3 in the updated version of the manuscript. In more detail, subpanels that explain our methodological approach can now be found in Figure 3 – figure supplement 1, while the updated Figure 3 now focuses on developmental changes in oscillatory dynamics and SO-spindle coupling strength as well as their relationship to gross-motor learning.

      Updated Figure 3:

      (A) Left: topographical distribution of the 1/f corrected SO and spindle amplitude as extracted from the oscillatory residual (Figure 3 – figure supplement 1A, right). Note that adolescents and adults both display the expected topographical distribution of more pronounced frontal SO and centro-parietal spindles. Right: single subject data of the oscillatory residual for all subjects with sleep data color coded by age (darker colors indicate older subjects). SO and spindle frequency ranges are indicated by the dashed boxes. Importantly, subjects displayed high inter-individual variability in the sleep spindle range and a gradual spindle frequency increase by age that is critically underestimated by the group average of the oscillatory residuals (Figure 3 – figure supplement 1A, right). (B) Spindle peak locked epoch (NREM3, co-occurrence corrected) grand averages (mean ± SEM) for adolescents (red) and adults (black). Inset depicts the corresponding SO-filtered (2 Hz lowpass) signal. Grey-shaded areas indicate significant clusters. Note, we found no difference in amplitude after normalization. Significant differences are due to more precise SO-spindle coupling in adults. (C) Top: comparison of SO-spindle coupling strength between adolescents and adults. Adults displayed more precise coupling than adolescents in a centro-parietal cluster. T-scores are transformed to z-scores. Asterisks denote cluster-corrected two-sided p < 0.05. Bottom: Exemplary depiction of coupling strength (mean ± SEM) for adolescents (red) and adults (black) with single subject data points. Exemplary single electrode data (bottom) is shown for C4 instead of Cz to visualize the difference. (D) Cluster-corrected correlations between individual coupling strength and overnight task proficiency change (post – pre retention) for adolescents (red, circle) and adults (black, diamond) of the sleep-first group (left, data at C4). Asterisks indicate cluster-corrected two-sided p < 0.05. Grey-shaded area indicates 95% confidence intervals of the trend line. Participants with a more precise SO-spindle coordination show improved task proficiency after sleep. Note that the change in task proficiency was inversely related to the change in learning curve (cf. Figure 2D), indicating that a stronger improvement in task proficiency related to a flattening of the learning curve. Further note that the significant cluster formed over electrodes close to motor areas. (E) Cluster-corrected correlations between individual coupling strength and overnight learning curve change. Same conventions as in (D). Participants with more precise SO-spindle coupling over C4 showed attenuated learning curves after sleep.

      and

      Figure 3 - figure supplement 1

      (A) Left: Z-normalized EEG power spectra (mean ± SEM) for adolescents (red) and adults (black) during NREM sleep in semi-log space. Data is displayed for the representative electrode Cz unless specified otherwise. Note the overall power difference between adolescents and adults due to a broadband shift on the y-axis. Straight black line denotes cluster-corrected significant differences. Middle: 1/f fractal component that underlies the broadband shift. Right: Oscillatory residual after subtracting the fractal component (A, middle) from the power spectrum (A, left). Both groups show clear delineated peaks in the SO (< 2 Hz) and spindle range (11 – 16 Hz) establishing the presence of the cardinal sleep oscillations in the signal. (B) Top: Spindle frequency peak development based on the oscillatory residuals. Spindle frequency is faster at all but occipital electrodes in adults than in adolescents. T-scores are transformed to z-scores. Asterisks denote cluster-corrected two-sided p < 0.05. Bottom: Exemplary depiction of the spindle frequency (mean ± SEM) for adolescents (red) and adults (black) with single subject data points at Cz. (C) SO-spindle co-occurrence rate (mean ± SEM) for adolescents (red) and adults (black) during NREM2 and NREM3 sleep. Event co-occurrence is higher in NREM3 (F(1, 51) = 1209.09, p < 0.001, partial eta² = 0.96) as well as in adults (F(1, 51) = 11.35, p = 0.001, partial eta² = 0.18). (D) Histogram of co-occurring SO-spindle events in NREM2 (blue) and NREM3 (purple) collapsed across all subjects and electrodes. Note the low co-occurring event count in NREM2 sleep. (E) Single subject (top) and group averages (bottom, mean ± SEM) for adolescents (red) and adults (black) of individually detected, for SO co-occurrence-corrected sleep spindles in NREM3. Spindles were detected based on the information of the oscillatory residual. Note the underlying SO-component (grey) in the spindle detection for single subject data and group averages indicating a spindle amplitude modulation depending on SO-phase. (F) Grand average time frequency plots (-2 to -1.5s baseline-corrected) of SO-trough-locked segments (corrected for spindle co-occurrence) in NREM3 for adolescents (left) and adults (right). Schematic SO is plotted superimposed in grey. Note the alternating power pattern in the spindle frequency range, showing that SO-phase modulates spindle activity in both age groups.

      Why did the authors use Spearman rather than Pearson correlations in Figure 4? Was it to reduce the influence of the outlier subject? They should minimally clarify and justify this, since it is less conventional in this line of research. And it would be useful to know if the relationship is significant with Pearson correlations when robust regression is applied. I see the authors are using MATLAB, and the robustfit toolbox (https://www.mathworks.com/help/stats/robustfit.html) is a simple way to address this issue.

      We thank the reviewers for their suggestion. We agree that, on inspecting the scatter plots, the correlations appear as though they could be severely influenced by two outliers in the adult group. Because this is an important matter, we recalculated all previously reported correlations without the two outliers (Figure R4, left column) and followed the reviewer’s suggestion to also compute robust regression (Figure R4, right column) and found no substantial deviation from our original results.

      In more detail, an increase in task proficiency resulted in a flattening of the learning curve when removing outliers (Figure R4A, rhos = -0.70, p < 0.001) and when applying robust regression analysis (Figure R4B, b = -0.30, t(67) = -10.89, rho = -0.80, p < 0.001). Likewise, higher coupling strength still predicted better task proficiency (mean rho = 0.35, p = 0.029, cluster-corrected) and flatter learning curves after sleep (rho = -0.44, p = 0.047, cluster-corrected) when removing the outliers (Figure R4CE) and when calculating robust regression (Figure R4DF, task proficiency: b = 82.32, t(40) = 3.12, rho = 0.45, p = 0.003; learning curve: b = -26.84, t(40) = -2.96, rho = -0.43, p = 0.005). Furthermore, we calculated Spearman rank correlations and cluster-corrected Spearman rank correlations in our original manuscript to mitigate the impact of outliers, even though Pearson correlations are more widely used in the field. Therefore, we still report Spearman rank correlations for single electrodes instead of robust correlations, as this is more consistent with the cluster-correlation analyses.
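
      As a minimal illustration of why a rank-based correlation is less sensitive to a single extreme observation, the sketch below compares Pearson and Spearman coefficients on made-up numbers before and after adding one outlier. The data and helper names are hypothetical and unrelated to the study's measurements; the code uses only the Python standard library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data with a clear positive trend.
x_clean = [1, 2, 3, 4, 5, 6]
y_clean = [2, 1, 4, 3, 6, 5]

# Same data plus one extreme observation that runs against the trend.
x_outlier = x_clean + [7]
y_outlier = y_clean + [-20]

pearson_clean = pearson(x_clean, y_clean)
spearman_clean = spearman(x_clean, y_clean)
pearson_outlier = pearson(x_outlier, y_outlier)
spearman_outlier = spearman(x_outlier, y_outlier)

print(f"clean:   Pearson {pearson_clean:.3f}, Spearman {spearman_clean:.3f}")
print(f"outlier: Pearson {pearson_outlier:.3f}, Spearman {spearman_outlier:.3f}")
```

      In this toy case the single added point flips the sign of the Pearson coefficient, whereas the Spearman coefficient shrinks but stays positive, because the outlier contributes only its rank and not its magnitude.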

      We now use robust trend lines instead of linear trend lines in our scatter plots. Further, we added the correlations without outliers (Figure R4ACE) to the supplements as Figure 2 – figure supplement 1D and Figure 3 – figure supplement 2 FG. These additional analyses are now reported in the results section of the revised manuscript (page 9, lines 186 – 191):

      "[…] we confirmed a strong negative correlation between the change (post retention values – pre retention values) in task proficiency and the change in learning curve after the retention interval (Figure 2F; rhos = -0.71, p < 0.001), which also remained strong after outlier removal (Figure 2 – figure supplement 1D). This result indicates that participants who consolidate their juggling performance after a retention interval show slower gains in performance."

      And (page 16, lines 343 – 346):

      "[…] Furthermore, our results remained consistent when including coupled spindle events in NREM2 (Figure 3 – figure supplement 2E) and after outlier removal (Figure 3 – figure supplement 2FG)."

      Furthermore, we now state in the methods section that we specifically utilized Spearman rank correlations to mitigate the impact of outliers in our analyses (page 35, lines 808 – 813):

      "For correlational analyses we utilized spearman rank correlations (rhos; Figure 2F & Figure 3DE) to mitigate the impact of possible outliers as well as cluster-corrected spearman rank correlations by transforming the correlation coefficients to t-values (p < 0.05) and clustering in the space domain (Figure 3DE). Linear trend lines were calculated using robust regression."

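      The transformation of correlation coefficients to t-values mentioned above can be sketched as follows, assuming the standard conversion for a correlation based on n paired observations with n - 2 degrees of freedom. The function name and the example values are hypothetical, and the subsequent cluster formation and permutation steps are not shown.

```python
from math import sqrt


def rho_to_t(rho, n):
    """Convert a correlation coefficient from n pairs into a t-value with n - 2 df."""
    if n < 3:
        raise ValueError("at least three paired observations are required")
    if abs(rho) >= 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return rho * sqrt((n - 2) / (1 - rho ** 2))


# Hypothetical example: a coefficient of 0.5 from 27 paired observations.
t_value = rho_to_t(0.5, 27)
print(f"t(25) = {t_value:.3f}")
```

      Applying such a conversion at every electrode yields a t-map that can then be thresholded and clustered across neighbouring channels.
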
      Figure R4

      (A) Spearman rank correlation between task proficiency change and learning curve change collapsed across adolescents (red dots) and adults (black diamonds) after removing two outlier subjects in the adult age group. Grey-shaded area indicates 95% confidence intervals of the robust trend line. (B) Robust regression of task proficiency change and learning curve change of the original sample. (C) Cluster-corrected correlations (right) between individual coupling strength and overnight task proficiency change (post – pre retention) after outlier removal (left, Spearman correlation at C4, uncorrected). Asterisks indicate cluster-corrected two-sided p < 0.05. (D) Robust regression of coupling strength at C4 and task proficiency of the original sample. (E) Same conventions as in (C) but for overnight learning curve change. (F) Same conventions as in (D) but for overnight learning curve change.

      Additionally, with only a single night of recording data, it is impossible to disentangle possible trait-based sleep characteristics (e.g., Subject 1 has high SO-spindle coupling in general and retains motor memories well, but these are independent of each other) from a specific, state-based account (e.g., Subject 1's high SO-spindle coupling on night 1 specifically led to their improved retention or change in learning, etc., and this is unrelated to their general SO-spindle coupling or motor performance abilities). Clearly, many studies face this limitation, but this should be acknowledged.

      We thank the reviewers for their important remark. We agree that it is impossible to make a sound statement about whether our reported correlations represent trait- or state-based aspects of the sleep and learning relationship with the data that we have reported in the manuscript. However, while we are lacking a proper baseline condition without any task engagement, we still recorded polysomnography for all subjects during an adaptation night. Given the expected pronounced differences in sleep architecture between the adaptation nights and learning nights (see Table R3 for an overview collapsed across both age groups), we initially refrained from entering data from the adaptation nights into our original analyses, but we now fully report the data below. Note that the differences are driven by the adaptation night, where subjects first have to adjust to sleeping with attached EEG electrodes in a sleep laboratory.

      Table R3. Sleep architecture (mean ± standard deviation) for the adaptation and learning night collapsed across both age groups. Nights were compared using paired t-tests

      To further clarify whether subjects with high coupling strength have a motor learning advantage (i.e. trait-effect) or whether a learning-induced enhancement of coupling strength is indicative of improved overnight memory change (i.e. state-effect), we ran additional analyses using the data from the adaptation night. Note that the coupling strength metric was not impacted by differences in event number and our correlations with behavior were not influenced by sleep architecture (please refer to our answer to issue #7 for the results). Therefore, we considered it appropriate to also utilize data from the adaptation night.

      First, we correlated SO-spindle coupling strength obtained from the adaptation night with the coupling strength in the learning night. We found that overall, coupling strength is highly correlated between the two measurements (mean rho across all channels = 0.55, Figure R5A), supporting the notion that coupling strength remains rather stable within the individual (i.e. trait), similar to what has been reported about the stable nature of sleep spindles as a “neural finger-print” (De Gennaro & Ferrara, 2003; De Gennaro et al, 2005; Purcell et al, 2017).

      To investigate a possible state-effect for coupling strength and motor learning, we calculated the difference in coupling strength between the two nights (learning night – adaptation night) and correlated these values with the overnight change in task proficiency and learning curve. We identified no significant correlations with a learning-induced coupling strength change, neither for task proficiency nor for learning curve change (Figure R5B). Note that there was a positive correlation of coupling strength change with overnight task proficiency change at Cz (Figure R5B, left); however, it did not survive cluster-corrected correlational analysis (rhos = 0.34, p = 0.15). Combined, these results favor the conclusion that our correlations between coupling strength and learning reflect a trait-like rather than a state-like relationship. This is in line with the interpretation of our previous studies that SO-spindle coupling strength reflects the efficiency and integrity of the neuronal pathway between neocortex and hippocampus that is paramount for memory networks and the information transfer during sleep (Hahn et al, 2020; Helfrich et al, 2019; Helfrich et al, 2018; Winer et al, 2019). For a comprehensive review please see Helfrich et al (2021), which argued that SO-spindle coupling predicts the integrity of memory pathways and therefore correlates with various metrics of behavioral performance or structural integrity.
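
      The trait/state logic described above can be sketched as follows. This is a minimal illustration with hypothetical per-subject values at a single electrode, not the original analysis code; the helper names (`average_ranks`, `spearman`) and all numbers are assumptions introduced only for the example, and no cluster correction is applied.

```python
# Illustrative sketch only: hypothetical data, single electrode, no cluster correction.

def average_ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical coupling strength per subject on each night.
adaptation_night = [0.21, 0.35, 0.28, 0.40, 0.18, 0.31]
learning_night = [0.24, 0.33, 0.30, 0.45, 0.20, 0.29]
# Hypothetical overnight change in task proficiency per subject.
proficiency_change = [1.0, 4.0, 2.5, 6.0, -1.0, 3.0]

# Trait question: is coupling strength stable across the two nights?
trait_rho = spearman(adaptation_night, learning_night)

# State question: does the night-to-night change in coupling relate to behavior?
coupling_change = [l - a for l, a in zip(learning_night, adaptation_night)]
state_rho = spearman(coupling_change, proficiency_change)
```

      A high `trait_rho` together with a weak `state_rho` would correspond to the trait-like pattern reported here.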

      Figure R5

      (A) Topographical plot of spearman rank correlations of coupling strength in the adaptation night and learning night across all subjects. Overall coupling strength was highly correlated between the two measurements. (B) Cluster-corrected correlation between learning induced coupling strength changes (learning night – adaptation night) and overnight change in task proficiency (left) as well as learning curve (right). We found no significant clusters, although correlations showed similar trends as our original analyses, with more learning induced changes in coupling strength resulting in better overnight task proficiency and flattened learning curves.

      We have now added the additional state-trait analyses (Figure R5) to the updated manuscript as Figure 3 – figure supplement 2HI and report them in the results section (page 17, lines 361 – 375):

      "Finally, we investigated whether subjects with high coupling strength have a gross-motor learning advantage (i.e. trait-effect) or a learning induced enhancement of coupling strength is indicative for improved overnight memory change (i.e. state-effect). First, we correlated SO-spindle coupling strength obtained from the adaptation night with the coupling strength in the learning night. We found that overall, coupling strength is highly correlated between the two measurements (mean rho across all channels = 0.55, Figure 3 – figure supplement 2H), supporting the notion that coupling strength remains rather stable within the individual (i.e. trait). Second, we calculated the difference in coupling strength between the learning night and the adaptation night to investigate a possible state-effect. We found no significant cluster-corrected correlations between coupling strength change and task proficiency- as well as learning curve change (Figure 3 – figure supplement 2I).

      Collectively, these results indicate the regionally specific SO-spindle coupling over central EEG sensors encompassing sensorimotor areas precisely indexes learning of a challenging motor task."

      We further refer to these new results in the discussion section (page 23, lines 521 – 528):

      "Moreover, we found that SO-spindle coupling strength remains remarkably stable between two nights, which also explains why a learning-induced change in coupling strength did not relate to behavior (Figure 3 – figure supplement 2I). Thus, our results primarily suggest that strength of SO-spindle coupling correlates with the ability to learn (trait), but does not solely convey the recently learned information. This set of findings is in line with recent ideas that strong coupling indexes individuals with highly efficient subcortical-cortical network communication (Helfrich et al, 2021)."

      Additionally, we now provide descriptive data of the adaptation and learning night (Table R3) in the Supplementary file – table 1 and explicitly mention the adaptation night in the results section, which was previously only mentioned in the method section (page 6, lines 101 – 105):

      "Polysomnography (PSG) was recorded during an adaptation night and during the respective sleep retention interval (i.e. learning night) except for the adult wake-first group (for sleep architecture descriptive parameters of the adaptation night and learning night as well as for adolescents and adults see Supplementary file – table 1 & 2)."

      Reviewer #2 (Public Review):

      In this study Hahn and colleagues investigate the role of Slow-oscillation spindle coupling for motor memory consolidation and the impact of brain maturation on these interactions. The authors employed a real-life gross-motor task, where adolescents and adults learned to juggle. They demonstrate that during post-learning sleep SO-spindles are stronger coupled in adults as compared to adolescents. The authors further show, that the strength of SO-spindle coupling correlates with overnight changes in the learning curve and task proficiency, indicating a role of SO-spindle coupling in motor memory consolidation.

      Overall, the topic and the results of the present study are interesting and timely. The authors employed state-of-the-art analyses, carefully taking the general variability of oscillatory features into account. It also has to be acknowledged that the authors moved away from using rather artificial lab-tasks to study the consolidation of motor memories (as it is standard in the field), adding ecological validity to their findings. However, some features of their analyses need further clarification.

      We thank the reviewer for their positive assessment of our manuscript. Incorporating the encouraging and helpful feedback, we believe that we substantially improved the clarity and robustness of our analyses.

      1) Supporting and extending previous work of the authors (Hahn et al, 2020), SO-spindle coupling over centro-parietal areas was stronger in adults as compared to adolescents. Despite these differences in the EEG results the authors collapsed the data of adults and adolescents for their correlational analyses (Fig. 4a and 4b). Why would the authors think that this procedure is viable (also given the fact that different EEG systems were used to record the data)?

      We thank the reviewers for the opportunity to clarify why we think it is viable to collapse the data of adolescents and adults for our correlational analyses. In the following we split our answers based on the two points raised by the reviewers: (1) electrophysiological differences (i.e. coupling strength) between the groups and (2) potential signal differences due to different EEG systems.

      1. Electrophysiological differences

      First, upon inspecting the original Figure 4, it is apparent that the coupling strength of the combined sample does not form isolated clusters for each age group. In other words, while adult coupling strength is on the higher and adolescent coupling strength on the lower end due to the developmental increase in coupling strength we reported in the original Figure 3F, both samples overlap, forming a linear trend. Second, when running the correlational analyses between coupling strength and task proficiency as well as learning curve separately for each age group, we found that they follow the same direction (Figure R3). Adolescents with higher coupling strength show better task proficiency (Figure R3A, rhos = 0.66, p = 0.005). This effect was also present when using robust regression (b = 109.97, t(15) = 3.13, rho = 0.63, p = 0.007). Like adolescents, adults with higher coupling strength at C4 displayed better task proficiency after sleep (Figure R3B, rhos = 0.39, p = 0.053). This relationship was stronger when using robust regression (b = 151.36, t(23) = 3.17, rho = 0.56, p = 0.004). For learning curves, we found the expected negative correlation at C4 for adolescents (Figure R3C, rhos = -0.57, p = 0.020) and adults (Figure R3D, rhos = -0.44, p = 0.031). Results were comparable when using robust regression (adolescents: b = -59.58, t(15) = -2.94, rho = -0.60, p = 0.010; adults: b = -21.99, t(23) = -1.71, rho = -0.37, p = 0.101).

      Taken together, these results demonstrate that adolescents and adults show the same effects in the same direction at the same electrode, making it highly unlikely that our results arose by chance or that our initial correlation analyses were driven by just one group.

      Additionally, we already controlled for age in our original analyses using partial correlations (also refer to our answer to issue #6). Hence, our additional analyses provide additional support that it is viable to collapse the analyses across both age groups even though they differ in coupling strength.
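
      For reference, controlling a rank correlation for a single covariate such as age can be written as a first-order partial correlation. The expression below is a sketch that assumes the standard formula applied to Spearman coefficients, with x denoting coupling strength, y the behavioral change and z age; the exact implementation used in the original analyses may differ.

```latex
\rho_{xy \cdot z} = \frac{\rho_{xy} - \rho_{xz}\,\rho_{yz}}{\sqrt{\left(1 - \rho_{xz}^{2}\right)\left(1 - \rho_{yz}^{2}\right)}}
```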

      2. Different EEG-systems

      The reviewers also raise the question whether our analyses might be impacted by the different EEG systems we used to record our data. This is an important concern especially when considering that cross-frequency coupling analyses can be severely confounded by differences in signal properties (Aru et al, 2015). In our sample, the strongest impact factor on signal properties is most likely age, given the broadband power differences in the power spectrum we found between the groups (original Figure 3A). Importantly, we also found a similar systematic power difference in our longitudinal study using the same ambulatory EEG system for both data recordings (Hahn et al, 2020). This is in line with numerous other studies demonstrating age related EEG power changes in broadband- as well as SO and sleep spindle frequency ranges (Campbell & Feinberg, 2016; Feinberg & Campbell, 2013; Helfrich et al, 2018; Kurth et al, 2010; Muehlroth et al, 2019; Muehlroth & Werkle-Bergner, 2020; Purcell et al, 2017). Therefore, we already had to take differences in signal property into account for our cross-frequency analyses, regardless of whether the underlying cause is an age difference or different signal-to-noise ratios of different EEG systems.

      To mitigate confounds in the signal, we used a data-driven and individualized approach detecting SO and sleep spindle events based on individualized frequency bands and a 75-percentile amplitude criterion relative to the underlying signal. Additionally, we z-normalized all spindle events prior to the cross-frequency coupling analyses (Figure R3E). We found no amplitude differences around the spindle peak (point of SO-phase readout) between adolescents that were recorded with an ambulatory amplifier system (alphatrace) and adults that were recorded with a stationary amplifier system (neuroscan) using cluster-based random permutation testing. This was also the case for the SO-filtered (< 2 Hz) signal (Figure R3E, inset). Critically, the significant differences in amplitude from -1.4 to -0.8 s (p = 0.023, d = -0.73) and 0.4 to 1.5 s (p < 0.001, d = 1.1) are not caused by age related differences in power or different EEG-systems but instead by the increased coupling strength (i.e. higher coupling precision of spindles to SOs) in adults giving rise to a more pronounced SO-wave shape when averaging across spindle peak locked epochs.
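
      The two normalization ideas in this paragraph, a percentile-based amplitude criterion and z-normalization of event epochs, can be illustrated with a small sketch. It assumes that the threshold is the 75th percentile of candidate-event amplitudes and that each epoch is z-scored against its own mean and standard deviation; the function names and values are hypothetical and do not reproduce the actual detection pipeline.

```python
# Illustrative sketch only: hypothetical amplitudes and a hypothetical epoch.
from statistics import mean, pstdev, quantiles


def amplitude_threshold(amplitudes):
    """75th percentile of the candidate-event amplitudes (third quartile)."""
    return quantiles(amplitudes, n=4)[2]


def z_normalize(epoch):
    """Z-score one epoch against its own mean and (population) standard deviation."""
    mu = mean(epoch)
    sigma = pstdev(epoch)
    return [(value - mu) / sigma for value in epoch]


# Hypothetical trough-to-peak amplitudes of candidate events (arbitrary units).
candidate_amplitudes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
threshold = amplitude_threshold(candidate_amplitudes)
# Keep only events exceeding the individual, data-driven threshold.
accepted = [a for a in candidate_amplitudes if a > threshold]

# Hypothetical spindle-locked epoch; after z-scoring it has mean 0 and SD 1,
# so absolute power differences between recordings no longer affect its scale.
normalized_epoch = z_normalize([12.0, 15.0, 9.0, 14.0, 10.0])
```

      Because the threshold and the z-score are both computed relative to each subject's own signal, absolute amplitude differences between age groups or amplifiers do not carry over into the normalized epochs.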

      Consequently, our analysis pipeline already controlled for possible differences in signal property introduced through different amplifier systems. Nonetheless, we also wanted to directly compare the signal-to-noise ratio of the ambulatory and stationary amplifier systems. However, we only obtained data from both amplifier systems in the adult sleep first group, because we recorded EEG during the juggling learning phase with the ambulatory system in addition to the PSG with the stationary system. First, we computed the power spectra in the 1 to 49 Hz frequency range during the juggling learning phase (ambulatory) and during quiet wakefulness (stationary) for every subject in the adult sleep first group in 10-second segments. Next, we computed the signal-to-noise ratio (mean/standard deviation) of the power spectra per frequency across all segments. We only found a small negative cluster from 21.9 to 22.5 Hz (p = 0.042, d = 0.53; Figure R3F), which did not pertain to our frequency bands of interest. Critically, the signal-to-noise ratio of both amplifiers converged in the upper frequency bands approaching the noise floor, therefore strongly supporting the notion that both systems in fact provided highly comparable estimates.
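
      The signal-to-noise measure described here (mean divided by standard deviation of the power values per frequency, taken across segments) can be sketched as below. The spectra are hypothetical, the function name `snr_per_frequency` is introduced only for illustration, and the use of the sample standard deviation is an assumption.

```python
# Illustrative sketch only: hypothetical spectra, sample SD assumed.
from statistics import mean, stdev


def snr_per_frequency(segment_spectra):
    """Mean/SD of power across segments, computed separately for each frequency bin.

    segment_spectra: list of spectra, one per segment; each spectrum is a list
    holding one power value per frequency bin.
    """
    n_bins = len(segment_spectra[0])
    snr = []
    for bin_index in range(n_bins):
        powers = [spectrum[bin_index] for spectrum in segment_spectra]
        snr.append(mean(powers) / stdev(powers))
    return snr


# Hypothetical power values for three 10-second segments and four frequency bins.
spectra = [
    [10.0, 5.0, 2.0, 1.0],
    [12.0, 6.0, 2.5, 1.1],
    [11.0, 4.0, 1.5, 0.9],
]
snr = snr_per_frequency(spectra)
```

      Computing this curve once per amplifier and comparing the two curves frequency by frequency corresponds to the comparison shown in Figure R3F.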

      In conclusion, both age groups display highly similar effects in the same direction when correlating coupling strength with behavior. Further, after individualization and normalization of the analytical signal, we found no differences in signal properties that would confound the cross-frequency analysis. Lastly, we did not find systematic differences in signal-to-noise ratio between the different EEG-systems. Thus, we believe it is justified to collapse the data across all participants for the correlational analyses, as it combines both the developmental aspect of enhanced coupling precision from adolescence to adulthood and the behavioral relevance for motor learning, which we deem a critical research advance from our previous study.

      Figure R3

      (A) Cluster-corrected correlations (right) between individual coupling strength and overnight task proficiency change (post – pre retention) for adolescents of the sleep-first group (left, spearman correlation at C4, uncorrected). Asterisks indicate cluster-corrected two-sided p < 0.05. Grey-shaded area indicates 95% confidence intervals of the robust trend line. Participants with a more precise SO-spindle coordination show improved task proficiency after sleep. (B) Cluster-corrected correlation of coupling strength and overnight task proficiency change for adults. Same conventions as in (A). Similar trend of higher coupling strength predicting better task proficiency after sleep. (C) Cluster-corrected correlation of coupling strength and overnight learning curve change for adolescents. Same conventions as in (A). Higher coupling strength related to a flatter learning curve after sleep. (D) Cluster-corrected correlation of coupling strength and overnight learning curve change for adults. Same conventions as in (A). Higher coupling strength related to a flatter learning curve after sleep. (E) Spindle peak locked epoch (NREM3, co-occurrence corrected) grand averages (mean ± SEM) for adolescents (red) and adults (black). Inset depicts the corresponding SO-filtered (2 Hz lowpass) signal. Black lines indicate significant clusters. Note, we found no difference in amplitude after normalization. Significant differences are due to more precise SO-spindle coupling in adults. Spindle frequency is blurred due to individualized spindle detection. (F) Signal-to-noise ratio for the stationary EEG amplifier (green) during quiet wakefulness and for the ambulatory EEG amplifier (purple) during juggling training. Grey shaded area denotes cluster-corrected p < 0.05. Note that signal-to-noise ratio converges in the higher frequency ranges.

      We have now added Figure R3E as Figure 3B to the revised version of the manuscript to demonstrate that there were no systematic differences between the two age groups in the analytical signal due to the expected age related power differences or EEG-systems. Specifically, we now state in the results section (page 13 – 14, lines 282 – 294):

      "We assessed the cross frequency coupling based on z-normalized spindle epochs (Figure 3B) to alleviate potential power differences due to age (Figure 3 – figure supplement 1A) or different EEG-amplifier systems that could potentially confound our analyses (Aru et al, 2015). Importantly, we found no amplitude differences around the spindle peak (point of SO-phase readout) between adolescents and adults using cluster-based random permutation testing (Figure 3B), indicating an unbiased analytical signal. This was also the case for the SO-filtered (< 2 Hz) signal (Figure 3B, inset). Critically, the significant differences in amplitude from -1.4 to -0.8 s (p = 0.023, d = -0.73) and 0.4 to 1.5 s (p < 0.001, d = 1.1) are not caused by age related differences in power or different EEG-systems but instead by the increased coupling strength (i.e. higher coupling precision of spindles to SOs) in adults giving rise to a more pronounced SO-wave shape when averaging across spindle peak locked epochs."

      Further, we added the correlational analyses that we computed separately for the age groups (Figure R3A-D) to the revised manuscript (Figure 3 – figure supplement 2CD) as they further substantiate our claims about the relationship between SO-spindle coupling and gross-motor learning.

      We now refer to these analyses in the results section (page 16, lines 338 – 343):

      "Critically, when computing the correlational analyses separately for adolescents and adults, we identified highly similar effects at electrode C4 for task proficiency (Figure 3 – figure supplement 2C) and learning curve (Figure 3 – figure supplement 2D) in each group. These complementary results demonstrate that coupling strength predicts gross-motor learning dynamics in both, adolescents as well as adults, and further show that this effect is not solely driven by one group."

      2) The authors might want to explicitly show that the reported correlations (with regards to both learning curve and task proficiency change) are not driven by any outliers.

      We thank the reviewers for their suggestion. We agree that when inspecting the scatter plots it looks as though the correlations could be severely influenced by two outliers in the adult group. Because this is an important matter, we recalculated all previously reported correlations without the two outliers (Figure R4, left column) and followed the reviewer’s suggestion to also compute robust regression (Figure R4, right column) and found no substantial deviation from our original results.

      In more detail, increase in task proficiency resulted in flattening of the learning curve when removing outliers (Figure R4A, rhos = -0.70, p < 0.001) and when applying robust regression analysis (Figure R4B, b = -0.30, t(67) = -10.89, rho = -0.80, p < 0.001). Likewise, higher coupling strength still predicted better task proficiency (mean rho = 0.35, p = 0.029, cluster-corrected) and flatter learning curves after sleep (rho = -0.44, p = 0.047, cluster-corrected) when removing the outliers (Figure R4CE) and when calculating robust regression (Figure R4DF, task proficiency: b = 82.32, t(40) = 3.12, rho = 0.45, p = 0.003; learning curve: b = -26.84, t(40) = -2.96, rho = -0.43, p = 0.005). Furthermore, we calculated spearman rank correlations and cluster-corrected spearman rank correlations in our original manuscript, to mitigate the impact of outliers, even though Pearson correlations are more widely used in the field. Therefore, we still report spearman rank correlations for single electrodes instead of robust correlations as it is more consistent with the cluster-correlation analyses.

      We now use robust trend lines instead of linear trend lines in our scatter plots. Further, we added the correlations without outliers (Figure R4ACE) to the supplements as Figure 2 – figure supplement 1D and Figure 3 – figure supplement 2 FG. These additional analyses are now reported in the results section of the revised manuscript (page 9, lines 186 – 191):

      "[…] we confirmed a strong negative correlation between the change (post retention values – pre retention values) in task proficiency and the change in learning curve after the retention interval (Figure 2F; rhos = -0.71, p < 0.001), which also remained strong after outlier removal (Figure 2 – figure supplement 1D). This result indicates that participants who consolidate their juggling performance after a retention interval show slower gains in performance."

      And (page 16, lines 343 – 346):

      "[…] Furthermore, our results remained consistent when including coupled spindle events in NREM2 (Figure 3 – figure supplement 2E) and after outlier removal (Figure 3 – figure supplement 2FG)."

      Furthermore, we now state in the methods section that we specifically utilized Spearman rank correlations to mitigate the impact of outliers in our analyses (page 35, lines 808 – 813):

      "For correlational analyses we utilized spearman rank correlations (rhos; Figure 2F & Figure 3DE) to mitigate the impact of possible outliers as well as cluster-corrected spearman rank correlations by transforming the correlation coefficients to t-values (p < 0.05) and clustering in the space domain (Figure 3DE). Linear trend lines were calculated using robust regression."

      Figure R4:

      (A) Spearman rank correlation between task proficiency change and learning curve change collapsed across adolescents (red dot) and adults (black diamonds) after removing two outlier subjects in the adult age group. Grey-shaded area indicates 95% confidence intervals of the robust trend line. (B) Robust regression of task proficiency change and learning curve change of the original sample. (C) Cluster-corrected correlations (right) between individual coupling strength and overnight task proficiency change (post – pre retention) after outlier removal (left, spearman correlation at C4, uncorrected). Asterisks indicate cluster-corrected two-sided p < 0.05. (D) Robust regression of coupling strength at C4 and task proficiency of the original sample. (E) Same conventions as in (C) but for overnight learning curve change. (F) Same conventions as in (D) but for overnight learning curve change.
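
      The methods excerpt quoted above mentions transforming correlation coefficients to t-values before clustering. A common way to do this, given here as a sketch that assumes the standard conversion for a rank correlation computed from n subjects (the exact implementation in the analysis toolbox may differ), is:

```latex
t = \rho_s \sqrt{\frac{n - 2}{1 - \rho_s^{2}}}, \qquad \mathrm{df} = n - 2
```

      Electrodes whose t-value exceeds the threshold corresponding to p < 0.05 would then be grouped with neighbouring supra-threshold electrodes to form spatial clusters.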

      3) The sleep data of all participants (thus from both sleep first and wake first) were used to determine the features of SO-spindle coupling in adolescents and adults. Were there any differences between groups (sleep first vs. wake first)? This might be interesting in general but especially because only data of the sleep first group entered the subsequent correlational analyses.

      We thank the reviewers for their remark. We agree that adding additional information about possible differences between the sleep first and wake first groups would allow for a more comprehensive assessment of the reported data. We did not explain our reasoning to include only the sleep first groups for the correlation analyses clearly enough in the original manuscript. Unfortunately, we can only report data for the adolescents in our sample, because we did not record polysomnography (PSG) for the adult wake first group. This is also one of the two reasons why we focused on the sleep first groups for our correlational analyses.

      Adolescents in the sleep first group did not differ from adolescents in the wake first group in terms of sleep architecture (except REM (%), which did not correlate with behavior [task proficiency: rho = -0.17, p = 0.28; learning curve: rho = -0.02, p = 0.90]) as well as SO and sleep spindle event descriptive measures (see Table R2). Importantly, we found no differences in coupling strength between the two groups (Figure R2A).

      Table R2. Summary of sleep architecture and SO/spindle event descriptive measures (at electrode C4) of adolescents in the sleep first and wake first group (mean ± standard deviation). Independent t-tests were used for comparisons

      The second reason why we focused our analyses on sleep first was that adolescents in the wake first group had higher task proficiency after the sleep retention interval than the sleep first group (Figure R2B; t(23) = -2.24, p = 0.034). This difference in performance is directly explained by the additional juggling test that the wake first group performed at the time point of their learning night, which should be considered as additional training. Therefore, we excluded the wake first group from our correlational analyses because sleep and wake first group are not comparable in terms of juggling training during the night when we assessed SO-spindle coupling strength.

      Figure R2

      (A) Comparison of SO-spindle coupling strength in the adolescent sleep first (blue) and wake first (green) group using cluster-based random permutation testing (Monte-Carlo method, cluster alpha 0.05, max size criterion, 1000 iterations, critical alpha level 0.05, two-sided). Left: exemplary depiction of coupling strength at electrode C4 (mean ± SEM). Right: z-transformed t-values plotted for all electrodes obtained from the cluster test. No significant clusters emerged. (B) Comparison of task proficiency between sleep first and wake first group after the sleep retention interval (mean ± SEM). Adolescents in the wake first group had higher task proficiency given the additional juggling performance test, which also reflects additional training.

      These additional analyses (Figure R2) and the summary statistics of sleep architecture and SO/spindle event descriptives of adolescents in the sleep first and wake first group (Table R2), are now reported in the revised version of the manuscript as Figure 3 – figure supplement 2AB and Supplementary file – table 7. We now explicitly explain our rationale of why we only considered participants in the sleep first group for our correlational analyses in the results section (page 6, lines 101 – 105):

      "Polysomnography (PSG) was recorded during an adaptation night and during the respective sleep retention interval (i.e. learning night) except for the adult wake-first group (for sleep architecture descriptive parameters of the adaptation night and learning night as well as for adolescents and adults see Supplementary file – table 1 & 2)"

      And (page 15, lines 311 – 320):

      "[…] Furthermore, given that we only recorded polysomnography for the adults in the sleep first group and that adolescents in the wake first group showed enhanced task proficiency at the time point of the sleep retention interval due to additional training (Figure 3 – figure supplement 2A), we only considered adolescents and adults of the sleep-first group to ensure a similar level of juggling experience (for summary statistics of sleep architecture and SO and spindle events of subjects that entered the correlational analyses see Supplementary file – table 6). Notably, we found no differences in electrophysiological parameters (i.e. coupling strength, event detection) between the adolescents of the wake first and sleep first group (Figure 3 – figure supplement 2B & Supplementary file – table 7)."

      4) To allow a more comprehensive assessment of the underlying data information with regards to general sleep descriptives (minutes, per cent of time spent in different sleep stages, overall sleep time etc.) as well as related to SOs, spindles and coupled events (e.g. number, density etc.) would be needed.

      We agree with the reviewers that additional information about sleep architecture and SO as well as sleep spindle characteristics is needed for a more comprehensive assessment of our data. We now added summary tables for sleep architecture and SO/spindle event descriptive measures for the whole sample (Table R4) and for the sleep first groups that we used for our correlational analyses (Table R5) to the supplementary material in the updated manuscript. It is important to note that, due to the longer sleep opportunity that we provided to adolescents to accommodate the overall higher sleep need in younger participants, adolescents and adults differed in most general sleep architecture markers and SO as well as sleep spindle descriptive measures. In addition, changes in sleep architecture are prominent during the maturational phase from adolescence to adulthood, which might introduce additional variance between the two age groups.

      Table R4. Summary of sleep architecture and SO/spindle event descriptive measures (at electrode C4) of adolescents and adults across the whole sample (mean ± standard deviation) in the learning night. Independent t-tests were used for comparisons

      Table R5. Summary of sleep architecture and SO/spindle event descriptive measures (at electrode C4) of adolescents and adults in the sleep first group (mean ± standard deviation) in the learning night. Independent t-tests were used for comparisons

      In order to ensure that our correlational analyses are not driven by these systematic differences between the two age groups, we used cluster-corrected partial correlations to control for sleep architecture markers (Figure R7) and SO/spindle descriptive measurements (Figure R8A). Critically, none of these possible confounders changed the pattern of our initial correlational analyses of coupling strength and task proficiency/learning curve. Additionally, we also controlled for differences in spindle event number by using a bootstrapped resampling approach. We randomly drew 200 spindle events in 100 iterations and subsequently recalculated the coupling strength for each subject. We found that resampled values and our original observation of coupling strength are almost perfectly correlated, indicating that differences in event number are unlikely to have an impact on coupling strength as long as there are at least 200 events (Figure R8B). Combined, these analyses demonstrate that our correlations between coupling strength and behavior are not influenced by the reported differences in sleep architecture and SO/spindle descriptive measures.
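
      The resampling control described above can be sketched as follows. The sketch rests on two assumptions that are not stated in this response: that coupling strength is quantified as the mean resultant vector length of the SO phases at the spindle peaks, and that events are drawn with replacement. Function names and the simulated phases are hypothetical and serve only to illustrate the logic.

```python
# Illustrative sketch only: simulated phases, assumed coupling metric.
import cmath
import random


def mean_vector_length(phases):
    """Length of the mean resultant vector of phase angles in radians.

    0 indicates uniformly scattered phases, 1 indicates identical phases.
    """
    resultant = sum(cmath.exp(1j * phase) for phase in phases) / len(phases)
    return abs(resultant)


def resampled_coupling_strength(phases, n_events=200, n_iterations=100, seed=0):
    """Average coupling strength over repeated random draws of n_events phases."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iterations):
        # Drawing with replacement is an assumption of this sketch.
        draw = rng.choices(phases, k=n_events)
        estimates.append(mean_vector_length(draw))
    return sum(estimates) / n_iterations


# Hypothetical subject: 600 spindle events with SO phases clustered around 0 rad.
simulation_rng = random.Random(1)
subject_phases = [simulation_rng.gauss(0.0, 0.8) for _ in range(600)]

original = mean_vector_length(subject_phases)
resampled = resampled_coupling_strength(subject_phases)
```

      Repeating this for every subject and correlating the `resampled` values with the `original` values across subjects corresponds to the check summarized in Figure R8B.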

      Figure R7

      Summary of cluster-corrected partial correlations of coupling strength with task proficiency (left) and learning curve (right) controlling for possible confounding factors. Asterisks indicate location of the detected cluster. The pattern of initial results remained highly stable.

      Figure R8

      (A) Summary of cluster-corrected partial correlations of coupling strength with task proficiency (left) and learning curve (right) controlling for SO/spindle descriptive measures at critical electrode C4. Asterisks indicate location of the detected cluster. The pattern of initial results remained highly stable. (B) Spearman correlation between resampled coupling strength (N = 200, 100 iterations) and original observation of coupling strength for adolescents (red circles) and adults (black diamonds), indicating that coupling strength is not influenced by spindle event number if at least 200 events are present. Grey-shaded area indicates 95% confidence intervals of the robust trend line.

      We now provide general sleep descriptives (Table R4 & R5) in the revised version of the manuscript as Supplementary file – table 2 & table 6. These data are referred to in the results section (page 6, lines 101 – 105):

      "Polysomnography (PSG) was recorded during an adaptation night and during the respective sleep retention interval (i.e. learning night) except for the adult wake-first group (for sleep architecture descriptive parameters of the adaptation night and learning night as well as for adolescents and adults see Supplementary file – table 1 & 2)."

      And (page 15, lines 311 – 318):

      "Furthermore, given that we only recorded polysomnography for the adults in the sleep first group and that adolescents in the wake first group showed enhanced task proficiency at the time point of the sleep retention interval due to additional training (Figure 3 – figure supplement 2A), we only considered adolescents and adults of the sleep-first group to ensure a similar level of juggling experience (for summary statistics of sleep architecture and SO and spindle events of subjects that entered the correlational analyses see Supplementary file – table 6)."

      The additional control analyses (Figure R7 & R8) are also now added to the revised manuscript as Figure 3 – figure supplement 3 & 4 in the results section (page 16, lines 356 – 360):

      "For a summary of the reported cluster-corrected partial correlations as well as analyses controlling for differences in sleep architecture see Figure 3 – figure supplement 3. Further, we also confirmed that our correlations are not influenced by individual differences in SO and spindle event parameters (Figure 3 – figure supplement 4)."

      5) The authors used partial correlations to rule out that age drove the relationship between coupling strength, learning curve and task proficiency. It seems like this analysis was done specifically for electrode C4, after having already established that coupling strength at electrode C4 correlates in general with changes in the learning curve and task proficiency. I think the claim that results were not driven by age as a confounding factor would be stronger if the authors used a cluster-corrected partial correlation in the first place (just as in the main analysis).

      The reviewers are correct that initially we only conducted the partial correlation for electrode C4. Following the reviewers' suggestion, we have now additionally computed cluster-corrected partial correlations similar to our main analysis. As in our original analyses, we found a significant positive central cluster (Figure R6A, mean rho = 0.40, p = 0.017) showing that higher coupling strength related to better task proficiency after sleep, and a negative cluster-corrected correlation at C4 showing that higher coupling strength was related to flatter learning curves after sleep (Figure R6B, rho = -0.47, p = 0.049), also when controlling for age.

      Figure R6

      (A) Cluster-corrected partial correlation of individual coupling strength in the learning night and overnight change in task proficiency (post – pre retention) collapsed across adolescents and adults, controlling for age. Asterisks indicate cluster-corrected two-sided p < 0.05. A similar significant cluster to the original analysis (Figure 4A) emerged comprising electrodes Cz and C4. (B) Same conventions as in A. Like in the original analysis (Figure 4B) a negative correlation between coupling strength at C4 and learning curve change survived cluster-corrected partial correlations when controlling for age.

      We now always report cluster-corrected partial correlations when controlling for possible confounding variables in the updated version of the manuscript (also see answer to issue #7). A summary of all computed partial correlations including Figure R6 can now be found as Figure 3 – figure supplement 3 & 4 in the revised manuscript.

      Specifically we now state in the results section (page 16 – 17, lines 347 – 360):

      "To rule out age as a confounding factor that could drive the relationship between coupling strength, learning curve and task proficiency in the mixed sample, we used cluster-corrected partial correlations to confirm their independence of age differences (task proficiency: mean rho = 0.40, p = 0.017; learning curve: rhos = -0.47, p = 0.049). Additionally, given that we found that juggling performance could underlie a circadian modulation we controlled for individual differences in alertness between subjects due to having just slept. We partialed out the mean PVT reaction time before the juggling performance test after sleep from the original analyses and found that our results remained stable (task proficiency: mean rho = 0.37, p = 0.025; learning curve: rhos = -0.49, p = 0.040). For a summary of the reported cluster-corrected partial correlations as well as analyses controlling for differences in sleep architecture see Figure 3 – figure supplement 3. Further, we also confirmed that our correlations are not influenced by individual differences in SO and spindle event parameters (Figure 3 – figure supplement 4)."

      And in the methods section (page 35, lines 813 – 814):

      "To control for possible confounding factors we computed cluster-corrected partial rank correlations (Figure 3 – figure supplement 3 and 4)."
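
      For readers unfamiliar with partial rank correlations, the core computation at a single electrode can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: it handles one covariate only, uses the first-order partial correlation formula on rank-transformed data, and omits the cluster-based permutation correction entirely. All function names are hypothetical.

```python
import math


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_spearman(x, y, covariate):
    """Spearman correlation of x and y with one covariate partialed out."""
    rx, ry, rz = ranks(x), ranks(y), ranks(covariate)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

      In this sketch, `x` would hold coupling strength at one electrode, `y` the behavioral change score, and `covariate` the confounder (for example age or mean PVT reaction time).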

      References

      Aru, J., Aru, J., Priesemann, V., Wibral, M., Lana, L., Pipa, G., Singer, W. & Vicente, R. (2015) Untangling cross-frequency coupling in neuroscience. Curr Opin Neurobiol, 31, 51-61.

      Bothe, K., Hirschauer, F., Wiesinger, H. P., Edfelder, J., Gruber, G., Birklbauer, J. & Hoedlmoser, K. (2019) The impact of sleep on complex gross-motor adaptation in adolescents. Journal of Sleep Research, 28(4).

      Bothe, K., Hirschauer, F., Wiesinger, H. P., Edfelder, J. M., Gruber, G., Hoedlmoser, K. & Birklbauer, J. (2020) Gross motor adaptation benefits from sleep after training. J Sleep Res, 29(5), e12961.

      Campbell, I. G. & Feinberg, I. (2016) Maturational Patterns of Sigma Frequency Power Across Childhood and Adolescence: A Longitudinal Study. Sleep, 39(1), 193-201.

      Dayan, E. & Cohen, L. G. (2011) Neuroplasticity subserving motor skill learning. Neuron, 72(3), 443-54.

      De Gennaro, L. & Ferrara, M. (2003) Sleep spindles: an overview. Sleep Med Rev, 7(5), 423-40.

      De Gennaro, L., Ferrara, M., Vecchio, F., Curcio, G. & Bertini, M. (2005) An electroencephalographic fingerprint of human sleep. Neuroimage, 26(1), 114-22.

      Dinges, D. F., Pack, F., Williams, K., Gillen, K. A., Powell, J. W., Ott, G. E., Aptowicz, C. & Pack, A. I. (1997) Cumulative sleepiness, mood disturbance, and psychomotor vigilance performance decrements during a week of sleep restricted to 4-5 hours per night. Sleep, 20(4), 267-77.

      Dinges, D. F. & Powell, J. W. (1985) Microcomputer Analyses of Performance on a Portable, Simple Visual Rt Task during Sustained Operations. Behavior Research Methods Instruments & Computers, 17(6), 652-655.

      Eichenlaub, J. B., Biswal, S., Peled, N., Rivilis, N., Golby, A. J., Lee, J. W., Westover, M. B., Halgren, E. & Cash, S. S. (2020) Reactivation of Motor-Related Gamma Activity in Human NREM Sleep. Front Neurosci, 14, 449.

      Feinberg, I. & Campbell, I. G. (2013) Longitudinal sleep EEG trajectories indicate complex patterns of adolescent brain maturation. American Journal of Physiology - Regulatory, Integrative and Comparative Physiology, 304(4), R296-303.

      Hahn, M., Heib, D., Schabus, M., Hoedlmoser, K. & Helfrich, R. F. (2020) Slow oscillation-spindle coupling predicts enhanced memory formation from childhood to adolescence. Elife, 9.

      Helfrich, R. F., Lendner, J. D. & Knight, R. T. (2021) Aperiodic sleep networks promote memory consolidation. Trends Cogn Sci.

      Helfrich, R. F., Lendner, J. D., Mander, B. A., Guillen, H., Paff, M., Mnatsakanyan, L., Vadera, S., Walker, M. P., Lin, J. J. & Knight, R. T. (2019) Bidirectional prefrontal-hippocampal dynamics organize information transfer during sleep in humans. Nature Communications, 10(1), 3572.

      Helfrich, R. F., Mander, B. A., Jagust, W. J., Knight, R. T. & Walker, M. P. (2018) Old Brains Come Uncoupled in Sleep: Slow Wave-Spindle Synchrony, Brain Atrophy, and Forgetting. Neuron, 97(1), 221-230 e4.

      Killgore, W. D. (2010) Effects of sleep deprivation on cognition. Prog Brain Res, 185, 105-29.

      Kurth, S., Jenni, O. G., Riedner, B. A., Tononi, G., Carskadon, M. A. & Huber, R. (2010) Characteristics of sleep slow waves in children and adolescents. Sleep, 33(4), 475-80.

      Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods, 164(1), 177-90.

      Muehlroth, B. E., Sander, M. C., Fandakova, Y., Grandy, T. H., Rasch, B., Shing, Y. L. & Werkle-Bergner, M. (2019) Precise Slow Oscillation-Spindle Coupling Promotes Memory Consolidation in Younger and Older Adults. Sci Rep, 9(1), 1940.

      Muehlroth, B. E. & Werkle-Bergner, M. (2020) Understanding the interplay of sleep and aging: Methodological challenges. Psychophysiology, 57(3), e13523.

      Niethard, N., Ngo, H. V. V., Ehrlich, I. & Born, J. (2018) Cortical circuit activity underlying sleep slow oscillations and spindles. Proceedings of the National Academy of Sciences of the United States of America, 115(39), E9220-E9229.

      Purcell, S. M., Manoach, D. S., Demanuele, C., Cade, B. E., Mariani, S., Cox, R., Panagiotaropoulou, G., Saxena, R., Pan, J. Q., Smoller, J. W., Redline, S. & Stickgold, R. (2017) Characterizing sleep spindles in 11,630 individuals from the National Sleep Research Resource. Nature Communications, 8, 15930.

      Van Dongen, H. P., Maislin, G., Mullington, J. M. & Dinges, D. F. (2003) The cumulative cost of additional wakefulness: dose-response effects on neurobehavioral functions and sleep physiology from chronic sleep restriction and total sleep deprivation. Sleep, 26(2), 117-26.

      Wilhelm, I., Metzkow-Meszaros, M., Knapp, S. & Born, J. (2012) Sleep-dependent consolidation of procedural motor memories in children and adults: the pre-sleep level of performance matters. Developmental Science, 15(4), 506-15.

      Winer, J. R., Mander, B. A., Helfrich, R. F., Maass, A., Harrison, T. M., Baker, S. L., Knight, R. T., Jagust, W. J. & Walker, M. P. (2019) Sleep as a potential biomarker of tau and beta-amyloid burden in the human brain. J Neurosci.

    1. Author Response:

      Reviewer #1:

      Maimon-Mor et al. examined the control of reaching movements in one-handers, who were born with a partial arm, and amputees, who lost their arm in adulthood. The authors hypothesized that since one-handers started using their artificial arm earlier in life than amputees, they would exhibit better motor control, as measured by point-to-point reaching accuracy. Surprisingly, they found the opposite: the reaching accuracy of one-handers is worse than that of amputees (and of controls using their non-dominant hand). This deficit in motor control was reflected in an increase in motor noise rather than consistent motor biases.

      Strengths:

      • I found the paper in general very well and clearly written.
      • The authors provide detailed analyses to examine various possible factors underlying deficits in reaching movements in one-handers and amputees, including age at which participants first used an artificial arm, current usage of the arm, performance in hand localization tasks, and statistical methods that control for potential confounding factors.
      • The results that one-handers, who start using the artificial arm at an early age, show worse motor control than amputees, who typically start using the arm during adulthood, are surprising and interesting. Also intriguing are the results that reaching accuracy is negatively correlated with the time of limbless experience in both groups. These results suggest that there is a plasticity window that is not anchored to a certain age, but rather to some interference (perhaps) from the time without the use of an artificial arm. In one-handers these two time intervals are confounded with one another, but the amputees make it possible to separate them. I think that the results have implications for understanding plasticity aspects of acquiring skills for using artificial limbs.

      Weaknesses:

      • While I found that one of the main conclusions of the paper is that the main factor related to increased motor noise is the time spent without the artificial arm, it felt that this was not emphasized as such. These results are not mentioned in the abstract and the correlation for amputees is not shown in a figure.

      We thank the reviewer for their comment. While it is true that motor noise correlated with time of limbless experience in both groups, we were hesitant to highlight the results found in amputees, considering the small number of participants, and lack of converging evidence (e.g., contrary to the congenital group, we did not find a strong main effect). For these reasons, we have chosen to include it in the manuscript but not highlight it or base our main conclusions on it. Following the reviewer’s comment, the correlation of the amputees’ data is now visualised in Figure 3. Moreover, while the behavioural correlation might be similar in both groups, from a neural standpoint, the limbless experience of a toddler with a developing brain is qualitatively different to that of an adult, with a fully developed brain, who has lost a limb. As such, we were hesitant to link these two findings into a single framework, however in the revised manuscript we highlight this tentative link.

      Discussion (4th paragraph):

      “In both the congenital and acquired groups, artificial arm reaching motor noise correlated with the amount of time they spent using only their residual limb. It is therefore tempting to link these two results under a unifying interpretation; however, this requires further research, considering the neural differences between the two groups.”

      Figure 3. Years of limbless experience before first artificial arm use in the acquired group. Relationship between years of limbless experience and (A) artificial arm reaching errors or (B) artificial arm motor noise in the acquired group.

      • The suggested mechanism of a deficit in visuomotor integration is not clear, and whether the results indeed point to this hypothesis. The results of the reaching task show that the one-handers exhibit higher motor noise and initial error direction than amputees. The results of the 2D localization task (the same as the standard reaching task but without visual feedback) show no difference in errors between the groups. First, it is not clear how the findings of the 2D localization task are in line with the results that one-handers show larger initial directional errors.

      We fully take on the reviewer’s comment regarding the vague use of the term visuomotor integration. In the revised manuscript, we have opted instead for a much broader term, suggesting a deficit in visual-based corrective movements, considering we are limited in our ability to infer the specific underlying mechanism from our result. We have also made changes to the abstract based on the reviewer’s comment (see below).

      With regards to discussing how the various results fit together, in the revised manuscript, these are now discussed more at length. In short, in the 2D localisation task (reaching without visual feedback), participants were not instructed to perform fast ballistic movements. Instead, participants were instructed that they could perform movements to correct for their initial aiming error (using proprioception). Together with the similar performance observed for the proprioceptive task, this strengthens our suggestion that the deficit in the congenital group is triggered by visual-driven corrections. These various considerations are now detailed as follows:

      Abstract:

      “Since we found no group differences when reaching without visual feedback, we suggest that the ability to perform efficient visually-based corrective movements, is highly dependent on either biological or artificial arm experience at a very young age.”

      Result (section 7, 1st paragraph):

      “From these results, we infer that early-life experience relates to a suboptimal ability to reduce the system’s inherent noise, and that this is possibly not related to the noise generated by the execution of the initial motor plan. Early life experience might therefore relate to better use of visual feedback in performing corrective movements. The continuous integration of visual and sensory input is at the heart of visually-driven corrective movements. Therefore, one possibility is that limited early life experience, results in suboptimal integration of information within the sensorimotor system.”

      Discussion (2nd paragraph):

      “When performing reaching movements without visual feedback (2D localisation task), the congenital group did not differ from the acquired or control group. This begs the question, if the congenital group has a deficit in motor planning why was it not evident in this task as well? In the 2D localisation task, unlike the main task, participants were allowed to make corrective movements. While they did not receive visual feedback, the proprioceptive and somatosensory feedback from the residual limb appears to be enough to allow them to correct for initial reaching errors and perform at the same level as the acquired and control group. Moreover, we did not find strong evidence for an impaired sense of localisation of either the residual or the artificial arm in the congenital group. As such, by elimination, our evidence suggests that the process of using visual information to perform corrective movements isn’t as efficient in the congenital group.”

      Discussion (2nd paragraph):

      “Lack of concurrent visual and motor experience during development might therefore cause a deficit in the ability to form the computational substrates and thus to efficiently use visual information in performing corrective movements.”

      Discussion (last paragraph):

      “By the process of elimination, we have nominated suboptimal visual feedback-based corrections to be the most likely cause underlying this motor deficit.”

      Second, I think that these results suggest that the deficiency in one-handers is with feedback responses rather than feedforward. This may also be supported by the correlation with age: early age is correlated with less end-point motor noise, rather than initial directional error. Analyses of feedback correction might help shedding more light on the mechanism. The authors mention that the participants were asked to avoid doing corrective movement and imposed a limit of 1 sec per reach to encourage that. But it is not clear whether participants actually followed these instructions. 1 sec could be enough time to allow feedback responses, especially for small amplitude movements (e.g., <10 cm).

      Please see below our response to the feedback correction analysis suggestion. Regarding corrective movements, we had the same concern as the reviewer which led us to use hand velocity data to identify first movement termination. We apologise if the experimental design and pre-processing procedures were not clear.

      In short, a 1 sec trial duration was imposed on all trials to generate a sense of time-pressure and encourage participants to perform fast ballistic movements. As we were worried that participants might still perform secondary corrective movements within this 1 sec window, for each trial, we used the hand velocity profile to identify the end of the first movement. Below, we have plotted the arm velocity from a single trial to illustrate this procedure. For this trial, the timepoint indicated by the circular marker has been identified as the time of the end of the first movement (See Methods for further information). For each trial, endpoint location was defined as the location of the arm at the movement termination timepoint defined by the kinematic data and not the endpoint at the 1 sec timepoint. It is worth noting that performing the same analysis using the endpoints recorded at the 1 sec timepoint did not generate different statistical results.

      This has now been further clarified in the text.

      Results (section 1, 1st paragraph):

      “Reaching performance was evaluated by measuring the mean absolute error participants made across all targets (see Figure 1C). The absolute error refers to the distance from the cursor’s position at the end of the first reach (endpoint) to the centre of the target in each trial. The endpoint of each trial was set as the arm location at the end of the first reaching movement, identified using the trial’s kinematic data (See Methods).”

      Methods (section: Data processing and analysis – main task):

      “Within the 1 sec movement time constraint, in some trials, participants still performed secondary corrective movements. We therefore used the tangential arm velocities to identify the end of the first reach in each trial (i.e., movement termination).”
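
      One way to identify the end of the first reach from a speed trace is sketched below. The exact criterion is given in the manuscript's Methods and is not reproduced here, so this is only an assumed rule: the first reach ends when, after the first speed peak, the tangential speed either falls below a small fraction of the peak or starts to rise again. The function name and both threshold parameters are hypothetical, and real traces would normally be smoothed first.

```python
def first_movement_end(speeds, onset_fraction=0.1, stop_fraction=0.05):
    """Return the sample index at which the first reach ends.

    speeds: tangential arm speed per sample for one trial.
    onset_fraction, stop_fraction: hypothetical thresholds, expressed as
    fractions of the maximum speed in the trial.
    """
    peak_overall = max(speeds)
    if peak_overall <= 0:
        raise ValueError("no movement in trace")
    onset_threshold = onset_fraction * peak_overall
    stop_threshold = stop_fraction * peak_overall

    # 1. Movement onset: first sample above the onset threshold.
    i = next(k for k, s in enumerate(speeds) if s > onset_threshold)

    # 2. Climb to the first speed peak.
    while i + 1 < len(speeds) and speeds[i + 1] >= speeds[i]:
        i += 1

    # 3. Descend until the arm is nearly still or the speed rises again;
    #    a rise marks the start of a secondary, corrective movement.
    while (
        i + 1 < len(speeds)
        and speeds[i] > stop_threshold
        and speeds[i + 1] < speeds[i]
    ):
        i += 1
    return i
```

      The endpoint of the trial would then be the arm position at the returned sample rather than the position at the 1 sec limit.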

      Reviewer #2:

      This is a broad and ambitious study that is fairly unique in scope - the questions it seeks to answer are difficult to answer scientifically, and yet the depth of those questions and the framework in which the study is founded seem out of place in a clinical journal.

      And yet, as a scientist and clinician, I found myself objecting to the claims of the authors, only to have them address my objection in the very next section. The results are surprising, but compelling - the authors have done an excellent job of untangling a very complicated question, and they have tested (for our field) a large number of subjects.

      The main two results of the paper, from my perspective, are as follows:

      1) Persons with an amputation can form better models of new environments, such as manipulandums, than can those with congenital deficiencies. This result is interesting because a) the task did not depend on significant use of the device (they were able to use their intact musculature for the reaching-based task), and b) the results were not influenced by the devices used by the subjects (cosmetic, body-powered, or myoelectric).

      2) Persons with congenital deficiency fit earlier in life had less error than those fit later in life.

      Taken together, these results suggest that during early childhood the brain is better able to develop the foundation necessary to develop internal models and that if this is deprived early in childhood, it cannot be regained later in life - even if subjects have MORE experience. (E.g., those with congenital deficiencies had more experience using their prosthetic arm than those with amputation, and yet scored worse).

      The questions analyzed by the researchers are excellent and the statistical methods are generally appropriate. My only minor concern is that the authors occasionally infer that two groups are the same when a large p-value is reported, whereas large p-values do not convey that the groups are the same; only that they cannot be proven to be different. The authors would need to use a technique such as ICC or analysis of similarities to prove the groups are the same.

      We appreciate the reviewer’s concern about inferring the null from classical frequentist statistics. In this manuscript, we have opted to use Bayesian statistics to test for similarity across groups (see Methods: Statistical analysis), rather than the frequentist methods suggested by the reviewer. This approach is equivalent to the ones proposed by the reviewer and is widely used in our field. A Bayes factor (BF) smaller than 0.33 is regarded as sufficient evidence in support of the null hypothesis, that is, that there are no differences between the groups.

      This approach is described in detail in the methods and is introduced in the first section of the results as well.

      Results (1st section 2nd paragraph):

      “To further explore the non-significant performance difference between amputees and controls, we used a Bayesian approach (Rouder et al., 2009), that allows for testing of similarities between groups (the null hypothesis). In this analysis, the smaller effect size of the two reported here (1.39) was inputted as the Cauchy prior width. The resulting Bayesian Factor (BF10=0.28) provided moderate support to the null hypothesis (i.e., smaller than 0.33).”

      Methods (Statistical analysis section):

      “In parametric analyses (ANCOVA, ANOVA, Pearson correlations), where the frequentist approach yielded a non-significant p-value, a parallel Bayesian approach was used and Bayes Factors (BF) were reported (Morey & Rouder, 2015; Rouder et al., 2009, 2012, 2016). A BF<0.33 is interpreted as support for the null-hypothesis, BF > 3 is interpreted as support for the alternative hypothesis (Dienes, 2014). In Bayesian ANOVAs and ANCOVA’s, the inclusion Bayes Factor of an effect (BFIncl) is reported, reflecting that the data is X (BF) times more likely under the models that include the effect than under the models without this predictor. When using a Bayesian t-test, a Cauchy prior width of 1.39 was used, this was based on the effect size of the main task, when comparing artificial arm reaches of amputees and one-handers. Therefore, the null hypothesis in these cases would be there is no effect as large as the effect observed in the main task.”

      Following the reviewer’s comment, we have carefully scanned through the manuscript to make sure no equivalence claims are made without the support of a significant BF. In one instance that has been the case and has been rectified.

      Results (3rd section, 2nd paragraph):

      “We compared artificial arm and nondominant arm biases (distance from the centre of the endpoint to the target) across groups, using intact arm biases as a covariate. The ANCOVA resulted in no significant (inconclusive) group differences (F(2,47)=2.40, p=0.1, BFIncl=0.72; see Figure 2A).”
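
      The decision rule quoted from the Methods (BF < 0.33 supports the null, BF > 3 supports the alternative, anything in between is inconclusive) can be written out as a small helper. This is only an illustrative sketch; the function name and its return labels are hypothetical and it does not compute the Bayes factor itself.

```python
def interpret_bayes_factor(bf, null_cutoff=0.33, alternative_cutoff=3.0):
    """Classify a Bayes factor using the cutoffs quoted in the Methods."""
    if bf <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf < null_cutoff:
        return "support for the null hypothesis"
    if bf > alternative_cutoff:
        return "support for the alternative hypothesis"
    return "inconclusive"


# Values reported in the response above.
print(interpret_bayes_factor(0.28))  # amputees vs controls (BF10)
print(interpret_bayes_factor(0.72))  # bias ANCOVA (BFIncl)
```

      Under this rule the first value falls in the null-supporting range and the second in the inconclusive range, matching how the two results are described in the quoted passages.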

    1. Author Response

      Reviewer #1 (Public Review):

      In one of the most creative eDNA studies I have had the pleasure to review, the authors have taken advantage of an existing program several decades old to address whether insect declines are indeed occurring - an active area of discussion and debate within ecology. Here, they extracted arthropod environmental DNA (eDNA) from pulverized leaf samples collected from different tree species across different habitats. Their aim was to assess the arthropod community composition within the canopies of these trees during the time of collection to assess whether arthropod richness, diversity, and biomass were declining. By utilizing these leaf samples, the greatest shortcoming of assessing arthropod declines - the lack of historical data to compare to - was overcome, and strong timeseries evidence can now be used to inform the discussion. Through their use of eDNA metabarcoding, they were able to determine that richness was not declining, but there was evidence of beta diversity loss due to biotic homogenization occurring across different habitats. Furthermore, their application of qPCR to assess changes in eDNA copy number temporally and associate those changes with changes to arthropod biomass provided support to the argument that arthropod biomass is indeed declining. Taken together, these data add substantial weight to the current discussion regarding how arthropods are being affected in the Anthropocene.

      Thank you very much for the positive assessment of our work.

      I find the conclusions of the paper to be sound and mostly defensible, though there are some issues to take note of that may undermine these findings.

      Firstly, I saw no explanation of the requisite controls for such an experiment. An experiment of this scale should have detailed explanations of the field/equipment controls, extraction controls, and PCR controls to ensure there are no contamination issues that would otherwise undermine the entirety of the study. The manuscript mentions the presence of controls only once, so I surmise they must exist. Until such evidence is clearly outlined, the results should be treated with caution. Furthermore, the plate layout which includes these controls would help assess the extent of tag-jumping, should the plate plan proposed in Taberlet et al., 2018 be adopted.

      Second, without the presence of adequate controls, filtering schemes would be unable to determine whether there were contaminants and also be unable to remove them. This would also prevent samples from being filtered out should there be excessive levels of contamination present. Without such information, it makes it difficult to fully trust the data as presented.

      Finally, there is insufficient detail regarding the decontamination procedures of equipment used to prepare the samples (e.g., the cryomill). Without clear explanations of the steps the authors took to ensure samples were handled and prepared correctly, there is yet more concern that there may be unseen problems with the dataset.

      We are well aware of the potential issues and consequences of contamination in our work. However, we are also confident that our field and laboratory procedures adequately rule out these issues. We agree with the reviewer that we should expand more on our reasoning. Hence, we have now significantly expanded the Methods section outlining controls and sample purity, particularly under “Tree samples of the German Environmental Specimen Bank – Standardized time series samples stored at ultra-low temperatures” (lines 303-304), “Test for DNA carryover in the cryomill” (lines 448-464) and “Statistical analysis” (lines 570-575).

      We ran negative control extractions as well as negative control PCRs with all samples. These controls were sequenced along with all samples and used to explore the effect of experimental contamination. With the exception of a few reads of abundant taxa, these controls were mostly clean. We report this in more detail now in the Methods under “Sequence analysis” (lines 570-575). This suggests that our data are free of experimental contamination or tag jumping issues.

      We have also expanded on the avoidance of contamination in our field sampling protocols. The ESB has been set up for monitoring even the tiniest trace amounts of chemicals. Carryover between samples would render the samples useless. Hence, highly clean and standardized protocols are implemented. All samples are only collected with sterilized equipment under sterile conditions. Each piece of equipment is thoroughly decontaminated before sampling.

      The cryomill is another potential source of cross-contamination. The mill is disassembled after each sample and thoroughly cleaned. Milled samples have already been tested for chemical carryover, and none was found. We have now added an additional analysis to rule out DNA carryover. We received the milling schedule of samples for the past years. Assuming samples get contaminated by carryover between milling runs, two consecutive samples should show signatures of this carryover. We tested this for single-taxon carryover as well as community-wide beta diversity, but did not find any signal of contamination. This gives us confidence that our samples are very pure. The results of this test are now reported in the manuscript (Suppl. Fig 12 & Suppl. Table 3).
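      For illustration only, the logic of a consecutive-sample carryover test can be sketched as a permutation test. This is a minimal sketch and not the analysis reported in the manuscript: the function names, the use of Jaccard dissimilarity on presence/absence data, and the permutation scheme are all assumptions made for the example. If carryover occurred, samples milled back to back should be more similar to each other than samples placed in a random order.

```python
import random


def jaccard_dissimilarity(a, b):
    """Return 1 - |A & B| / |A | B| for two sets of taxa."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_consecutive_dissimilarity(order, communities):
    """Mean dissimilarity between each sample and the one milled right after it."""
    values = [
        jaccard_dissimilarity(communities[x], communities[y])
        for x, y in zip(order, order[1:])
    ]
    return sum(values) / len(values)


def carryover_permutation_test(order, communities, n_perm=999, seed=0):
    """One-sided test: are consecutive samples unusually similar?

    `order` is the (hypothetical) milling schedule as a list of sample IDs and
    `communities` maps each sample ID to the set of taxa detected in it.
    Returns the observed statistic and a permutation p-value.
    """
    rng = random.Random(seed)
    observed = mean_consecutive_dissimilarity(order, communities)
    shuffled = list(order)
    as_low = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_consecutive_dissimilarity(shuffled, communities) <= observed:
            as_low += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (as_low + 1) / (n_perm + 1)
```

      A small p-value would indicate that samples adjacent in the milling schedule share more taxa than expected by chance, which is the signature carryover would leave.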

      Reviewer #2 (Public Review):

      Krehenwinkel et al. investigated the long-term temporal dynamics of arthropod communities using environmental DNA (eDNA) remaining in archived leaf samples. The authors first developed a method to recover arthropod eDNA from archived leaf samples and carefully tested whether the developed method could reasonably reveal the dynamics of arthropod communities where the leaf samples originated. Then, using the eDNA method, the authors analyzed 30-year-long well-archived tree leaf samples in Germany and reconstructed the long-term temporal dynamics of arthropod communities associated with the tree species. The reconstructed time series includes several thousand arthropod species belonging to 23 orders, and the authors found interesting patterns in the time series. Contrary to some previous studies, the authors did not find widespread temporal α-diversity (OTU richness and haplotype diversity) declines. Instead, β-diversity among study sites gradually decreased, suggesting that the arthropod communities are more spatially homogenized in recent years. Overall, the authors suggested that the temporal dynamics of arthropod communities may be complex and involve changes in α- and β-diversity and demonstrated the usefulness of their unique eDNA-based approach.

      Strengths:

      The authors' idea of using eDNA remaining in archived leaf samples is unique and potentially applicable to other systems. For example, different types of specimens archived in museums may be utilized for reconstructing long-term community dynamics of other organisms, which would be beneficial for understanding and predicting ecosystem dynamics.

      A great strength of this work is that the authors very carefully tested their method. For example, the authors tested the effects of powdered leaves input weights, sampling methods, storing methods, PCR primers, and days from last precipitation to sampling on the eDNA metabarcoding results. The results showed that the tested variables did not significantly impact the eDNA metabarcoding results, which convinced me that the proposed method reasonably recovers arthropod eDNA from the archived leaf samples. Furthermore, the authors developed a method that can separately quantify 18S DNA copy numbers of arthropods and plants, which enables the estimations of relative arthropod eDNA copy numbers. While most eDNA studies provide relative abundance only, the DNA copy numbers measured in this study provide valuable information on arthropod community dynamics.

      Overall, the authors' idea is excellent, and I believe that the developed eDNA methodology reasonably reconstructed the long-term temporal dynamics of the target organisms, which are major strengths of this study.

      Thank you very much for the positive assessment of our work.

      Weaknesses:

      Although this work has major strengths in the eDNA experimental part, there are concerns in DNA sequence processing and statistical analyses.

      Statistical methods to analyze the temporal trend are too simplistic. The methods used in the study did not consider possible autocorrelation and other structures that the eDNA time series might have. It is well known that the applications of simple linear models to time series with autocorrelation structure incorrectly detect a "significant" temporal trend. For example, a linear model can often detect a significant trend even in a random walk time series.

      We have now reanalyzed our data controlling for autocorrelation and for non-linear changes of abundance and recover no change to our results. We have added this information to the manuscript under “Statistical analysis” (lines 629-644).
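      The reviewer's point about spurious trends can be illustrated with a small simulation. This is a hedged sketch, not the reanalysis described above: it assumes 30 annual time points, standard normal noise, and an ordinary least-squares fit of the series against time, and it compares the t statistic of the slope with the two-sided 5% critical value for 28 degrees of freedom (about 2.048). All function names are hypothetical.

```python
import math
import random

T_CRIT_DF28 = 2.048  # two-sided 5% critical value of Student's t, 28 d.f.


def slope_t_statistic(y):
    """t statistic of the OLS slope when regressing y on time 0..n-1."""
    n = len(y)
    x = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / standard_error


def white_noise(rng, n=30):
    """Independent observations with no trend and no autocorrelation."""
    return [rng.gauss(0, 1) for _ in range(n)]


def random_walk(rng, n=30):
    """Cumulative sum of independent steps: no drift, strong autocorrelation."""
    level, series = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        series.append(level)
    return series


def false_positive_rate(make_series, n_sim=500, seed=1):
    """Share of trend-free simulated series declared 'significant' at 5%."""
    rng = random.Random(seed)
    hits = sum(
        abs(slope_t_statistic(make_series(rng))) > T_CRIT_DF28
        for _ in range(n_sim)
    )
    return hits / n_sim
```

      Calling `false_positive_rate(white_noise)` should give a rate near the nominal 5%, whereas `false_positive_rate(random_walk)` is expected to be far higher even though neither process contains a true trend. This is why models that account for autocorrelation are preferred for such time series.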

      Also, there are some issues regarding the DNA sequence analysis and the subsequent use of the results. For example, read abundance was used in the statistical model, but the read abundance cannot be a proxy for species abundance/biomass. Because the total 18S DNA copy numbers of arthropods were quantified in the study, multiplying the sequence-based relative abundance by the total 18S DNA copy numbers may produce a better proxy of the abundance of arthropods, and the use of such a better proxy would be more appropriate here. In addition, a coverage-based rarefaction enables a more rigorous comparison of diversity (OTU diversity or haplotype diversity) than the read-based rarefaction does.

      We did not use read abundance as a proxy for abundance, but used our qPCR approach to measure relative copy number of arthropods. While there are biases to this (see our explanations above), the assay proved very reliable and robust. We thus believe it should indeed provide a rough estimate of biomass. As biomass is very commonly discussed in insect decline (in fact the first study on insect decline entirely relies on biomass; Hallmann et al. 2017), we feel it is important to include a proxy for this as well. However, we also discuss the alternative option that a turnover of diversity is affecting the measured biomass. A pattern of abundance loss for common species has been described in other works on insect decline.

      We liked the reviewer’s suggestion to use copy number information to perform abundance-informed rarefaction. We have done this now and added an additional analysis rarefying by copy number/biomass. A parallel analysis using this newly rarefied table was done for the total diversity as well as single species abundance change. Details can be found in the Methods and Results section of the manuscript. However, the result essentially remains the same. Even abundance-informed rarefaction does not lead to a pattern of loss of species richness over time (see “Statistical analysis”).
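      As a rough illustration of the two ideas raised by the reviewer, the sketch below scales per-OTU read proportions by a sample's total copy number and computes Good's coverage estimate (1 minus the fraction of reads that belong to singleton OTUs), which is the quantity that coverage-based rarefaction standardizes. The function names and the dictionary input format are assumptions for this example and do not reproduce the pipeline used in the manuscript.

```python
def copy_number_weighted_abundance(read_counts, total_copy_number):
    """Scale each OTU's read proportion by the sample's total copy number.

    `read_counts` maps OTU IDs to read counts for one sample;
    `total_copy_number` is the qPCR-derived arthropod copy number
    (hypothetical units) for the same sample.
    """
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        return {otu: 0.0 for otu in read_counts}
    return {
        otu: reads / total_reads * total_copy_number
        for otu, reads in read_counts.items()
    }


def good_coverage(read_counts):
    """Good's coverage: 1 - (number of singleton OTUs / total reads)."""
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        return 0.0
    singletons = sum(1 for reads in read_counts.values() if reads == 1)
    return 1.0 - singletons / total_reads
```

      Samples compared at equal coverage, rather than at equal read depth, are compared at an equal estimated completeness of sampling.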

      The overall results are supporting a scenario of no overall loss of species richness over time, but a loss of abundance for common species. And we indeed see the pattern of declining abundance for once-common species in our data, for example the loss of the Green Silver-Line moth, once a very common species in beech canopy (Suppl. Fig. 10). We have added details on this to the Discussion (lines 254-260).

      These points may significantly impact the conclusions of this work.

      Reviewer #3 (Public Review):

      The aim of Weber and colleagues' study was to generate arthropod environmental DNA extracted from a unique 30-year time series of deep-frozen leaf material sampled at 24 German sites that represent four different land use types. Using this dataset, they explore how the arthropod community has changed through time in these sites, using both conventional metabarcoding to reconstruct the OTUs present, and a new qPCR assay developed to estimate the overall arthropod diversity on the collected material. Overall their results show that while no clear changes in alpha diversity are found, the β-diversity dropped significantly over time in many sites, most notably in the beech forests. Overall I believe their data supports these findings, and thus their conclusion that diversity is becoming homogenized through time is valid.

      Thank you for the positive assessment.

      While overall I do not doubt the general findings, I have a number of comments. Firstly while I agree this is a very nice study on a unique dataset - other temporal datasets of insects that were used for eDNA studies do exist, and perhaps it would be relevant to put the findings into context (or even the study design) of other work that has been done on such datasets. One example that jumps to my mind is Thomsen et al. 2015 https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.12452 but I am sure there are others.

      We have expanded the introduction and discussion on this citing this among other studies now (lines 71-72, 276-278).

      From a technical point of view, the conclusions of course rely on several assumptions, including (1) that the biomass assay is effective and (2) that the reconstructed levels of OTU diversity are accurate.

      With regards to biomass, although it is stated in the manuscript that "Relative eDNA copy number should be a predictor for relative biomass", this is in fact only true if one assumes a number of things, e.g. there is a similar copy number of 18S rDNA per species, similar numbers of mtDNA per cell, a similar number of cells per individual species etc. In this regard, on the positive side, it is gratifying to see that the authors perform a validation assay on 7 mock controls, and these seem to indicate the assay works well. Given how critical this is, I recommend discussing the details of this a bit more, and why the authors are convinced the assay is effective, in the main text so that the reader is able to fully decide if they are in agreement. However, on the negative side, I am concerned that the strategy taken to perform the qPCR may not have been ideal. Specifically, the assay is based on nested PCR, where the authors first perform a 15-cycle amplification, this product is purified, then put into a subsequent qPCR. Given that PCR is notorious for introducing amplification biases in general (especially when performed on low levels of DNA), and that nested PCRs are notoriously contamination-prone, this approach seems to be asking for trouble. This raises the question - why not just do the qPCR directly on the extracts (one can still dilute the plant DNA 100x prior to qPCR if needed). Further, given the qPCRs were run in triplicate I think the full data (Ct values) for this should be released (as opposed to just stating in the paper that the average values were used). In this way, the readers will be able to judge how replicable the assay was - something I think is critical given how noisy the patterns in Fig S10 seem to be.

      We agree with this point, and this is why we do not want to overstate the decline in copy number. This is an additional source of data next to genetic and species diversity. We have added to our discussion of turnover as another potential driver of copy number change (lines 257-260). We have also added text addressing the robustness of the mock community assay (lines 138-141).

      However, we are confident of the reliability and robustness of our qPCR assay for the detection of relative arthropod copy number. We performed several validations and optimizations before using the assay. We have added additional details to the manuscript on this (see “Detection of relative arthropod DNA copy number using quantitative PCR”, lines 548-556). We got the idea for the nested qPCR from a study (Tran et al.) showing its high accuracy and reproducibility. We show that our assay has a very high replicability using triplicates of each qPCR, which we will now include in the supplementary data on Dryad. The SD of Ct values is very low (~ 0.1 on average). NTC were run with all qPCRs to rule out contamination as an issue in the experiments. We also find a very high efficiency of the assay. At dilutions far outside the observed copy number in our actual leaf data, we still find the assay to be accurate. We found very comparable abundance changes across our highly taxonomically diverse mock communities. This also suggests that abundance changes are a more likely explanation than simple turnover for the observed drop in copy number. A biomass loss for common species is well in line with recent reports on insect decline. We can also rely on several other mock community studies (Krehenwinkel et al. 2017 & 2019) where we used read abundance of 18S and found it to be a relatively good predictor of relative biomass.
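      Readers who want to check replicability from released Ct values could summarize the triplicates along the following lines. This is a minimal sketch under stated assumptions: the input format (sample ID mapped to a list of replicate Ct values), the function names, and the 0.3-cycle flagging threshold are hypothetical and are not taken from the manuscript.

```python
import statistics


def summarize_ct_replicates(ct_values):
    """Return {sample: (mean Ct, sample SD of Ct)} for replicate qPCR runs."""
    return {
        sample: (statistics.mean(values), statistics.stdev(values))
        for sample, values in ct_values.items()
    }


def flag_noisy_samples(ct_values, max_sd=0.3):
    """List samples whose replicate SD exceeds an illustrative threshold."""
    summary = summarize_ct_replicates(ct_values)
    return sorted(sample for sample, (_, sd) in summary.items() if sd > max_sd)
```

      For example, `flag_noisy_samples({"s1": [20.0, 20.1, 20.2], "s2": [25.0, 26.0, 27.0]})` would single out only `"s2"`, whose replicates differ by a full cycle.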

      The pattern in Fig. S10 is not really noisy. It just reflects typical population fluctuations for arthropods. Most arthropod taxa undergo very pronounced temporal abundance fluctuations between years.

      Next, with regards to the observation that the results reveal an overall decrease in arthropod biomass over time: The authors suggest one alternate to their theory, that the dropping DNA copy number may reflect taxonomic turnover of species with different eDNA shedding rates. Could there be another potential explanation - simply be that leaves are getting denser/larger? Can this be ruled out in some way, e.g. via data on leaf mass through time for these trees? (From this dataset or indeed any other place).

      This is a very good point. However, we can rule out this hypothesis, as the ESB performs intensive biometric data analysis. The average leaf weight and water content have not significantly changed in our sites. We have addressed this in the Methods section (see ”Tree samples of the German Environmental Specimen Bank – Standardized time series samples stored at ultra-low temperatures”, lines 308-311).

      With regards to estimates of OTU/zOTU diversity. The authors state in the manuscript that zOTUs represent individual haplotypes, thus genetic variation within species. This is only true if they do not represent PCR and/or sequencing errors. Perhaps therefore they would be able to elaborate (for the non-computational/eDNA specialist reader) on why their sequence processing methods rule out this possibility? One very good bit of evidence would be that identical haplotypes for the individual species are found in the replicate PCRs. Or even between different extractions at single locations/timepoints.

      We have repeated the analysis of genetic variation with much more stringent filtering criteria (see “Statistical analysis”, lines 611-615). Among other filtering steps, this also includes the use of only those zOTUs that occur in both technical replicates, as suggested by the reviewer. Another reason to make us believe we are dealing with true haplotypic variation here is that haplotypes show geographic variation. E.g., some haplotypes are more abundant in some sites than in others. NUMTS would consistently show a simple correlation in their abundance with the most abundant true haplotype.
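      The replicate-based filter mentioned above can be sketched as a simple set intersection. This is an illustrative example only: the function name, the per-replicate dictionaries of read counts, and the `min_reads` parameter are assumptions and do not describe the authors' exact filtering criteria.

```python
def zotus_in_both_replicates(replicate_a, replicate_b, min_reads=1):
    """Keep zOTUs that reach `min_reads` in both technical replicates.

    Each replicate is a dict mapping zOTU IDs to read counts. Sequences that
    appear in only one replicate are treated as likely PCR or sequencing
    errors and are dropped.
    """
    shared = replicate_a.keys() & replicate_b.keys()
    return {
        zotu
        for zotu in shared
        if replicate_a[zotu] >= min_reads and replicate_b[zotu] >= min_reads
    }
```

      Raising `min_reads` makes the filter more stringent at the cost of discarding rare but genuine haplotypes.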

      With regards to the bigger picture, one thing I found very interesting from a technical point of view is that the authors explored how modifying the mass of plant material used in the extraction affects the overall results, and basically find that using more than 200mg provides no real advantage. In this regard, I draw the authors' and readers' attention to an excellent paper by Mata et al. (https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14779) - where these authors compare the effect of increasing the amount of bat faeces used in a bat diet metabarcoding study, on the OTUs generated. Essentially Mata and colleagues report that as the amount of faeces increases, the rare taxa (e.g. those found at a low level in a single faeces) get lost - they are simply diluted out by the common taxa (e.g. those in all faeces). In contrast, increasing biological replicates (in their case more individual faecal samples) increased diversity. I think these results are relevant in the context of the experiment described in this new manuscript, as they seem to show similar results - there is no benefit of considerably increasing the amount of leaf tissue used. And if so, this seems to point to a general principle of relevance to the design of metabarcoding studies, thus of likely wide interest.

      Thank you for this interesting study, which we were not aware of before. The cryomilling is an extremely efficient approach to equally disperse even traces of chemicals in a sample. This has been established for trace chemicals early during the operation of the ESB, but also seems to hold true for eDNA in the samples. We have recently done more replication experiments from different ESB samples (different terrestrial and marine samples for different taxonomic groups) and find that replication of extraction does not provide much more benefit than replication of PCR. Even after 2 replicates, diversity approaches saturation. This can be seen in the plot below, which shows recovered eDNA diversity for different ESB samples and different taxonomic groups from 1-4 replicates. A single extract of a small volume contains DNA from nearly all taxa in the community. Rare taxa can be enriched with more PCR replicates.

    1. Author response

      Reviewer #1 (Public Review):

      This careful study reports the importance of Rab12 for Parkinson's disease associated LRRK2 kinase activity in cells. The authors carried out a targeted siRNA screen of Rab substrates and found lower pRab10 levels in cells depleted of Rab12. It has previously been reported that LLOMe treatment of cells breaks lysosomes and with time, leads to major activation of LRRK2 kinase. Here they show that LLOMe-induced kinase activation requires Rab12 and does not require Rab12 phosphorylation to show the effect.

      We thank the reviewer for their comments regarding the carefulness and importance of our work and for their specific feedback which has substantially improved our revised manuscript.

      1) Throughout the text, the authors claim that "Rab12 is required for LRRK2 dependent phosphorylation" (Page 4 line 78; Page 9 line 153; Page 22 line 421). This is not correct according to Figure 1 Figure Supp 1B - there is still pRab10. It is correct only in relation to the LLOMe activation. Please correct this error.

      We appreciate the reviewer’s comment around the requirement of Rab12 for LRRK2-dependent phosphorylation of Rab10 and question regarding whether this is relevant under baseline conditions or only in relation to LLOMe activation. Using our MSD-based assay to quantify pT73 Rab10 levels under basal conditions, we observed a similar reduction in Rab10 phosphorylation when we knock down Rab12 as we observed with LRRK2 knockdown (Figure 1A). Further, we see comparable reduction in Rab10 phosphorylation in RAB12 KO cells as that observed in LRRK2 KO cells using our MSD-based assay (Figure 2A and B). Based on these data, we believe Rab12 is a key regulator of LRRK2 activation under basal conditions without additional lysosomal damage. However, as the reviewer noted, we do observe some residual Rab10 phosphorylation upon Rab12 knockdown when assessed by western blot analysis (Figure 1D and Figure 1- figure supplement 1). A similar signal is observed upon LRRK2 knockdown, which may suggest that some small amount of Rab10 phosphorylation may be mediated by another kinase in this cell model. Nevertheless, we appreciate this reviewer’s point and have therefore modified the text to remove any reference to Rab12 being required for LRRK2-dependent Rab phosphorylation and now instead refer to Rab12 as a regulator of LRRK2 activity.

      As noted by the reviewer, our data does suggest that Rab12 is required for the increase in Rab10 phosphorylation observed following LLOMe treatment to elicit lysosomal damage, and we now refer to this appropriately throughout the text.

      2) The authors conclude that Rab12 recruitment precedes that of LRRK2 but the rate of recruitment (slopes of curves in 3F and G) is actually faster for LRRK2 than for Rab12 with no proof that Rab12 is faster-please modify the text-it looks more like coordinated recruitment.

      The reviewer raises an excellent point regarding our ability to delineate whether Rab12 recruitment precedes that of LRRK2 on lysosomes following LLOMe treatment. As noted by the reviewer, we do see both the recruitment of Rab12 and LRRK2 to lysosomes increase on a similar timescale, so we cannot truly resolve whether Rab12 recruitment precedes LRRK2 recruitment in our studies. Based on this, we have modified the text to emphasize that this data supports coordinated recruitment, as suggested, and we have further removed any mention of Rab12 preceding LRRK2. The specific change is as follows “Rab12 colocalization with LRRK2 increased over time following LLOMe treatment, supporting potential coordinated recruitment of these proteins to lysosomes upon damage (Figure 3I). Together, these data demonstrate that Rab12 and LRRK2 both associate with lysosomes following membrane rupture.” and can be found on lines 460-463 of the updated manuscript.

      3) The title is misleading because the authors do not show that Rab12 promotes LRRK2 membrane association. This would require Rab12 to be sufficient to localize LRRK2 to a mislocalized Rab12. The authors DO show that Rab12 is needed for the massive LLOME activation at lysosomes. Please re-word the title.

      To address the reviewer’s concern regarding the title of our manuscript, we have modified the title from “Rab12 regulates LRRK2 activity by promoting its localization to lysosomes” to “Rab12 regulates LRRK2 activity by facilitating its localization to lysosomes” to soften the language around the sufficiency of Rab12 in regulating the localization of LRRK2 to lysosomes. We show that Rab12 deletion significantly reduces LRRK2 activity (as assessed by Rab10 phosphorylation on lysosomes) and significantly increases the localization of LRRK2 to lysosomes upon lysosomal damage. The updated title better reflects the regulatory role of Rab12 in modulating LRRK2 activity, and we thank the reviewer for their suggestion to modify this accordingly.

      Reviewer #2 (Public Review):

      This study shows that rab12 has a role in the phosphorylation of rab10 by LRRK2. Many publications have previously focused on the phosphorylation targets of LRRK2 and the significance of many remains unclear, but the study of LRRK2 activation has mostly focused on the role of disease-associated mutations (in LRRK2 and VPS35) and rab29. The work is performed entirely in an alveolar lung cell line, limiting relevance for the nervous system. Nonetheless, the authors take advantage of this simplified system to explore the mechanism by which rab12 activates LRRK2. In general, the work is performed very carefully with appropriate controls, excluding trivial explanations for the results, but there are several serious problems with the experiments and in particular the interpretation.

      We appreciate the reviewer’s comments regarding the rigor of our work and the potential impact of our studies to address a key unanswered question in the field regarding the mechanisms by which LRRK2 activation is mediated. Our studies focused on the A549 cell model given its high endogenous expression of LRRK2 and Rab10, and this cell line provided a simple system to investigate the mechanism and impact of Rab12-dependent regulation of LRRK2 activity. We agree with the reviewer that future studies are warranted to understand whether similar Rab12-dependent regulation of LRRK2 occurs in relevant CNS cell types.

      First, the authors note that rab29 appears to have a smaller or no effect when knocked down in these cells. However, the quantitation (Fig1-S1A) shows a much less significant knockdown of rab29 than rab12, so it would be important to repeat this with better knockdown or preferably a KO (by CRISPR) before making this conclusion. And the relationship to rab29 is important, so if a better KD or KO shows an effect, it would be important to assess by knocking down rab12 in the rab29 KO background.

      The reviewer raises a good point regarding the importance of confirming that loss of Rab29 has no effect on Rab10 phosphorylation. To address potential concerns about insufficient Rab29 knockdown, we measured the levels of pT73 Rab10 in RAB29 KO A549 cells by MSD-based analysis. RAB29 deletion had no effect on Rab10 phosphorylation, confirming findings from our RAB siRNA screen and the observations of Dario Alessi’s group reported previously (Kalogeropulou et al Biochem J 2020; PMID: 33135724). We have included this new data into our updated manuscript in Figure 1- figure supplement 1 and comment on it on page 6 in the updated Results section.

      Secondly, the knockdown of rab12 generally has a strong effect on the phosphorylation of the LRRK2 substrate rab10 but I could not find an experiment that shows whether rab12 has any effect on the residual phosphorylation of rab10 in the LRRK2 KO. There is not much phosphorylation left in the absence of LRRK2 but maybe this depends on rab12 just as much as in cells with LRRK2 and rab12 is operating independently of LRRK2, either through a different kinase or simply by making rab10 more available for phosphorylation. The epistasis experiment is crucial to address this possibility. To establish the connection to LRRK2, it would also help to compare the effect of rab12 KD on the phosphorylation of selected rabs that do or do not depend on LRRK2.

      The reviewer raises an interesting question regarding whether Rab12 can further reduce Rab10 phosphorylation independently of LRRK2. Using our quantitative MSD-based assay, we observe that pRab10 levels are at the lower limits of detection of the assay in LRRK2 KO A549 cells. Unfortunately, this means that we are unable to detect whether there might be any additional minor reduction in Rab10 phosphorylation with Rab12 knockdown in LRRK2 KO cells. We cannot rule out that Rab12 may play a LRRK2-independent role in regulating Rab10 phosphorylation in other cell lines, and future studies are warranted to explore whether Rab12 knockdown can further reduce Rab10 phosphorylation in other systems, including in CNS cells.

      Regarding exploring the effects of RAB12 knockdown on the phosphorylation of other Rabs, we also assessed the impact of RAB12 KO on phosphorylation of another LRRK2-Rab substrate, Rab8a. We observed a strong reduction in pT72 Rab8a levels in RAB12 KO cells compared to wildtype cells, suggesting the impact of RAB12 deletion extends beyond Rab10 (see representative western blot in Author response image 1). Due to potential concerns with the selectivity of the pT72 Rab8a antibody (potentially detecting the phosphorylation of other LRRK2-Rabs), we cannot definitively demonstrate that Rab12 mediates the phosphorylation of other Rabs. This question should be revisited when additional phospho-Rab antibodies become available that enable us to selectively detect LRRK2-dependent phosphorylation of additional Rab substrates under endogenous expression conditions.

      Author response image 1.

      A strength of the work is the demonstration of p-rab10 recruitment to lysosomes by biochemistry and imaging. The demonstration that LRRK2 is required for this by biochemistry (Fig 4A) is very important but it would also be good to determine whether the requirement for LRRK2 extends to imaging. In support of a causal relationship, the authors also state that lysosomal accumulation of rab12 precedes LRRK2 but the data do not show this. Imaging with and without LRRK2 would provide more compelling evidence for a causative role.

      We thank the reviewer for their suggestion to assess Rab12 recruitment to damaged lysosomes with and without LRRK2 using imaging-based analyses to add confidence to our findings from biochemical approaches. To address this comment, we have imaged the recruitment of mCherry-tagged Rab12 to lysosomes (as assessed using an antibody against endogenous LAMP1) and observed a significant increase in Rab12 levels on lysosomes following LLOMe treatment. This occurs to a similar extent in LRRK2 KO A549 cells, suggesting that Rab12 is an upstream regulator of LRRK2 activity. This new data has been incorporated into the revised manuscript (Figure 3E) and is presented on page 20 of the updated manuscript.

      Our conclusions on this are further strengthened by new data assessing Rab12 recruitment to lysosomes using orthogonal analysis of isolated lysosomes biochemically. Using the Lyso-IP method, we observed a strong increase in the levels of Rab12 on lysosomes following LLOMe treatment that was maintained in LRRK2 KO cells. These data have been added to the updated manuscript (new data added to Figure 3- figure supplement 1).

      Together, these data support our hypothesis that Rab12 recruitment to damaged lysosomes is upstream, and independent, of LRRK2.

      The authors also touch base with PD mutations, showing that loss of rab12 reduces the phosphorylation of rab10. However, it is interesting that loss of rab12 has the same effect with R1441G LRRK2 and D620N VPS35 as it does in controls. This suggests that the effect of rab12 does not depend on the extent of LRRK2 activation. It is also surprising that R1441G LRRK2 does not increase p-rab10 phosphorylation (Fig 2G) as suggested in the literature and stated in the text.

      We agree with the reviewer that it is quite interesting that RAB12 knockdown significantly attenuates Rab10 phosphorylation in the context of PD-linked variants in addition to that observed in wildtype cells basally and after LLOMe treatment. As noted by the reviewer, we did not observe increased levels of phospho-Rab10 in LRRK2 R1441G KI A549 cells at the whole cell level (Figure 2G). However, we observed a significant increase in Rab10 phosphorylation on isolated lysosomes from LRRK2 R1441G KI cells compared to WT cells (Figure 4B). This may suggest that the LRRK2 R1441G variant leads to a more modest increase in LRRK2 activity in this cell model. Previous studies in MEFs from LRRK2 R1441G KI mice or neutrophils from human subjects that carry the LRRK2 R1441G variant showed a 3-4 fold increase in Rab10 phosphorylation (Fan et al Acta Neuropathol 2021 PMID: 34125248 and Karaye et al Mol Cell Proteomics 2020 PMID: 32601174), supporting that this variant does lead to increased Rab10 phosphorylation and that the extent of LRRK2 activation may vary across different cell types.

      Most important, the final figure suggests that PD-associated mutations in LRRK2 and VPS35 occlude the effect of lysosomal disruption on lysosomal recruitment of LRRK2 (Fig 4D) but do not impair the phosphorylation of rab10 also triggered by lysosomal disruption (4A-C). Phosphorylation of this target thus appears to be regulated independently of LRRK2 recruitment to the lysosome, suggesting another level of control (perhaps of kinase activity rather than localization) that has not been considered.

      The reviewer suggests an interesting hypothesis around the existence of additional levels of control beyond the lysosomal levels of LRRK2 to lead to increased Rab10 phosphorylation on lysosomes. Given the variability we have observed in measuring endogenous LRRK2 levels on lysosomes, we performed two additional replicates to assess lysosomal LRRK2 levels in LRRK2 R1441G KI and VPS35 D620N KI cells at baseline and after treatment with LLOMe. We observed a significant increase in LRRK2 levels on lysosomes in cells expressing either PD-linked variant and a trend toward a further increase in the levels of LRRK2 on lysosomes after LLOMe treatment in these cells (Figure 4D in the updated manuscript). We have updated the text on page 24 to reflect this change, suggesting that the PD-linked variants do not fully occlude the effect of lysosomal disruption on the lysosomal recruitment of LRRK2.

      LLOMe treatment leads to a stronger increase in Rab10 phosphorylation on lysosomes from LRRK2 R1441G and VPS35 D620N cells compared to the modest increase in LRRK2 levels observed. This could suggest that, as the reviewer noted, additional mechanisms beyond increased lysosomal localization of LRRK2 may be driving the robust increase in Rab10 phosphorylation observed. We have modified the results section on lines 548-551 to highlight this possibility: “Rab10 phosphorylation showed a more significant increase in response to LLOMe treatment than LRRK2 on lysosomes from LRRK2 R1441G and VPS35 D620N KI cells, suggesting that there may be more regulation beyond the enhanced proximity between LRRK2 and Rab that contribute to LRRK2 activation in response to lysosomal damage.”

      Reviewer #3 (Public Review):

      Increased LRRK2 kinase activity is known to confer Parkinson's disease risk. While much is known about disease-causing LRRK2 mutations that increase LRRK2 kinase activity, the normal cellular mechanisms of LRRK2 activation are less well understood. Rab GTPases are known to play a role in LRRK2 activation and to be substrates for the kinase activity of LRRK2. However, much of the data on Rabs in LRRK2 activation comes from over-expression studies and the contributions of endogenously expressed Rabs to LRRK2 activation are less clear. To address this problem, Bondar and colleagues tested the impact of systematically depleting candidate Rab GTPases on LRRK2 activity as measured by its ability to phosphorylate Rab10 in the human A549 type 2 pneumocyte cell line. This resulted in the identification of a major role for Rab12 in controlling LRRK2 activity towards Rab10 in this model system. Follow-up studies show that this role for Rab12 is of particular importance for the phosphorylation of Rab10 by LRRK2 at damaged lysosomes. Increases in LRRK2 activity in cells harboring disease-causing mutants of LRRK2 and VPS35 also depend (at least partially) on Rab12. Confidence in the role of Rab12 in supporting LRRK2 activity is strengthened by parallel experiments showing that either siRNA-mediated depletion of Rab12 or CRISPR-mediated Rab12 KO both have similar effects on LRRK2 activity. Collectively, these results demonstrate a novel role for Rab12 in supporting LRRK2 activation in A549 cells. It is likely that this effect is generalizable to other cell types. However, this remains to be established. It is also likely that lysosomes are the subcellular site where Rab12-dependent activation of LRRK2 occurs. 
      Independent validation of these conclusions with additional experiments would strengthen them and help to address the concern that much of the data supporting a lysosomal localization for Rab12-dependent activation of LRRK2 comes from a single method (LysoIP). Furthermore, there is a discrepancy between panels 4A and 4D in the effect of LLOMe-induced lysosome damage on LRRK2 recruitment to lysosomes that will need to be addressed to strengthen confidence in conclusions about lysosomes as sites of LRRK2 activation by Rab12.

      We thank the reviewer for their comments regarding our work that identifies Rab12 as a novel regulator of LRRK2 activation and the appreciation of the parallel approaches we employed to add confidence in this effect.

      As suggested by the reviewer, we have updated our manuscript to now include independent validation of our conclusions using imaging-based analyses to complement our data from biochemical analyses using the Lyso-IP method. Specifically, we have included new imaging data that confirms that Rab12 levels are increased on lysosomes following membrane permeabilization with LLOMe treatment and demonstrates that this occurs independent of LRRK2, providing additional support that Rab12 is an upstream regulator of LRRK2 activity (Figure 3E in the updated manuscript).

      Regarding the reviewer’s comment on a discrepancy between our findings in Figure 4A and Figure 4D, we have performed additional independent replicates in Figure 4D to assess the impact of lysosomal damage on the lysosomal levels of LRRK2 at baseline or upon the expression of genetic variants. We observed a significant increase in LRRK2 levels on lysosomes following LLOMe treatment in our set of experiments included in Figure 4A and a non-significant trend toward an increase in LRRK2 levels on isolated lysosomes in Figure 4D. As described in more detail below (in response to the second point raised by this reviewer), we think this variability arises because of a combination of low levels of LRRK2 on lysosomes with endogenous expression and variability across experiments in the efficiency of lysosomal isolation. Our observations of increased recruitment of LRRK2 to lysosomes upon damage are further supported by parallel imaging-based studies (Figure 3F-I) and are consistent with previous studies using overexpression systems.

      We thank the reviewer for all of the suggestions which have added further confidence to our conclusions and substantially improved the manuscript.

    1. Author Response

      eLife assessment

      This important paper exploits new cryo-EM tomography tools to examine the state of chromatin in situ. The experimental work is meticulously performed and convincing, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. Despite the possibly controversial nature of this report, it is our hope that such work will spark thought-provoking debate, and further the development of exciting new tools that can interrogate native chromatin shape and associated function in vivo.

      We thank the Editors and Reviewers for their thoughtful and helpful comments. We also appreciate the extraordinary amount of effort needed to assess both the lengthy manuscript and the previous reviews. Below, we provide our provisional responses in bold blue font. The majority of the comments are straightforward to address. We have taken a more conservative approach with the subset of comments that would require us to speculate because we either lack key information or we lack technical expertise. Instead of adding the speculative replies to the main text, we think it will be better to leave them in the rebuttal for posterity. Readers will therefore have access to our speculation and know that we did not feel confident enough to include these thoughts in the Version of Record.

      Reviewer #1 (Public Review):

      This manuscript by Tan et al. uses cryo-electron tomography to investigate the structure of yeast nucleosomes both ex vivo (nuclear lysates) and in situ (lamellae and cryosections). The sheer number of experiments and results is astounding and comparable with an entire PhD thesis. However, as is always the case, it is hard to prove that something is not there. In this case, canonical nucleosomes. In their path to find the nucleosomes, the authors also stumble over new insights into nucleosome arrangement that indicate that the positions of the histones are more flexible than previously believed.

      We want to point out that canonical nucleosomes are there in wild-type cells in situ, albeit rarer than what’s expected based on our HeLa cell analysis. The negative result (absence of any canonical nucleosome classes in situ) was found in the histone-GFP mutants.

      Major strengths and weaknesses:

      Personally, I am not ready to agree with their conclusion that heterogenous non-canonical nucleosomes predominate in yeast cells, but this reviewer is not an expert in the field of nucleosomes and can't judge how well these results fit into previous results in the field. As a technological expert though, I think the authors have done everything possible to test that hypothesis with today's available methods. One can debate whether it is necessary to have 35 supplementary figures, but after working through them all, I see that the nature of the argument needs all that support, precisely because it is so hard to show what is not there. The massive amount of work that has gone into this manuscript and the state-of-the art nature of the technology should be warmly commended. I also think the authors have done a really great job with including all their results to the benefit of the scientific community. Yet, I am left with some questions and comments:

      Could the nucleosomes change into other shapes that were predetermined in situ? Could the authors expand on if there was a structure or two that was more common than the others of the classes they found? Or would this not have been found because of the template matching and later reference particle used?

      Our best guess (speculation) is that one of the class averages that is smaller than the canonical nucleosome contains one or more non-canonical nucleosome classes. We do not feel confident enough to single out any of these classes precisely because we do not yet know if they arise from one non-canonical nucleosome structure or from multiple – and therefore mis-classified – non-canonical nucleosome structures (potentially with other non-nucleosome complexes mixed in). We feel it is better to leave this discussion out of the manuscript, or risk sending the community on wild goose chases.

      Our template-matching workflow uses a low-enough cross-correlation threshold that any nucleosome-sized particle (plus or minus a few nanometers) would be picked, which is why the number of hits is so large. So unless the non-canonical nucleosomes quadrupled in size or lost most of their histones, they should be grouped with one or more of the other 99 class averages (WT cells) or any of the 100 class averages (cells with GFP-tagged histones). As to whether the later reference particle could have prevented us from detecting one of the non-canonical nucleosome structures, we are unable to tell because we’d really have to know what an in situ non-canonical nucleosome looks like first.

      Could it simply be that the yeast nucleoplasm is differently structured than that of HeLa cells and it was harder to find nucleosomes by template matching in these cells? The authors argue against crowding in the discussion, but maybe it is just a nucleoplasm texture that side-tracks the programs?

      Presumably, the nucleoplasmic “side-tracking” texture would come from some molecules in the yeast nucleus. These molecules would be too small to visualize as discrete particles in the tomographic slices, but they would contribute textures that can be “seen” by the programs – in particular RELION, which does the discrimination between structural states. We do not know the inner-workings of RELION well enough to say what kinds of density textures would side-track its classification routines.

      The title of the paper is not well reflected in the main figures. The title of Figure 2 says "Canonical nucleosomes are rare in wild-type cells", but that is not shown/quantified in that figure. Rare is comparison to what? I suggest adding a comparative view from the HeLa cells, like the text does in lines 195-199. A measure of nucleosomes detected per volume nucleoplasm would also facilitate a comparison.

      Figure 2’s title is indeed unclear and does not align with the paper’s title and key conclusion. The rarity here is relative to the expected number of nucleosomes (canonical plus non-canonical). We have changed the title to “Canonical nucleosomes are a minority of the expected total in wild-type cells”. We would prefer to leave the reference to HeLa cells to the main text instead of as a figure panel because the comparison is not straightforward for a graphical presentation. Instead, we will report the total number of nucleosomes estimated for this particular tomogram (~7,600) versus the number of canonical nucleosomes classified (297; 594 if we assume we missed half of them).
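
      The proportions implied by these counts can be worked out directly. The short sketch below uses only the numbers quoted above (about 7,600 estimated nucleosomes, 297 classified as canonical, and the assumption that half were missed); the variable and function names are illustrative and do not come from the manuscript.

```python
# Counts quoted in the response for this particular tomogram.
estimated_total = 7600   # approximate number of nucleosomes expected (~7,600)
canonical_found = 297    # canonical nucleosomes classified


def percent_of_total(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Fraction actually classified as canonical.
observed_pct = percent_of_total(canonical_found, estimated_total)

# Scenario from the response: assume half of the canonical nucleosomes were missed.
corrected_pct = percent_of_total(2 * canonical_found, estimated_total)

print(f"canonical, as classified: {observed_pct:.1f}% of the estimated total")
print(f"canonical, if half were missed: {corrected_pct:.1f}% of the estimated total")
```

      Under either assumption the canonical class stays below 10% of the estimated total for this tomogram.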

      If the cell contains mostly non-canonical nucleosomes, are they really non-canonical? Maybe a change of language is required once this is somewhat sure (say, after line 303).

      This is an interesting semantic and philosophical point. From the yeast cell’s “perspective”, the canonical nucleosome structure would be the form that is in the majority. That being said, we do not know if there is one structure that is the majority. From the chromatin field’s point of view, the canonical nucleosome is the form that is most commonly seen in all the historical – and most contemporary – literature, namely something that resembles the crystal structure of Luger et al, 1997. Given these two lines of thinking, we will add the following clarification after line 303:

      “At present, we do not know what the non-canonical nucleosome structures are, meaning that we cannot even determine if one non-canonical structure is the majority. Until we know what the family of non-canonical nucleosome structures are, we will use the term non-canonical to describe the nucleosomes that do not have the canonical (crystal) structure”.

      The authors could explain more why they sometimes use the conventional 2D followed by 3D classification approach and sometimes "direct 3-D classification". Why, for example, do they do 2D followed by 3D in Figure S5A? This Figure could be considered a regular figure since it shows the main message of the paper.

      Because the classification of subtomograms in situ is still a work in progress, we felt it would be better to show one instance of 2-D classification for lysates and one for lamellae. While it is true that we could have presented direct 3-D classification for the entire paper, we anticipate that readers will be interested to see what the in situ 2-D class averages look like.

      The main message is that there are canonical nucleosomes in situ (at least in wild-type cells), but they are a minority. Therefore, the conventional classification for Figure S5A should not be a main figure because it does not show any canonical nucleosome class averages in situ.

      Figure 1: Why is there a gap in the middle of the nucleosome in panel B? The authors write that this is a higher resolution structure (18Å), but in the even higher resolution crystallography structure (3Å resolution), there is no gap in the middle.

      There is a lower concentration of amino acids at the middle in the disc view; unfortunately, the space-filling model in Figure 1A hides this feature. The gap exists in experimental cryo-EM density maps. See below for an example. The size of the gap depends on the contour level and probably the contrast mechanism, as the gap is less visible in the VPP subtomogram averages. To clarify this confusing phenomenon, we will add the following lines to the figure legend:

      “The gap in the disc view of the nuclear-lysate-based average is due to the lower concentration of amino acids there, which is not visible in panel A due to space-filling rendering. This gap’s size may depend on the contrast mechanism because it is not visible in the VPP averages.”

      Reviewer #2 (Public Review):

      Nucleosome structures inside cells remain unclear. Tan et al. tackled this problem using cryo-ET and 3-D classification analysis of yeast cells. The authors found that the fraction of canonical nucleosomes in the cell could be less than 10% of total nucleosomes. The finding is consistent with the unstable property of yeast nucleosomes and the high proportion of the actively transcribed yeast genome. The authors made an important point in understanding chromatin structure in situ. Overall, the paper is well-written and informative to the chromatin/chromosome field.

      We thank Reviewer 2 for their positive assessment.

      Reviewer #3 (Public Review):

      Several labs in the 1970s published fundamental work revealing that almost all eukaryotes organize their DNA into repeating units called nucleosomes, which form the chromatin fiber. Decades of elegant biochemical and structural work indicated a primarily octameric organization of the nucleosome with 2 copies of each histone H2A, H2B, H3 and H4, wrapping 147bp of DNA in a left handed toroid, to which linker histone would bind.

      This was true for most species studied (except, yeast lack linker histone) and was recapitulated in stunning detail by in vitro reconstitutions by salt dialysis or chaperone-mediated assembly of nucleosomes. Thus, these landmark studies set the stage for an exploding number of papers on the topic of chromatin in the past 45 years.

      An emerging counterpoint to the prevailing idea of static particles is that nucleosomes are much more dynamic and can undergo spontaneous transformation. Such dynamics could arise from intrinsic instability due to DNA structural deformation, specific histone variants or their mutations, post-translational histone modifications which weaken the main contacts, protein partners, and predominantly, from active processes like ATP-dependent chromatin remodeling, transcription, repair and replication.

      This paper is important because it tests this idea whole-scale, applying novel cryo-EM tomography tools to examine the state of chromatin in yeast lysates or cryo-sections. The experimental work is meticulously performed, with a vast amount of data collected. The main findings are interpreted by the authors to suggest that the majority of yeast nucleosomes lack a stable octameric conformation. The findings are not surprising in that alternative conformations of nucleosomes might exist in vivo, but rather in the sheer scale of such particles reported, relative to the traditional form expected from decades of biochemical, biophysical and structural data. Thus, it is likely that this work will be perceived as controversial. Nonetheless, we believe these kinds of tools represent an important advance for in situ analysis of chromatin. We also think the field should have the opportunity to carefully evaluate the data and assess whether the claims are supported, or consider what additional experiments could be done to further test the conceptual claims made. It is our hope that such work will spark thought-provoking debate in a collegial fashion, and lead to the development of exciting new tools which can interrogate native chromatin shape in vivo. Most importantly, it will be critical to assess the biological implications associated with more dynamic (or more static) forms of nucleosomes, the associated chromatin fiber, and its three-dimensional organization, for nuclear or mitotic function.

      Thank you for putting our work in the context of the field’s trajectory. We hope our EMPIAR entry, which includes all the raw data used in this paper, will be useful for the community. As more labs (hopefully) upload their raw data and as image-processing continues to advance, the field will be able to revisit the question of non-canonical nucleosomes in budding yeast and other organisms.

    1. Author Response:

      Reviewer #1:

      The manuscript by Jasmien Orije and colleagues has used advanced Diffusion Tensor and Fixel-Based brain imaging methods to examine brain plasticity in male and female European starlings. Songbirds provide a unique animal model to interrogate how the brain controls a complex, learned behaviour: song. The authors used DT imaging to identify known and uncover new structural changes in grey and white matter in male and female brains. The choice of the European starling as a model songbird was smart as this bird has a larger brain to facilitate anatomical localization, clear sex differences in song behavior and well-characterized photoperiod-induced changes in reproductive state. The authors are commended for using both male and female starlings. The photoperiodic treatment used was optimal to capture the key changes in physiological state. The high sampling frequency provides the capability to monitor key changes in physiology, behaviour and brain anatomy. Two exciting findings were the increased role of the cerebellum and hippocampal recruitment in female birds engaged in singing behaviour. The development of non-invasive, multi-sampling brain imaging in songbirds provides a major advancement for studies that seek to understand the mechanisms that control the motivation and production of singing behavior. The methods described herein set the foundation to develop targeted hypotheses to study how vocal learning, such as language, is processed in discrete brain regions. Overall, the data presented in the study is extensive and includes a comprehensive analysis of regulated changes in brain microstructural plasticity in male and female songbirds.

      Reviewer #2:

      Orije et al. employed diffusion weighted imaging to longitudinally monitor the plasticity of the song control system during multiple photoperiods in male and female starlings. The authors found that both sexes experience similar seasonal neuroplasticity in multisensory systems and cerebellum during the photosensitive phase. The authors' findings are convincing and rely on a set of well-designed longitudinal investigations encompassing previously validated imaging methods. The authors' identification of a putative sensitive window during which sensory and motor systems can be seasonally re-shaped in both sexes is an interesting finding that advances our understanding of the neural basis of seasonal structural neuroplasticity in songbirds.

      Overall, this is a strong paper whose major strengths are:

      1) The longitudinal and non-invasive measure of plasticity employed

      2) The use of two complementary MR assays of white matter microplasticity

      3) The careful experimental design

      4) The sound and balanced interpretation of the imaging findings

      I do not have any major criticism but just a few minor suggestions:

      1) Pp 6-7. While the comparative description of canonical DTI with respect to fixel-based analysis is well written and of interest to readers with formal training in MR imaging, I found this entire section (and especially the paragraphs on page 7) too technical and out of context in a manuscript that is otherwise fundamentally about neuroplasticity in songbirds. The accessibility of this manuscript to non-MR experts could be improved by moving this paragraph into the methods section, or by including it as supplemental material.

      The main purpose of this section was to introduce and explain the diffusion parameters which are used throughout the rest of the paper. Furthermore, we wanted to familiarize the reader with the concept of the population based template and the different structures that can be visualized by them. We agree that the technical details might have distracted from this main message. Therefore, we have trimmed the technical details out of this section and left a short explanation of the biological relevance of the different diffusion parameters and the anatomical structures visible on the population template. The technical details that were taken out are now a part of the material and methods section.

      The section now reads as follows:

      In the current study, we analyzed the DWI scans in two distinct ways: 1) using the common approach of diffusion tensor derived metrics such as fractional anisotropy (FA); and 2) using a novel method of fiber orientation distribution (FOD) derived fixel-based analysis. Both techniques infer the microstructural information based on the diffusion of water molecules, but they are conceptually different (table 1). Common DTI analysis extracts for each voxel several diffusion parameters, which are sensitive to various microstructural changes in both grey and white matter specified in table 1. Fixel-based analysis on the other hand explores either microscopic changes in apparent fiber density (FD) or macroscopic changes in fiber-bundle cross-section (log FC) (table 1). Positive fiber-bundle cross-section values indicate expansion, whereas negative values reflect shrinkage of a fiber bundle relative to the template (Raffelt, Tournier et al. 2017).

      A population-based template created for the fixel-based analysis can be used as a study based atlas in which many of the avian anatomical structures can be identified (figure 2). We recognize many of the white matter structures such as the different lamina, occipito-mesencephalic tract (OM) and optic tract (TrO) among others. Interestingly, many of the nuclei within the song control system (i.e. HVC, robust nucleus of the arcopallium (RA), lateral magnocellular nucleus of the anterior nidopallium (LMAN), and Area X), auditory system (i.e. intercollicular nucleus complex, nucleus ovoidalis) and visual system (i.e. entopallium, nucleus rotundus) are identified by the empty spaces between tracts. The applied fixel-based approach is inherently sensitive to changes in white matter and cannot report on the microstructure within grey matter like brain nuclei; but rather sheds light on the fiber tracts surrounding and interconnecting them. As such, it provides an excellent tool to investigate neuroplasticity of different brain networks, and in the case of a nodular song control system focusing on changes in the fibers surrounding the song control nuclei, referred to as HVC surr, RA surr and Area X surr.

      2) Similarly, many sections, especially results, are in my opinion too detailed and analytical. While the employed description has the benefit of being systematic and rigorous, the ensuing narrative tends to be very technical and not easily interpretable by non experts. I think the manuscript may be substantially shortened (by at least 20% e.g. by removing overly technical or analytical descriptions of all results and regions affected) without losing its appeal and impact, but instead gaining in strength and focus especially if the new result narrative were aimed to more directly address the interesting set of questions the authors define in the introductory sections.

      We rewrote the results section, taking out the statistical reporting where it was also reported in a figure, to reduce the bulk of this section and make it more readable. We made some of the descriptions of the regions affected more approachable by replacing them with parts of the discussion. This way we incorporated some of the explanations of why certain findings are unexpected or relevant, as suggested by reviewer #3. Parts of the text that were originally in the discussion are indicated in purple.

      3) The possible effect of brain size has been elegantly controlled by using a medial split approach. Have the authors considered using tensor-based morphometry (i.e. using the 3D RARE scans they acquired) to account for where in the brain the small differences in brain size occur? That could be more informative and sensitive than a whole-brain volume quantification.

      We considered adding tensor-based morphometry, but we feel that log FC calculated with MRtrix3 can provide a similar account of the localization of these brain differences. Both methods are based on the Jacobian of the warps created between the individual images and the population template. They only differ in the starting images they use (3D RARE images in tensor-based morphometry or diffusion weighted images in the log FC metric of MRtrix3) and the fact that MRtrix3 limits itself to the volume changes along a certain tract.

      The log FC difference in figure 4 gives a similar account of the differences in brain size between both sexes. Additionally, figure 6 indicates the log FC differences between small and large brain birds.
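
      To make the link between the warp Jacobian and the sign of log FC concrete, here is a minimal sketch. It assumes that the cross-section change of a fiber bundle can be approximated as the local volume change (the Jacobian determinant) divided by the length change along the fiber direction; this is an illustrative simplification, not necessarily the exact computation performed by MRtrix3. The function names and example matrices are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


def log_fc(jacobian, fiber_dir):
    """Log cross-section change for a local warp Jacobian (assumed model).

    Volume change (determinant) divided by the length change along the
    fiber gives the change in area perpendicular to the fiber.
    """
    length = norm(fiber_dir)
    unit_dir = [c / length for c in fiber_dir]
    length_change = norm(mat_vec(jacobian, unit_dir))
    return math.log(det3(jacobian) / length_change)


fiber = [1.0, 0.0, 0.0]  # hypothetical fiber direction along x

expand = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]   # uniform growth
shrink = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]   # uniform shrinkage
stretch = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # lengthening only

print(log_fc(expand, fiber))   # positive: bundle cross-section expands
print(log_fc(shrink, fiber))   # negative: bundle cross-section shrinks
print(log_fc(stretch, fiber))  # zero: stretching along the fiber leaves the cross-section unchanged
```

      The third case illustrates the difference from a whole-volume measure: the Jacobian determinant alone would report a volume increase, whereas the tract-specific cross-section is unchanged.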

      4) I think Fig. 3 and Fig. 4 may benefit from a ROI-based quantification of parameters of interest across groups (similar to what has been done for Fig. 7 and its related Fig. 8). This could help readers assess the biological relevance of the parameter mapped. For instance, in Fig. 3, most FA differences are taking place in low FA (i.e. gray matter dense?) regions.

      We supplied the figures with extracted ROI-based parameters of figure 3 and figure 4. In line with this reasoning we also added the same kind of supplementary figures for figure 5 and 6.

      Figure 3 - figure supplement 1: Overview of the fractional anisotropy (FA) changes over time extracted from the relevant ROI-based clusters with significant sex differences. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant sex differences are reported with their p-value under the respective ROI-based cluster. Different letters denote significant differences by comparison with each other in post-hoc t-tests with p < 0.05 (Tukey’s HSD correction for multiple comparisons) comparing the different time points to each other. If two time points share the same letter, the fractional anisotropy values are not significantly different from each other.

      Figure 4 - figure supplement 2: Overview of the fiber density (FD) changes over time extracted from the relevant ROI-based clusters with significant sex differences. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant sex differences are reported with their p-value under the respective ROI-based cluster. Different letters denote significant differences by comparison with each other in post-hoc t-tests with p < 0.05 (Tukey’s HSD correction for multiple comparisons) comparing the different time points to each other. If two time points share the same letter, the FD values are not significantly different from each other. Abbreviations: surr, surroundings.

      Figure 4 - figure supplement 3: Overview of the fiber-bundle cross-section (log FC) changes over time extracted from the relevant ROI-based clusters with significant sex differences. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant sex differences are reported with their p-value under the respective ROI-based cluster. Different letters denote significant differences by comparison with each other in post-hoc t-tests with p < 0.05 (Tukey’s HSD correction for multiple comparisons) comparing the different time points to each other. If two time points share the same letter, the log FC values are not significantly different from each other. Abbreviations: surr, surroundings.

      Figure 5 - figure supplement 1: Overview of the fractional anisotropy (FA) changes over time extracted from the relevant ROI-based clusters with significant differences in brain size. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant brain size differences are reported with their p-value under the respective ROI-based cluster. Different letters denote significant differences by comparison with each other in post-hoc t-tests with p < 0.05 (Tukey’s HSD correction for multiple comparisons) comparing the different time points to each other. If two time points share the same letter, the fractional anisotropy values are not significantly different from each other. Abbreviations: C, caudal; surr, surroundings.

      Figure 6 - figure supplement 2: Overview of the fiber density (FD) changes over time extracted from the relevant ROI-based clusters with significant differences in brain size. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant brain size differences are reported with their p-value under the respective ROI-based cluster. Different letters denote significant differences by comparison with each other in post-hoc t-tests with p < 0.05 (Tukey’s HSD correction for multiple comparisons) comparing the different time points to each other. If two time points share the same letter, the FD values are not significantly different from each other. Abbreviations: C, caudal; surr, surroundings.

      Figure 6 - figure supplement 3: Overview of the fiber-bundle cross-section (log FC) changes over time extracted from the relevant ROI-based clusters with significant differences in brain size. The grey area indicates the entire photosensitive period of short days (8L:16D). Significant brain size differences are reported with their p-value under the respective ROI-based cluster. Different letters denote significant differences by comparison with each other in post-hoc t-tests with p < 0.05 (Tukey’s HSD correction for multiple comparisons) comparing the different time points to each other. If two time points share the same letter, the log FC values are not significantly different from each other. Abbreviations: C, caudal; surr, surroundings.

      5) In Abstract: "We longitudinally monitored the song and neuroplasticity in male.." Perhaps something should be specified after the "the song"? Did the authors mean "the neuroplasticity of song system"?

      No, this is not what we meant; we monitor song behavior and neuroplasticity independently. In our study, we do not limit ourselves to the neuroplasticity of the song system, but instead use a whole-brain approach. The monitoring of the song behavior in itself might be useful for other songbird researchers.

      We clarified this in the abstract as follows:

      We longitudinally monitored the song behavior and neuroplasticity in male and female starlings during multiple photoperiods using Diffusion Tensor and Fixel-Based techniques.

      Reviewer #3:

      In their paper, Orije et al used MRI imaging to study sexual dimorphisms in the brains of European starlings during multiple photoperiods and how this seasonal neuroplasticity depends on brain size, song rates and hormonal levels. The authors' main findings include differences in hemispheric asymmetries between the sexes, multisensory neuroplasticity in the song control system and beyond it in both sexes, and some dependence of singing behavior in females with large brains. The authors use different methods to quantify the changes in the MRI data to support various possible mechanisms that could be the basis of the differences they see. They also record the birds' song rates and hormonal levels to correlate the neural findings with biologically relevant variables.

      The analysis is very impressive, taking into account the massive data set that was recorded and processed. Whole-brain data-driven analysis prevented the authors from being biased to well-known sexually dimorphic brain areas. Sampling of a large number of subjects across many time points allowed for averaging in cases where individual measurements could not show statistical significance. The conclusions of the paper are mostly well supported by data (except for some confounds that the authors mention in the text). However, the extensive statistically significant results that are described in the paper make it hard to follow at times.

      1) In the introduction the authors mention the pre-optic area as a mediator of increased singing and therefore seasonal neuroplasticity. Did the authors find any differences in that area or other well-known nuclei that are involved in courtship (PAG for example)?

      Interestingly, we did not detect any seasonal changes in the pre-optic area or PAG, whereas prior studies reported volume changes in the POM within 1-2 days after testosterone administration in canaries (Shevchouk, Ball et al. 2019). In male European starlings, POM volumes changed seasonally, although this seems to depend on whether or not the males possessed a nest box (Riters, Eens et al. 2000). In our setup, the starlings are not provided with nest boxes. The lack of seasonal change in POM could therefore have a biological reason, besides the limitations of our methodology: since these are small, grey-matter-like regions, they are less likely to be picked up with our diffusion MRI methods.

      2) Following the first comment, what is the minimum volume of an area of interest that could be detected using the voxel analysis?

      The up-sampled voxel size is (0.175 × 0.175 × 0.175) mm³. In the voxel-based statistical analysis, the significance threshold is set at a minimum cluster size of 10 voxels, i.e. approximately 0.05 mm³.
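
      The arithmetic behind this threshold can be checked with a minimal sketch. The variable and function names below are illustrative and not taken from the analysis pipeline; only the edge length (0.175 mm) and the cluster size (10 voxels) come from the response above.

```python
# Voxel edge length after up-sampling, in mm (value quoted in the response).
VOXEL_EDGE_MM = 0.175
# Minimum cluster size used as the significance threshold, in voxels.
MIN_CLUSTER_VOXELS = 10


def voxel_volume_mm3(edge_mm):
    """Volume of one isotropic voxel, in cubic millimetres."""
    return edge_mm ** 3


def min_cluster_volume_mm3(edge_mm, n_voxels):
    """Smallest detectable cluster volume for a given cluster-size threshold."""
    return n_voxels * voxel_volume_mm3(edge_mm)


volume = min_cluster_volume_mm3(VOXEL_EDGE_MM, MIN_CLUSTER_VOXELS)
print(f"Minimum cluster volume: {volume:.4f} mm^3")
```

      Rounded to two decimals, this gives the 0.05 mm³ stated above.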

      3) It would be useful to have a figure describing the song system in European starlings and how the auditory areas, the cerebellum and the hippocampus are connected to it, before describing the results. It would make it easier for the broader community to make a better sense of the results.

      An additional figure was added to the introduction to give a schematic overview of the song control system, the auditory system and the proposed cerebellar and hippocampal projections. This scheme includes both a 2D, and a 3D representation as well as a movie of the 3D representation of the different nuclei and the tractography.

      Figure 1: Simplified overview of the experimental setup (A), schematic overview of the song control and auditory system of the songbird brain and the cerebellar and hippocampal connections to the rest of the brain (B) and unilateral DWI-based 3D representation of the different nuclei and the interconnecting tracts as deduced from the tractogram (C). Male and female starlings were measured repeatedly as they went through different photoperiods. At each time point, their songs were recorded, blood samples were collected and T2-weighted 3D anatomical and diffusion weighted images (DWI) were acquired. The 3D anatomical images were used to extract whole brain volume (A). The song control system is subdivided in the anterior forebrain pathway (blue arrows) and the song motor pathway (red arrows). The auditory pathway is indicated by green arrows. The orange arrows indicate the connection of the lateral cerebellar nucleus (CbL) to the dorsal thalamic region further connecting to the song control system as suggested by (Person, Gale et al. 2008, Pidoux, Le Blanc et al. 2018) (B,C). Nuclei in (C) are indicated in grey, the tractogram is color-coded according to the standard red-green-blue code (red = left-right orientation (L-R), blue = dorso-ventral (D-V) and green = rostro-caudal (R-C)). For abbreviations see abbreviation list.

      Figure 1 – figure supplement 1: Movie of the unilateral 3D representation of the different nuclei and the interconnecting tracts rotating along the vertical axis.

      4) In the results section the authors clearly describe which brain areas are sexually dimorphic or change during the photoperiod and what is the underlying reason for the difference. However, only in the discussion section it is clearer why some of those differences are expected or surprising. It would be useful to incorporate some of those explanations in the results section other than just having a long list of brain areas and metrics. For example, I found the involvement of visual and auditory areas in the female brain in the mating season very interesting.

      In addition to the reductions in technical explanation suggested by reviewer #2, we replaced some of the description of significant regions with parts of the discussion and vice versa (indicated in purple). This way we incorporated some of the explanations of why certain findings are unexpected or relevant. Furthermore, we added some extra information on why these changes are relevant for the visual system and the cerebellum.

      In line 420: Neuroplasticity of the visual system could be relevant to prepare the birds for the breeding season, where visual cues like ultraviolet plumage colors are important for mate selection (Bennett, Cuthill et al. 1997).

      In line 424: This shows that multisensory neuroplasticity is not limited to the cerebrum, but also involves the cerebellum, something that has not yet been observed in songbirds.

    1. Author Response:

      Reviewer #1 (Public Review):

      This manuscript integrates conditional mouse models for TRAP, PAPERCLIP and FMRP-CLIP together with compartment specific profiling of mRNA in hippocampal CA1 neurons. Previously, similar approaches have been used to interrogate mRNA localization, differential regulation of 3'UTR isoforms, their local translation, and FMRP-dependent mRNA regulation. This study builds on these previous findings by combining all three approaches, together with analysis of mRNA dysregulation in Fmr1 KO neuron model of FXS. The strengths of the paper are the rich data sets and innovative integration of methods that will provide a valuable technical resource for the field. The weakness of the paper is the limited conceptual advance as well as lack of deeper mechanistic insights on FMRP biology over previous studies, although the present study validates and integrates past studies, adding some new information on 3'UTR isoforms.

      We appreciate the Reviewer’s recognition that “the present study validates and integrates past studies, adding some new information on 3'UTR isoforms”. We also appreciate the Reviewer’s recognition that “The strengths of the paper are the rich data sets and innovative integration of methods that will provide a valuable technical resource for the field.”

      We differ, however, with the concern that the work presents a “limited conceptual advance.” Specifically, we find, for the first time, that FMRP regulates two different biologically coherent sets of mRNAs in CA1 neuronal cell bodies and neurites. This provides a profound new insight into FMRP-RNA regulation, including the fact that these two different sets of mRNA targets (encoding chromatin-associated proteins and synaptic proteins, respectively) are both translationally regulated by FMRP and transcribed from genes implicated in autism.

      We recognize that FMRP was known, by our own work and that of others (as noted by the Reviewer) to regulate specific targets “in bulk” in neuronal cell types, brain and even in CA1 neurons. What is most unexpected here? Among directly bound FMRP mRNAs in brain CA1 neurons, there is subcellular compartmentalization of this regulation. This is new for FMRP, and in fact is new for RNA binding proteins more generally (recognizing of course the extensive work on RNA localization in different compartments previously discovered by others, beginning with Rob Singer’s work on actin localization and up to the present in work on neurons).

      We also think it is important for readers to understand up-front the novelty in the “combining approaches” referred to. We use cell-specific (cTag) CLIP to define direct FMRP interactions in subcompartments (dendrites vs. cell bodies) of CA1 neurons within mouse brain hippocampus. We also normalize these data to ribosome-bound mRNAs in CA1 neurons, and validate observations by studying WT and FMRP-null brains. This set of complex mouse models and methods is completely new, and its application is what allowed us to make robust conclusions about FMRP translational regulation of different mRNAs in different cellular compartments.

      We strongly disagree with the Reviewer’s comment that FMRP directly interacting with functional classes of mRNAs in different cellular compartments “has previously been shown in the field.” To our knowledge, compartment-specific FMRP-CLIP has not been reported, much less in a cell-type specific manner. Our previous cell-type specific FMRP-CLIP experiments have been on bulk neuronal material (Sawicka et al. 2019; Van Driesche et al., n.d.). Although cell-type specific TRAP-seq has been performed on microdissected CA1 compartments (Ainsley et al. 2014), investigators were unable to isolate significant amounts of RNA from resting neurons, and degradation of the isolated RNAs did not allow the types of 3’UTR and alternative splicing analyses that were performed here. The Schuman group has performed extensive analysis of mRNAs from microdissected CA1 compartments (Cajigas et al. 2012a; Tushev et al. 2018), but has not performed FMRP-CLIP or any experiments using cell-type specific or direct protein-RNA regulatory methods. In vitro systems have been used to analyze mRNA localization in FMRP KO systems (e.g. Goering et al. 2020), but in vitro systems are unable to fully recapitulate the complexities of in vivo brain regions, and these studies did not analyze direct RNA-protein interactions. As our work is on in vivo brain slices, is cell-type specific, and integrates TRAP-seq, PAPERCLIP and CLIP-seq datasets, we believe that our work is novel and will be of great interest to the field.

      Despite the fact that FMRP targets are overrepresented in the dendritic transcriptome, it does not appear from this study that FMRP plays an active role in the mechanism of dendritic mRNA localization, at least under steady state conditions. One goal of the manuscript is to address a major question in the mRNA localization field, which is how FMRP may differentially modulate "localization" of functional classes of mRNAs such as those encoding transcriptional regulators and synaptic plasticity genes (Line 78-90). The data here indicate that FMRP directly interacts with functional classes of mRNAs in different cellular compartments, which has previously been shown in the field. However, no evidence is provided that mechanistically reveal a role for FMRP to promote subcellular localization of different functional classes of mRNAs. The correlative evidence presented in this manner does not add mechanistic insight.

      We do recognize that the question of what localizes FMRP mRNA targets differentially in the dendrite (and cell body) is of great interest, and remains unanswered. We also appreciate that, despite the Reviewer’s comment above, they also recognize “it does not appear from this study that FMRP plays an active role in the mechanism of dendritic mRNA localization, at least under steady state conditions.”

      We believe that some of the confusion here lies in the Reviewer’s comment “One goal of the manuscript is to address a major question in the mRNA localization field, which is how FMRP may differentially modulate "localization" of functional classes of mRNAs such as those encoding transcriptional regulators and synaptic plasticity genes (Line 78-90).” While this is a question of interest that has been studied, we think there is a major disconnect here in the Reviewer’s comments and our findings. To be clear, in the original manuscript, we did not find evidence, in WT vs KO CA1 neurons, that FMRP was acting to differentially localize mRNAs, including those mentioned by the Reviewer.

      Nonetheless, to further address the issue of a possible role for FMRP in localizing the transcripts it regulates, we have now performed quantitative analysis of FMRP target mRNA localization in dendrites from WT vs. Fmr1 KO mice. These results are now presented in Supplemental Figures 9 and 10 of the manuscript, and are shown and summarized below.

      Supplemental Figure 9. FMRP is not required for localization of its targets into the dendrites of CA1 neurons. A) Dendrite-enriched mRNAs were defined in FMRP KO mice (red) in the same manner as in Figure 1 for FMRP WT animals using bulk RNA-seq and TRAP-seq data. Overlap with dendrite-enriched mRNAs in WT (Figure 1, shown here in green) and CA1 FMRP targets (blue) is shown. 95.6% of dendrite-enriched FMRP targets in the WT were also found to be enriched in the dendrites of FMRP KO animals. B) Dendrite-present mRNAs were defined in FMRP KO. Overlap with dendrite-present mRNAs in WT (Figure 1) and CA1 FMRP targets is shown. 95.7% of dendrite-present FMRP targets in WT are also found to be dendrite-present in KO animals. C-E) FISH was performed to assess FMRP target localization (Kmt2d (C), Lrrc7 (D) and Map2 (E)) in FMRP KO mouse brain slices. Left panel shows the proportion of detected mRNAs that were detected in the neuropil (> 10 µm from the predicted cell body layer) in WT and KO animals. Wilcoxon rank-sum tests were performed to assess significance. Middle panel shows densitometry of 1000 spots sampled from each picture analyzed. Distance from the CB was determined as described in methods and Figure 1. In the right panel, spots were binned into 15 groups according to the distance traveled from the CB, and the fraction of spots in each genotype in this range was analyzed by t-test to determine differences in the fraction of spots at each location in FMRP WT and KO animals (* indicates p-value < .05, ** is < .01).

      Supplemental Figure 10. FMRP is not required for differential localization of 3’UTR isoforms of its targets. A) Differential 3’UTR usage was analyzed using DEXseq as described in Figure 2 to identify 3’UTRs whose ratio of usage between neuropil and CB was altered between FMRP WT and KO animals. Shown are results from the DEXseq analysis, giving the log2foldChange (neuropil vs cell bodies, KO vs WT) and -log10(p-value) of each 3’UTR. Gray spots indicate that all 3’UTRs analyzed have an FDR > .05, indicating no significant change in usage between FMRP KO and WT animals. B and C) FISH analysis of localization of 3’UTR isoforms of Cnksr2 (B) and Anks1b (C) in FMRP WT and KO animals. These genes were found in Figure 2 to express 3’UTR isoforms that are differentially localized to dendrites. Sequestered isoforms are those that are significantly localized to cell bodies in FMRP WT, and Localized isoforms are those that are significantly used in the dendrites of WT CA1 neurons. Left panel, the fraction of spots that are found to be localized to the neuropil (> 10 µm from the cell body layer) is shown for each isoform in FMRP WT and KO animals. Differences were assessed by Wilcoxon rank-sum tests. Middle panel, densitometry of the distance traveled from the cell bodies for a representative 1000 spots from each picture that was analyzed. Right panel, as described in Supplemental Figure 9, detected mRNAs were binned into 15 bins according to the distance traveled from the cell bodies, and differences in the fractions of spots in each bin in FMRP WT and KO slices were analyzed. Significance indicates results of t-tests (* indicates p-value < .05).
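
      The spot quantification described in the legends of Supplemental Figures 9 and 10 (neuropil fraction beyond 10 µm, and binning into 15 distance groups) can be illustrated with a minimal sketch. This is not the code used for the figures: the function names, the equal-width binning, the maximum distance and the example distances are all assumptions made for illustration.

```python
def neuropil_fraction(distances_um, cutoff_um=10.0):
    """Fraction of spots lying further than cutoff_um from the cell body layer."""
    if not distances_um:
        raise ValueError("no spots supplied")
    in_neuropil = sum(1 for d in distances_um if d > cutoff_um)
    return in_neuropil / len(distances_um)


def binned_fractions(distances_um, max_distance_um, n_bins=15):
    """Fraction of spots in each of n_bins equal-width distance bins.

    Equal-width bins are an assumption; the legends only state that spots
    were binned into 15 groups by distance from the cell bodies.
    """
    if not distances_um:
        raise ValueError("no spots supplied")
    width = max_distance_um / n_bins
    counts = [0] * n_bins
    for d in distances_um:
        # Spots at or beyond max_distance_um are placed in the last bin.
        index = min(int(d // width), n_bins - 1)
        counts[index] += 1
    return [count / len(distances_um) for count in counts]


# Hypothetical distances (in micrometres) for one image, for illustration only.
example_distances = [0.0, 5.0, 12.0, 30.0, 149.9, 150.0]
print(neuropil_fraction(example_distances))
print(binned_fractions(example_distances, max_distance_um=150.0))
```

      Per-bin fractions computed this way for each image could then be compared between genotypes bin by bin, as the legends describe for the t-tests.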

      In summary, we characterized the dendritic transcriptome in FMRP KO animals, and compared it to the FMRP WT results presented in Figures 1 and 2, as suggested by the Reviewers. We find that the dendritic transcriptome of FMRP KO animals is extremely similar to that of FMRP WT animals, with ~95% of mRNAs found to be dendrite-present or dendrite-enriched in WT also being found in FMRP KO animals (Figure S9). We validated these results with FISH and found no evidence for significant disruption in the localization of FMRP targets Kmt2d (Figure S9C), Lrrc7 (Figure S9D) or Map2 (Figure S9E) to the CA1 neuropil.

      To detect FMRP-dependent changes in distribution of 3’UTR isoforms of FMRP targets, we first performed global analysis of 3’UTR usage in TRAP from FMRP KO animals, using the expressed 3’UTR isoforms that were found in Figure 2. DEXseq analysis on 3’UTR expression in CA1 neuropil vs cell bodies TRAP showed no significant instances of altered 3’UTR usage ratios in FMRP KO animals (Figure S10A). We validated these results by performing FISH on the sequestered and localized 3’UTR isoforms of Cnksr2 and Anks1b genes and show no significant changes in the localization of the 3’UTR isoforms in FMRP KO animals (Figure S10B-C). Taken together, this data suggests that FMRP is not significantly involved in localization of its targets in resting CA1 neurons, but rather shows remarkable selection for localized mRNA isoforms. Instead, we find evidence that FMRP regulates the ribosome association of its targets in a compartment-specific manner by showing an increase in ribosome association of a subset of FMRP targets in the dendrites of CA1 neurons (see Figure 7E).

      Besides the addition of the figures described above, we have also now made corrections to the text of the manuscript, enumerated below, to address this.

      First, we have, as much as possible, reduced our emphasis throughout the manuscript on the “localization” of mRNAs and instead point out that the study seeks to characterize the differences between the regulated transcriptomes in CA1 cell bodies and dendrites. For example, for Figure 4, instead of characterizing the log2FoldChange (neuropil vs CA1 cell bodies) as “dendritic localization”, we change the wording to “relative dendritic abundance” to focus on changes in the abundance of these transcripts in the dendrite vs the cell bodies. We also changed the section heading in the results that describes analysis in the FMRP KO animal from “Dysregulation of mRNA localization in FMRP KO animals” to “FMRP regulates the ribosome association of its targets in dendrites”. We believe that these changes will help to clear up this confusion for the reader.

      Second, we reformatted the model in Figure 7F. The new version of the model (shown here) emphasizes the point that our study reveals compartment-specific FMRP regulation of a subset of its targets without implying a role for FMRP in the mRNA localization of these transcripts. The text of the manuscript and figure legends have been updated accordingly.

      Figure 7F: Distinct, compartment-specific FMRP regulation of functionally distinct subsets of mRNAs in CA1 cell bodies and dendrites. In dendrites, the absence of FMRP increases the ribosome association of its targets; this finding is consistent with a model in which FMRP inhibits ribosomal elongation and thereby translation (J. C. Darnell et al. 2011). In resting neurons, the translation of FMRP-bound mRNAs encoding synaptic regulators (FM2 and FM3 mRNAs) is repressed. When FMRP is absent, due to either genetic alteration (FMRP KO or FXS) or neuronal activity-dependent regulation (e.g. FMRP calcium-dependent dephosphorylation (Lee et al. 2011; Bear, Huber, and Warren 2004)), ribosome association and translation of targets are increased. In cell bodies, FMRP binds mRNAs that encode chromatin regulators (the FM1 cluster of FMRP targets), as well as FM2/3 mRNAs (consistent with synapses forming on the cell soma). FM1 targets show patterns of mRNA regulation similar to what our group observed in bulk CA1 neurons: FMRP target abundance is decreased in FMRP KO cells, perhaps due to loss of FMRP-mediated block of degradation of mRNAs with stalled ribosomes (Sawicka et al. 2019; R. B. Darnell 2020).

      Third, we have revised the Discussion in order to more completely discuss the model above and also emphasize the finding that FMRP was not found to be involved in the localization of its mRNA targets, but rather in the regulation of the local translation of its targets in a compartment-specific manner. We further speculate on the roles of FMRP in regulation of mRNA abundance and translation in these compartments.

      We hope that these changes better reflect the interpretation and novelty of our findings for both the Reviewers and the readers.

      Further related to a role of FMRP in mRNA localization, a recent paper in eLife reports that FMRP RGG box promotes mRNA localization of a set of FMRP targets through G-quadruplexes (Goering et al 2020). This relevant paper needs to be cited and discussed.

      We apologize for this omission, and have now cited and discussed this paper in the Results and Discussion of the manuscript. Importantly, we find that dendrite-enriched mRNAs have high GC content (see figure below, which is now Supplemental Figure 5). This complicates the discovery of potential G-quadruplexes; put another way, G-rich mRNAs will therefore be enriched when compared to not-localized mRNAs, and this is also true for C-rich mRNAs. Dendrite-enriched FMRP directly-bound CA1 neuronal targets (defined by CLIP) are actually G-poor when compared to dendrite-enriched FMRP non-targets (see new Figure S5 and below).

      Supplemental Figure 5A-D: Dendrite-enriched mRNAs are GC rich, and dendrite-enriched FMRP targets are GC poor compared to dendrite-enriched non-FMRP targets. A) Schematic of the overlap between CA1 FMRP targets and dendrite-enriched mRNAs (defined in Main Figure 1). B) GC content, as defined by percent G + C, for all CA1 mRNAs, dendrite-enriched mRNAs (1211), dendrite-enriched FMRP targets (413), and dendrite-enriched non-FMRP targets (798, see A). Stars indicate significance in Wilcoxon rank-sum tests (* is p < .05, ** is p < .0001). C) G content, as defined by percent G. D) C content, as defined by percent C.
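
      The three quantities plotted in this figure (percent G + C, percent G, percent C) follow directly from base counts. A minimal sketch, assuming each mRNA is available as a plain nucleotide string; the function name and the example sequence are hypothetical and not part of the analysis code:

```python
def base_percentages(sequence):
    """Return percent G, percent C and percent G + C for one sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    percent_g = 100.0 * seq.count("G") / len(seq)
    percent_c = 100.0 * seq.count("C") / len(seq)
    return {"G": percent_g, "C": percent_c, "GC": percent_g + percent_c}


# Hypothetical 8-nt sequence with two G and two C residues.
print(base_percentages("GGCCAATT"))
```

      Reporting G and C separately, as in panels C and D, is what allows a G-specific effect to be distinguished from overall GC richness.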

      In light of these observations, analysis of G- or C- containing motifs needs to be examined in this context. To this end, we performed the experiments suggested here, but did so by searching for the prevalence of G-quadruplexes in dendrite-enriched FMRP targets versus dendrite-enriched FMRP non-targets (Figure S5A). To do this, we used both experimentally-defined G-quadruplexes (described in (Guo and Bartel 2016), Figure S5E), as well as motifs (described in (Goering et al. 2020), Figure S5F). We include the results below, and in a new Figure S5 in the paper.

      Supplemental Figure 5E-F: mRNAs containing G-quadruplexes are not enriched in dendritic FMRP targets vs dendrite-enriched non-FMRP targets. E) The percent of all CA1 mRNAs, all dendrite-enriched mRNAs, dendrite-enriched FMRP-bound targets (413), and dendrite-enriched non-FMRP targets (798) that contain experimentally-defined G-quadruplexes is plotted. Shown are the results of chi-squared analysis comparing the enrichment of G-quadruplex containing mRNAs in dendrite-enriched FMRP targets vs dendrite-enriched non-FMRP targets. F) As in E, except looking for the presence of mRNAs with G-quadruplex motifs in 3’UTRs as described in (Goering et al. 2020)

      Interestingly, we found no difference in the presence of G-quadruplex motifs in the 3’UTRs of these two sets (above and new Supplemental Figure 5). For example, of 413 dendrite-enriched FMRP targets, 100 (24%) had experimentally defined G-quadruplexes in the 3’UTRs, while 159 (22.5%) dendrite-enriched non-FMRP targets had experimentally defined G-quadruplexes. These differences were not significant (by chi-square test).

      Searching the 3’UTR sequences of the 413 dendrite-enriched FMRP targets above for G-quadruplex motifs (as described in (Goering et al. 2020), which searched for an empirically derived specific motif: GW--G, separated by 7nt), we found only 3 instances in dendrite-enriched FMRP-bound target mRNAs. Similarly, out of 798 non-FMRP targets, only a small subset (6) contained this specific motif in their 3’UTRs. These results were not significant (chi-square test).
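
      As an illustration of the comparison reported above, the sketch below computes a Pearson chi-square statistic for a 2×2 table from the motif counts given in the text (3 of 413 targets vs. 6 of 798 non-targets). The exact test variant used for the manuscript (for example, whether a continuity correction was applied) is not stated here, so the uncorrected statistic and the function name are assumptions.

```python
def chi_square_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    Assumes both columns (motif present / motif absent) have a non-zero total.
    """
    observed = [
        [hits_a, total_a - hits_a],
        [hits_b, total_b - hits_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [hits_a + hits_b, grand_total - hits_a - hits_b]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Critical value of the chi-square distribution for df = 1 at alpha = 0.05.
CRITICAL_VALUE_DF1 = 3.841

# Counts quoted in the text: motif-containing mRNAs among targets and non-targets.
statistic = chi_square_2x2(3, 413, 6, 798)
print(statistic, statistic < CRITICAL_VALUE_DF1)
```

      With expected counts this small, an exact test would be a reasonable alternative; the sketch only shows that the quoted counts give a statistic well below the 0.05 critical value.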

      In summary, we do not find evidence in our data of G-quadruplexes playing a role in determination of FMRP binding in CA1 dendrites. This data is now included in the results and discussed in the Discussion of the paper.

      Reviewer #2 (Public Review):

      The authors performed transcriptomic analyses from compartment-specific, micro-dissected hippocampal CA1 region tissue from transgenic mice. One feature that distinguishes this work from previous studies is the use of conditional knock-in of tags (GFP or HA) and tissue specific expression of the Cre recombinase to target a very specific population of pyramidal neurons in the CA1 region--as well as the combined use of TRAPseq, PAPERCLIP and FMRP-CLIP. Also, central to this work are the analysis pipelines that look at large populations of mRNA with the goal of finding features shared by those mRNA that bind FMRP.

      First, they established the identity of mRNAs that are dendritically enriched and/or alternatively polyadenylated (APA) by sequencing, followed by validation of a few candidates using smFISH. Next, the APA data was filtered through the rMATS statistical program to identify alternatively spliced (AS) mRNA variants within the APA population. The authors concluded that the majority of splicing events were of the exon-skipping type with NOVA2 as the likely culprit leading to this differential localization of AS isoforms. The authors then proceeded to perform FMRP-CLIP which was analyzed against the TRAP dataset. The (413) mRNAs that were shared by the two experiments (TRAP and FMRP-CLIP) exhibited two notable features: dendrite-enrichment and longer average transcript length. More importantly, they demonstrated that FMRP can preferentially bind to an AS isoform that is enriched in dendrites. Further analyses of FMRP CLIP targets showed that they shared a significant level of genes designated by gene set enrichment analysis (GSEA) as involved in ion transport and receptor signaling and similarly for ASD-related candidate genes.

      Strengths: -The combined use of tissue-specific Cre and conditional tags for RPL22, PABPC1 and FMRP help make these pull-downs highly specific and robust. -RNA sequencing approach allows for identification and comparison of populations of ribosome-, PABPC1- and FMRP-associated mRNAs. -Preferential binding of FMRP to AS or APA isoforms in dendrites is an impactful and significant finding.

      Weaknesses: -A caution in interpreting comparative or differential RNA-sequencing results as some are correlative.

      We appreciate this concern, and agree that RNA-seq analysis alone can be difficult to interpret. However, we feel that our unique approach of combining multiple cell-type specific approaches, including CLIP-seq and PAPERCLIP along with TRAP-seq and RNA-seq, results in stronger conclusions that are supported by multiple lines of evidence.

      -Validation of FMRP interaction with AS or APA isoforms or ASD candidates by smFISH-IF is lacking.

      We find that smFISH-IF in the CA1 neuropil is difficult to interpret in mouse brain slices due to dense networks of processes in addition to contaminating cell types, making IF signals dense, noisy and difficult to quantitate. Although we could theoretically attempt these experiments using an in vitro cell culture model, we believe that the novelty of our work is in a) the cell-type specific nature of our analyses and in b) the fact that our analysis and validation is all performed in vivo. We do not feel confident that in vitro systems are similar enough to our in vivo system to be relevant for this work. This is due not only to differences in their transcriptomes, but also due to the limited number of synapses in vitro cells make with other neurons when compared to CA1 neurons in the brain. Instead, we validate the interactions between FMRP and AS and APA isoforms by isolating junction reads among FMRP-CLIP tags isolated in a cell-type specific manner from intact mouse brains (Figure 5). In this manner, we find direct evidence of FMRP selectively binding to dendritic mRNA isoforms in vivo.

      -Although hippocampal CA1 region is an excellent site to study FMRP-RNA interactome, are there other projection systems where altered FMRP-RNA interaction may lead to greater dysfunction?

      We appreciate this point and now include this in the revised Discussion.

    1. Author response

      Reviewer #1 (Public Review):

      In their paper, Kroell and Rolfs use a set of sophisticated psychophysical experiments in visually-intact observers, to show that visual processing at the fovea within the 250ms or so before saccading to a peripheral target containing orientation information, is influenced by orientation signals at the target. Their approach straddles the boundary between enforcing fixation throughout stimulus presentation (a standard in the field) and leaving it totally unconstrained. As such, they move the field of saccade pre-processing towards active vision in order to answer key questions about whether the fovea predicts features at the gaze target, over what time frame, with what precision, and over what spatial extent around the foveal center. The results support the notion that there is feature-selective enhancement centered on the center of gaze, rather than on the predictively remapped location of the target. The results further show that this enhancement extends about 3 deg radially from the foveal center and that it starts ~ 200ms or so before saccade onset. They also show that this enhancement is reinforced if the target remains present throughout the saccade. The hypothesized implications of these findings are that they could enable continuity of perception trans-saccadically and potentially, improve post-saccadic gaze correction.

      Strengths:

      The findings appear solid and backed up by converging evidence from several experimental manipulations. These included several approaches to overcome current methodological constraints to the critical examination of foveal processing while being careful not to interfere with saccade planning and performance. The authors examined the spatial frequency characteristics of the foveal enhancement, together with hit rates and false alarm rates for detecting a foveal probe that was congruent or incongruent in terms of orientation to the peripheral saccade target embedded in flickering, dynamic noise (1/f) images. While hit rates are relatively easy to interpret, the authors also reconstructed key features of the background noise to interpret false alarms as reflecting foveal enhancement that could be correlated with target orientation signals. The study also - in an extensive Supplementary Materials section - uses appropriate statistical analyses and controls for multiple factors impacting experimental/stimulus design and analysis. The approach, as well as the level of care towards experimental details provided in this manuscript, should prove welcome and useful for any other investigators interested in the questions posed.

      Weaknesses:

      I find no major weaknesses in the experiments, analyses or interpretations. The conclusions of the paper appear well supported by the data. My main suggestion would be to see a clearer discussion of the implications of the present findings for truly naturalistic, visually-guided performance and action. Please consider the implication of the phenomena and behaviors reported here when what is located at the gaze center (while peripheral targets are present) is not a noisy, relatively feature-poor, low-saliency background, but another high-saliency target, likely crowded by other nearby targets. As such, a key question that emerges and should be addressed in the Discussion at least is whether the fovea's role described in the present experiments is restricted to the visual scenarios used here, or whether it generalizes to the rather different visual environments of everyday life.

      This is a very interesting question. While we cannot provide a definite answer, we have added a paragraph discussing the role of foveal prediction in more naturalistic visual contexts to the Discussion section (‘Does foveal prediction transfer to other visual features and complex natural environments?’). We pasted this paragraph in response to another comment in the ‘Recommendations for the authors’ section below. We suggest that “the pre-saccadic decrease in foveal sensitivity demonstrated previously[9] as well as in our own data (Figure 2B) may boost the relative strength of fed-back signals by reducing the conspicuity of foveal feedforward input”, presumably allowing the foveal prediction mechanism to generalize to more naturalistic environments with salient foveal stimulation.

      Reviewer #2 (Public Review):

      Humans and other primates move their eyes with rapid saccades to reposition the high-resolution region of the retina, the fovea, over objects of interest. Thus, each saccade involves moving the fovea from a pre-saccadic location to a saccade target. Although it has long been known that saccades profoundly alter visual processing around the time of a saccade, scientists simply do not know how the brain combines information across saccades to support our normal perceptual experience. This paper addresses a piece of that puzzle by examining how eye movements affect processing at the fovea before it moves. Using a dynamic noise background and a dual psychophysical task, the authors probe both the performance and selectivity of visual processing for orientation at the fovea in the few hundred milliseconds preceding a saccade. They find that hit rates and false alarm rates are dynamically and automatically modulated by saccade planning. By taking advantage of the specific sequence of noise shown on each trial, they demonstrate that the tuning of foveal processing is affected by the orientation of the saccade target, suggesting fovea-specific feedback.

      A major strength of the paper is the experimental design. The use of dynamic filtered noise to probe perceptual processing is a clever way of measuring the dynamics of selectivity at the fovea during saccade preparation. The use of a dual-task allows the authors to evaluate the tuning of foveal processing as well and how it depends on the peripheral target orientation. They show compellingly that the orientation of the saccade target (the future location of the fovea) affects processing at the fovea before it moves.

      There are two weaknesses with the paper in its current form. The first is that the key claim of foveal "enhancement" relies on the tuning of the false alarms. A more standard measure of enhancement would be to look at the sensitivity, or d-prime, of the performance on the task. In this study, hits and false alarms increase together, which is traditionally interpreted as a criterion shift and not an enhancement. However, because of the external noise, false alarms are driven by real signals. The authors are aware of this and argue that the fact that the false alarms are tuned indicates enhancement. But it is unclear to me that a criterion shift wouldn't also explain this tuning and the change in the noise images. For example, in a task with 4 alternative choices (Present/Congruent, Present/Incongruent, Absent/Congruent, Absent/Incongruent), shifting the criterion towards the congruent target would increase hits and false alarms for that target and still result in a tuned template (because that template is presumably what drove the decision variable that the adjusted criterion operates on). I believe this weakness could be addressed with a computational model that shows that a criterion shift on the output of a tuned template cannot produce the pattern of hits and false alarms.
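The distinction the reviewer draws between a sensitivity change and a criterion shift can be made concrete with a minimal sketch under the standard equal-variance Gaussian signal detection model. The function names and the numeric values below are illustrative assumptions, not quantities or analyses from the manuscript.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def rates_from(d_prime, criterion):
    """Hit and false-alarm rates predicted by an equal-variance Gaussian model.

    Hypothetical helper: the signal distribution sits at +d'/2, the noise
    distribution at -d'/2, and the observer responds "present" above `criterion`.
    """
    hit_rate = _STD_NORMAL.cdf(d_prime / 2 - criterion)
    fa_rate = _STD_NORMAL.cdf(-d_prime / 2 - criterion)
    return hit_rate, fa_rate


def sdt_measures(hit_rate, fa_rate):
    """Return (d_prime, criterion) from a hit rate and a false-alarm rate.

    Both rates must lie strictly between 0 and 1; inv_cdf raises otherwise.
    """
    z_hit = _STD_NORMAL.inv_cdf(hit_rate)
    z_fa = _STD_NORMAL.inv_cdf(fa_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)


# Illustrative values only: same sensitivity, more liberal criterion.
neutral = rates_from(d_prime=1.0, criterion=0.0)
liberal = rates_from(d_prime=1.0, criterion=-0.5)
```

In this toy case both the hit rate and the false-alarm rate rise when the criterion becomes more liberal, while the recovered d-prime stays at 1.0. That is the pattern the reviewer describes as traditionally indicating a criterion shift rather than enhanced sensitivity.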

      We thank the reviewer for this comment. We will present three arguments, each of which suggests that our effects are perceptual in nature and cannot be explained by a shift in decision criterion: (1) the temporal specificity of the difference in Hit Rates (HRs), (2) the spatial specificity of the difference in HRs and (3) the phenomenological quality of the foveally predicted signal. In general, a criterion shift would indeed affect hits and false alarms alike. Nonetheless, the difference in HRs only manifested under specific and meaningful conditions:

      First, the increase in congruent as compared to incongruent HRs, i.e., enhancement, was temporally specific: congruent and incongruent HRs were virtually identical when the probe appeared in a baseline time bin or one (Figure 2B) or even two (Figure 4A) early pre-saccadic time bins. Based on another reviewer’s comment, we collected additional data to measure the time course and extent of foveal enhancement during fixation. While pre-saccadic enhancement developed rapidly, enhancement started to emerge 200 ms after target onset during fixation. Crucially, these time courses mirror the typical temporal development of visual sensitivity during pre-saccadic attention shifts and covert attentional allocation, respectively[8,33]. We are unaware of data demonstrating similar temporal specificity for a shift in decision criterion. One could argue that a template of the target orientation needs to build up before it can influence criterion. Nonetheless, this template would be expected to remain effective after this initial temporal threshold has been crossed. In contrast, we observe pronounced enhancement in medium but not late stages of saccade preparation in the PRE-only condition (Figure 4A).

      Second, it has been argued that a defining difference between innately perceptual effects and post-perceptual criterion shifts is their spatial specificity[53]: in opposition to perceptual effects, criterion shifts should manifest in a spatially global fashion. Due to a parafoveal control condition detailed in our reply to the next comment, we maintain the claim that enhancement is spatially specific: congruent HRs exceeded incongruent ones within a confined spatial region around the center of gaze. We did not observe enhancement for probes presented at 3 dva eccentricity even when we raised parafoveal performance to a foveal level by adaptively increasing probe contrast. The accuracy of saccade landing or, more specifically, the mean remapped target location (Figure 3B) influenced the spatial extent of the enhanced region in a fashion that is reconcilable with previous findings[30]. A criterion shift that is both spatially and temporally selective, follows the time course of pre-saccadic or covert attention depending on observers’ oculomotor behavior, does not remain effective throughout the entire trial after its onset, is sensitive to the mean remapped target location across trials, and does not apply to parafoveal probes even after their contrast has been increased to match foveal performance, would be unprecedented in the literature and, even if existent, appear just as functionally meaningful as sensitivity changes occurring under the same conditions.

      Lastly and on a more informal note, we would like to describe a phenomenological percept that was spontaneously reported by 6 out of 7 observers in Experiment 1 and experienced by the author L.M.K. many times. On a small subset of trials, participants in our paradigms have the strong phenomenological impression of perceiving the target in the pre-saccadic center of gaze. This percept is rare but so pronounced that some observers interrupt the experiment to ask which probe orientation they should report if they had perceived two on the same trial (“The orientation of the normal probe or of the one that looked exactly like the target”). Interestingly, the actual saccade target and its foveal equivalent are perceived simultaneously in two spatiotopically separate locations, suggesting that this percept cannot be ascribed to a temporal misjudgment of saccade execution (after which the target would have actually been foveated). We have no data to prove this observation but nonetheless wanted to share it. Experiencing it ourselves has left us with no doubt that the fed-back signal is truly – and almost eerily – perceptual in nature.

      The analysis suggested by the reviewer is very interesting. Yet for several reasons stated in the ‘Suggestions to the authors’ section, our dataset is not cut out for an analysis of noise properties at this level of complexity. We had always planned to resolve these concerns experimentally, i.e., by demonstrating specificity in HRs. We believe that our arguments above provide a strong case for a perceptual phenomenon and have incorporated them into the Discussion of our revised manuscript.

      The second weakness is that the authors' claim that feedback is spatially selective to the fovea is confounded by the fact that acuity and contrast sensitivity are higher in the fovea. Therefore, the subject's performance would already be spatially tuned. Even the very central degree, the foveola, is inhomogeneous. Thus, finding spatially-tuned sensitivity to the probes may simply indicate global feature gain on top of already spatially tuned processing in the fovea. Another possible explanation that is consistent with the "no enhancement" interpretation is that the fovea has increased. This is consistent with the observation that the congruency effects were aligned to the center of gaze and not the saccade endpoint. It looks from the Gaussian fits that a single gain parameter would explain the difference in the shape of the congruent and incongruent hit rates, but I could not figure out if this was explicitly tested from the existing methods. Additional experiments without prepared saccades would be an easy way to address this issue. Is the hit rate tuned when there is no saccade preparation? If so, it seems likely that the spatial selectivity is not tuned feedback, but inhomogeneous feedforward processing.

      We fully agree. We do not consider a fixation condition diagnostic to resolve this question since, as of now, correlates of foveal feedback have exclusively been observed during fixation. In those studies, it was suggested that the effect, i.e., a foveal representation of peripheral stimuli, reflects the automatic preparation of an eye movement that was simply not executed[11,12,14]. To address another reviewer’s comment, we collected additional data in a fixation experiment. The probe stimulus could exclusively appear in the screen center (as in Experiment 1) and observers maintained fixation throughout the trial. While pre-saccadic congruency effects were significantly more pronounced and developed faster, congruency effects did emerge during fixation when the probe appeared 200 ms after the target. If pre-saccadic processes indeed spill over to fixation tasks to some extent and trigger relevant neural mechanisms even when no saccade is executed, we could expect a similar feedback-induced spatial profile during fixation. Since this matches the reviewer’s prediction if the pre-saccadic profiles resulted from inhomogeneous feedforward processing, we do not consider a fixation condition suitable to distinguish between both hypotheses.

      To test whether the tuning of enhancement is effectively a consequence of declining visual performance in the parafovea/periphery, we instead raised parafoveal performance to a foveal level by adaptively increasing the opacity of the probe: while leaving all remaining experimental parameters unchanged, we presented the probe in one of two parafoveal locations, i.e., 3 dva to the left or right of the screen center. Observers were explicitly informed about the placement of the probe. We administered a staircase procedure to determine the probe opacity at which performance for parafoveal target-incongruent probes would be just as high as foveal performance had been in the preceding sessions. While the foveal probe was presented at a median opacity of 28.3±7.6%, a parafoveal opacity of 39.0±11.1% was required to achieve the same performance level. As a result, the gray dot at 0 dva in the figure below represents the incongruent HR in the center of gaze and ranges at 80% on the y-axis. The gray dots at ±3 dva represent incongruent parafoveal HRs and also range at ~80% on the y-axis. Using the reviewer’s terminology, we effectively removed the influence of acuity- (or contrast-sensitivity-) dependent spatial tuning. If the spatial profiles had indeed been the result of “global feature gain on top of already spatially tuned processing”, this manipulation should render parafoveal feature gain just as detectable as foveal feature gain. Instead, congruent and incongruent parafoveal HRs were statistically indistinguishable (away from the saccade target: p = .127, BF10 = 0.531; towards the saccade target: p = .336, BF10 = 0.352), inconsistent with the idea of a spatially global feature gain.

      We had included these data in our initial submission. They were collected in the same observers that contributed the spatial profiles (Experiment 2). The data points at 0 dva in the reduced figure above correspond to the foveal probe location in Figure 2D. The data points at ±3 dva had been plotted and discussed in our initial submission, yet only very briefly. Based on this and another reviewer’s comment, we realize that we should have explained this condition more extensively in the main text rather than in the Methods and have added a dedicated paragraph to the Results section.

      This paper is important because it compellingly demonstrates that visual processing in the fovea anticipates what is coming once the eyes move. The exact form of the modulation remains unclear and the authors could do more to support their interpretations. However, understanding this type of active and predictive processing is a part of the puzzle of how sensory systems work in concert with motor behavior to serve the goals of the organism.

      Reviewer #3 (Public Review):

      This manuscript examines one important and at the same time little investigated question in vision science: what happens to the processing of the foveal input right before the onset of a saccade. This is clearly something of relevance as humans perform saccades about 3 times every second. Whereas what happens to visual perception in the visual periphery at the saccade goal is well characterized, little is known about what happens at the very center of gaze, which represents the future retinal location where the saccade target will be viewed at high resolution upon landing. To address this problem the authors implemented an elegant experiment in which they probed foveal vision at different times before the onset of the saccade by using a target, with the same or different orientation with respect to the stimulus at the saccade goal, embedded in dynamic noise. The authors show that foveal processing of the saccade target is initiated before saccade execution, resulting in the visual system being more sensitive to foveal stimuli whose features match those of the stimuli at the saccade goal. According to the authors, this process enables a smooth transition of visual perception before and after the saccade. The experiment is well designed and the results are solid. Overall, I think this work represents a valuable contribution to the field and its results have important implications. My comments below:

      1. The change in the overall performance between the baseline condition and when the probe is presented after the saccade target is large, but I wonder if there are other unrelated factors that contribute to this difference, for example, simply presenting the probe after vs before the onset of a peripheral stimulus, or the fact that in the baseline the probe is presented right after a fixation marker, but in the other condition there was a longer time interval between the presentation of the marker and the probe transient. The authors should discuss how these confounding factors have been accounted for.

      We thank the reviewer for this helpful comment. We would like to clarify that the probe was never presented right after the fixation dot. In the baseline condition, fixation dot and target were separated by 50 ms, i.e., the duration of one noise image. Since the fixation dot was an order of magnitude smaller than the probe (0.3 vs 3 dva in diameter) and since two large-field visual transients caused by the onset of a new background noise image occurred between fixation dot disappearance and probe appearance, we consider it unlikely that the performance difference was caused by any kind of stimulus interaction such as masking. Nonetheless, we had been puzzled by this difference already when inspecting preliminary results and wondered if it may reflect observers’ temporal expectations about the trial sequence. We therefore explicitly instructed and repeatedly reminded observers that the probe could appear before the peripheral target. Since the difference persisted, we ascribed it to a predictive remapping of attention to the fovea during saccade preparation, as we had stated in the Discussion.

      Another contributing factor may be that observers approached the oculomotor and perceptual detection tasks sequentially. In early trial phases, they may have prioritized localizing the target and programming the eye movement. After motor planning had been initiated, resources may have been freed up for the foveal detection task. Since on the majority of probe-present trials, the probe appeared after the saccade target, this strategy would have been mostly adaptive. Crucially, however, observers yielded similar incongruent Hit Rates in the baseline and last pre-saccadic time bin (70% vs 74%). While we observed pronounced enhancement in the last pre-saccadic bin, congruent and incongruent Hit Rates in the baseline bin were virtually identical. We therefore conclude that lower overall performance in the baseline bin did not prevent congruency effects from occurring. Instead, congruency effects started developing only after target appearance. We have added this potential explanation to the Results.

      1. Somewhat related to point 3, the authors conclude that the effects reported here are the result of saccade preparation/execution; however, a control condition in which the saccade is not performed is missing. This leaves me wondering whether the effect is only present during saccade preparation or if it may also be present to some extent or to its full extent when covert attention is engaged, i.e., when subjects perform the same task without making a saccade.

      Foveal feedback has, as of now, exclusively been demonstrated during fixation (see references in Introduction and Discussion). In most of these studies, it was suggested that these effects (i.e., the foveal representation of a peripheral stimulus) may reflect the automatic preparation of an eye movement that was simply not executed[11,12,14]. Since foveal feedback has been demonstrated during fixation, and since eye movement preparation may influence foveal processing even when the eyes remain stationary, we considered it likely that congruency effects would emerge during fixation. Nonetheless, we agree with the reviewer that an explicit comparison between saccade preparation and fixation would enrich our data set and allow for stronger conclusions. We therefore collected additional data from seven observers. While all remaining experimental parameters were identical to Experiment 1, observers maintained fixation throughout each trial. We found that pre-saccadic foveal enhancement was more pronounced and emerged earlier than foveal enhancement during fixation. We present these data in the Results section (Figure 5) and have updated the Methods section to incorporate this additional experiment. We have furthermore added a paragraph to the Discussion which addresses potential mechanisms of foveal enhancement during fixation and saccade preparation.

      Furthermore, the reviewer’s comment helped us realize that we never stated a crucial part of our motivation explicitly. We now do so in the Introduction:

      “Despite the theoretical usefulness of such a mechanism, there are reasons to assume that foveal feedback may break down while an eye movement is prepared to a different visual field location. First and foremost, saccade preparation is accompanied with an obligatory shift of attention to the saccade target[6-8] which in turn has been shown to decrease foveal sensitivity[9]. Moreover, the execution of a rapid eye movement induces brief motion signals on the retina[20] which may mask or in other ways interfere with the pre-saccadic prediction signal. On a more conceptual level, the recruitment of foveal processing as an ‘active blackboard’[21] may become obsolete in the face of an imminent foveation of relevant peripheral stimuli – unless, of course, foveal processing serves the establishment of trans-saccadic visual continuity.”

      We believe that the additional data and the revisions to the Introduction and Discussion have strengthened our manuscript and thank the reviewer for this comment.

      1. Unlike other tasks addressing pre-saccadic perception in the literature, here subjects do not have to discriminate the peripheral stimulus at the saccade goal, and most processing resources are presumably focused at the foveal location. Could this have influenced the results reported here?

      This is true. We intentionally made the features of the peripheral target as task-irrelevant as possible, contrary to previous investigations. We wanted to ensure that the enhancement we find would be automatic and not induced by a peripheral discrimination task, as we state in the Discussion and the Methods. We agree that the foveal detection task likely focused processing resources on the center of gaze in Experiment 1. In Experiment 2, however, we measured the spatial profile of enhancement which involved two different conditions:

      1. In each observer’s first six sessions, the probe could be presented anywhere on a horizontal axis of 9 dva length. On a given trial, an observer could not predict where it would appear, and therefore could not strategically allocate their attention. Nonetheless, enhancement of target-congruent orientation information was tuned to the fovea.
      2. In the final, seventh session, the probe appeared exclusively in one of two possible peripheral locations: 3 dva to the left or 3 dva to the right of the screen center. Observers were explicitly informed that the probe would never appear foveally, and processing resources should therefore have been allocated to the peripheral probe locations. The general performance level in this condition was comparable to performance in the fovea (see reply to the next comment). Nonetheless, we did not find peripheral enhancement of target-congruent information.

      Importantly, the magnitude of the foveal congruency effect in the PRE-only condition of Experiment 1 (i.e., when the target disappeared before the eyes landed on it) was comparable to the foveal congruency effect in Experiment 2 (PRE-only throughout), suggesting that the format of the task – i.e., purely foveal detection or foveal and peripheral detection – did not alter our findings.

      1. The spatial profile of the enhancement is very interesting and it clearly shows that the enhancement is limited to a central region. To what extent is this profile influenced by the fact that the probe was presented at larger eccentricities and therefore was less visible at 4.5 deg than it was at 0 deg? According to the caption, when the probe was presented more eccentrically the performance was raised to a foveal level by adaptively increasing probe transparency. This is not clear: was this done separately based on performance at baseline? Does this mean that the contrast of the stimulus was different for the points at ±3 dva but the performance was comparable at baseline? Please explain.

      Based on the previous comment and comments of Reviewer #2, we realize that we should have explained this condition more extensively in the main text rather than in the Methods and have adapted the manuscript accordingly. As stated in our reply to the previous comment, Experiment 2 involved one session in which we addressed whether the lack of parafoveal/peripheral enhancement could be due to a simple decrease in acuity as mentioned by the reviewer. Observers were explicitly informed that the to-be-detected stimulus (the probe) would appear either 3 dva to the left or right but never in the screen center and were shown slowed-down example trials for illustration. Observers then performed a staircase procedure which was targeted at determining the probe contrast at which performance for parafoveal target-incongruent probes would be just as high as foveal performance for target-incongruent probes had been in the previous six sessions. While the foveal probe was presented at a median opacity of 28.3±7.6%, an opacity of 39.0±11.1% was required to achieve the same performance level at a 3 dva eccentricity. Therefore, the gray curve in Figure 2D that represents incongruent Hits reaches its peak just under 80% on the y-axis. The gray dots at ±3 dva also range at ~80% on the y-axis. The performance level for target-incongruent probes (‘baseline’ here) in the parafovea is thus equal to foveal performance for target-incongruent probes. Target-congruent parafoveal feature information had the same “chance” to be enhanced as foveal information in the preceding sessions. Despite an equation of performance, we found no parafoveal enhancement. This suggests that enhancement is a true consequence of visual field location and not simply mediated by visual acuity at that location.

      1. The enhancement is significant within a region of 6.4 dva around the center of gaze. This is a rather large region, especially considering that it extends also in the direction opposite to the saccade. I was expecting the enhancement to be more confined to the central foveal region. Was the effect shown in Figure 2D influenced by the fact that saccades in this task were characterized by a large undershoot (Fig 1 D)? Did the effect change if only saccades landing closer to the target were included in the analysis? There may not be enough data for resolving the time course, but maybe there are differences in the size of the main effect.

      Width of the profile: In general, the width of the enhancement profile is likely to be influenced by two experimental/analysis choices: the size of the probe stimulus presented during the experiment and the width of the moving window combining adjacent probe locations for analysis.

      Probe size: Since the probe itself had a comparably large diameter of 3 dva, even the leftmost significant point at -2.6 dva could be explained by an enhancement of the foveal portion of the probe. We had mentioned this briefly in the Discussion but realize that this point is crucial and should be made more explicit.

      Moving window width: We designed the experiment with the intention to densely sample a range of spatial locations during data collection and combine a certain number of adjacent locations using a moving window during analysis (see preregistration: https://osf.io/6s24m). To ensure the reliability of every data point, the width of this window was chosen based on how many trials were lost during preprocessing. We chose a window width of 7 locations as this ensured that each data point contained at least 30 trials on an individual-observer level. Nonetheless, the width of the resulting enhancement profile depends on the width of the moving window:

      We added these caveats to the Results section and incorporated the figure above into the Supplements. We now state explicitly that…

      “the main conclusions that can be drawn are that enhancement i) peaks in the center of gaze, ii) is not uniform throughout the tested spatial range as, for instance, global feature-based attention would predict, and iii) is asymmetrical, extending further towards the saccade target than away from it.”

      For the above reasons, the absolute width of the profile should be interpreted with caution.
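The dependence of profile width on window width can be illustrated with a minimal sketch of moving-window pooling. The pooling rule (summing hits and trials across adjacent locations), the function name, and the toy counts are assumptions made for illustration; the authors' actual implementation may differ, for example in how window edges are handled.

```python
def pooled_hit_rates(hits, trials, width):
    """Pool hits and trials over `width` adjacent probe locations.

    Returns one hit rate per window position (len(hits) - width + 1 values).
    Hypothetical helper: windows are not padded at the edges.
    """
    if len(hits) != len(trials):
        raise ValueError("hits and trials must have the same length")
    if not 1 <= width <= len(hits):
        raise ValueError("width must be between 1 and the number of locations")
    rates = []
    for start in range(len(hits) - width + 1):
        n_hits = sum(hits[start:start + width])
        n_trials = sum(trials[start:start + width])
        rates.append(n_hits / n_trials if n_trials else float("nan"))
    return rates


# Toy counts (not data from the study): elevated hit rate at one location only.
hits = [5, 5, 5, 9, 5, 5, 5]
trials = [10] * 7

narrow = pooled_hit_rates(hits, trials, width=1)
wide = pooled_hit_rates(hits, trials, width=3)
```

With these toy counts, a single elevated location appears as one elevated point at width 1 but as three elevated window positions at width 3. This is why the absolute width of a pooled profile is hard to interpret on its own.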

      Saccadic landing accuracy: To address the reviewer’s question, we inspected the spatial enhancement profile separately for trials in which the saccade landed on the target (i.e., within a radius of 1.5 dva from its center) or off-target but still within the accepted landing area. This trial separation criterion, besides appearing meaningful, ensured that all observers contributed trials to every data point. We had never resolved the time course in this experiment and could therefore not collapse across time points as suggested by the reviewer. To increase the number of trials per data point, we instead increased the width of the moving window sliding across locations from 6 to 9 neighboring locations (but see caveat above).

      Considering only saccades that landed on the target (‘accurate’; A) yielded significant enhancement from -2.6 to 2.1 dva and from 3.2 dva throughout the measured range towards the saccade target. Saccades that landed off-target (‘inaccurate’; B) showed a more pronounced asymmetry. When only considering inaccurate saccades, enhancement reached significance between -1.1 and 4.4 dva.

      The increased asymmetry for inaccurate saccades may be related to predictive remapping: since inaccurate saccades were hypometric on average, the predictively remapped location of the target was shifted towards the target by the magnitude of the undershoot. Asymmetric enhancement would therefore have boosted congruency at the remapped target location across all trials. In consequence, we inspected if aligning probe locations to the remapped target location on an individual-trial level would lead to a narrower profile for inaccurate saccades. This was not the case. Instead, we observed two parafoveal maxima (C). Their position on the x-axis equals the mean remapping-dependent leftwards (2.0 dva) and rightwards (1.9 dva) displacement across trials. In other words, they correspond to the pre-saccadic center of gaze. Note that these profiles could not be fitted with a mixture of Gaussians and were fitted using polynomials instead.  

      In sum, while we do not observe a clear narrowing of the enhancement profile for accurate saccades, the profile’s asymmetry is more pronounced for inaccurate eye movements. An increase in asymmetry could bear functional advantages since it would boost congruency at the remapped target location across all trials. Importantly though, this adjustment seems to rely on an estimate of average rather than single-trial saccade characteristics: aligning probe locations to the remapped attentional locus on an individual trial level provides further evidence that, irrespective of individual saccade endpoints, enhancement was aligned to the fovea. We have added these analyses to the Results section (Figure 3). We have also added the remapped profiles for all saccades and accurate saccades only to the Supplements.

      1. Is the size of the enhanced region around the center of gaze related to the precision of saccades? Presumably, if saccades are less precise a larger enhanced area may be more beneficial.

      This is a very interesting point. To address this question, we estimated each observer’s saccadic precision by computing bivariate kernel densities from their saccade landing coordinates. As we measured the horizontal extent of enhancement in our experiment, we defined the horizontal bandwidth as an estimate of saccadic imprecision. To estimate the size of the enhanced region for each observer, we created 10,000 bootstrapping samples for each observer’s congruent and incongruent HRs (4 locations combined at each step). We then determined the difference between the bootstrapped congruent and incongruent HRs and defined significantly enhanced locations as all locations for which <= 5% of these differences fell below zero. We then defined the width of the enhancement profile as the maximum number of consecutive significant locations.
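
      The bootstrapping procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' analysis code: the function names, the per-location data layout (one array of 0/1 hits per probe location), and the omission of the 4-location sliding window are all assumptions made for brevity.

```python
import numpy as np

def bootstrap_hit_rate(hits, n_boot, rng):
    """Resample trials with replacement; return n_boot bootstrapped hit rates."""
    hits = np.asarray(hits, dtype=float)
    idx = rng.integers(0, hits.size, size=(n_boot, hits.size))
    return hits[idx].mean(axis=1)

def longest_run(mask):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in mask:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def enhanced_width(congruent, incongruent, n_boot=10_000, alpha=0.05, seed=0):
    """congruent / incongruent: one array of 0/1 hits per probe location
    (hypothetical layout). A location counts as enhanced when at most
    `alpha` of the bootstrapped congruent-minus-incongruent differences
    fall below zero."""
    rng = np.random.default_rng(seed)
    significant = []
    for hits_c, hits_i in zip(congruent, incongruent):
        diff = (bootstrap_hit_rate(hits_c, n_boot, rng)
                - bootstrap_hit_rate(hits_i, n_boot, rng))
        significant.append(np.mean(diff < 0) <= alpha)
    return significant, longest_run(significant)
```

      The width returned here is a count of adjacent probe locations; converting it to dva would require the spacing between locations, which is not stated in this response.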

      Instead of a positive correlation, we observed a negative correlation between the bandwidth of landing coordinates (i.e., saccadic imprecision) and the size of the enhanced window (r = -.56, p = .117). In other words, there was a non-significant tendency that the less precise an observer’s saccades, the narrower their estimated region of enhancement. We furthermore inspected the magnitude of enhancement per position within the enhanced region. To do so, we computed the mean difference between congruent and incongruent HR across all positions in the enhanced region. The sizes of the orange circles in the figure above represent the resulting values (ranging from 2.9% to 13.3%). As saccadic precision decreases, the magnitude of enhancement per data point in the enhanced region tends to decrease as well. We therefore suggest that high saccadic precision is a sign of efficient oculomotor programming, which in turn allows peri-saccadic perceptual processes to operate more effectively. We added this analysis to the Supplements and refer to it in the Results section of the revised manuscript.

    1. Author response:

      Reviewer #1 (Public Review):

      This paper proposes a novel framework for explaining patterns of generalization of force field learning to novel limb configurations. The paper considers three potential coordinate systems: cartesian, joint-based, and object-based. The authors propose a model in which the forces predicted under these different coordinate frames are combined according to the expected variability of produced forces. The authors show, across a range of changes in arm configurations, that the generalization of a specific force field is quite well accounted for by the model.

      The paper is well-written and the experimental data are very clear. The patterns of generalization exhibited by participants - the key aspect of the behavior that the model seeks to explain - are clear and consistent across participants. The paper clearly illustrates the importance of considering multiple coordinate frames for generalization, building on previous work by Berniker and colleagues (JNeurophys, 2014). The specific model proposed in this paper is parsimonious, but there remain a number of questions about its conceptual premises and the extent to which its predictions improve upon alternative models.

      A major concern is with the model's premise. It is loosely inspired by cue integration theory but is really proposed in a fairly ad hoc manner, and not really concretely founded on firm underlying principles. It's by no means clear that the logic from cue integration can be extrapolated to the case of combining different possible patterns of generalization. I think there may in fact be a fundamental problem in treating this control problem as a cue-integration problem. In classic cue integration theory, the various cues are assumed to be independent observations of a single underlying variable. In this generalization setting, however, the different generalization patterns are NOT independent; if one is true, then the others must inevitably not be. For this reason, I don't believe that the proposed model can really be thought of as a normative or rational model (hence why I describe it as 'ad hoc'). That's not to say it may not ultimately be correct, but I think the conceptual justification for the model needs to be laid out much more clearly, rather than simply by alluding to cue-integration theory and using terms like 'reliability' throughout.

      We thank the reviewer for bringing up this point. We see and treat this problem of finding the combination weights not as a cue integration problem but as an inverse optimal control problem. In this case, there can be several solutions to the same problem, i.e., what forces are expected in untrained areas, which can co-exist and give the motor system the option to switch or combine them. This is similar to other inverse optimal control problems, e.g. combining feedforward optimal control models to explain simple reaching. However, compared to these problems, which fit the weights between different models, we proposed an explanation for the underlying principle that sets these weights for the dynamics representation problem. We found that basing the combination on each motor plan's reliability can best explain the results. In this case, we refer to ‘reliability’ as execution reliability and not sensory reliability, which is common in cue integration theory. We have added further details explaining this in the manuscript.

      “We hypothesize that this inconsistency in results can be explained using a framework inspired by an inverse optimal control framework. In this framework the motor system can switch or combine between different solutions. That is, the motor system assigns different weights to each solution and calculates a weighted sum of these solutions. Usually, to support such a framework, previous studies found the weights by fitting the weighted sum solution to behavioral data (Berret, Chiovetto et al. 2011). While we treat the problem in the same manner, we propose the Reliable Dynamics Representation (Re-Dyn) mechanism that determines the weights instead of fitting them. According to our framework, the weights are calculated by considering the reliability of each representation during dynamic generalization. That is, the motor system prefers certain representations if the execution of forces based on this representation is more robust to distortion arising from neural noise. In this process, the motor system estimates the difference between the desired generalized forces and generated generalized forces while taking into consideration noise added to the state variables that equivalently define the forces.”

      A more rational model might be based on Bayesian decision theory. Under such a model, the motor system would select motor commands that minimize some expected loss, averaging over the various possible underlying 'true' coordinate systems in which to generalize. It's not entirely clear without developing the theory a bit exactly how the proposed noise-based theory might deviate from such a Bayesian model. But the paper should more clearly explain the principles/assumptions of the proposed noise-based model and should emphasize how the model parallels (or deviates from) Bayesian-decision-theory-type models.

      As we understand the reviewer's suggestion, the idea is to estimate the weight of each coordinate system based on minimizing a loss function that considers the cost of each weight multiplied by a posterior probability that represents the uncertainty in this weight value. While this is an interesting idea, we believe that in the current problem, there are no ‘true’ weight values. That is, the motor system can use any combination of weights which will be true due to the ambiguous nature of the environment. Since the force field was presented in one area of the entire workspace, there is no observation that will allow us to update prior beliefs regarding the force nature of the environment. In such a case, the prior beliefs might play a role in the loss function, but in our opinion, there is no clear rationale for choosing unequal priors except guessing or fitting prior probabilities, which will resemble any other previous models that used fitting rather than predictions.

      Another significant weakness is that it's not clear how closely the weighting of the different coordinate frames needs to match the model predictions in order to recover the observed generalization patterns. Given that the weighting for a given movement direction is over-parametrized (i.e. there are 3 variable weights (allowing for decay) predicting a single observed force level), it seems that a broad range of models could generate a reasonable prediction. It would be helpful to compare the predictions using the weighting suggested by the model with the predictions using alternative weightings, e.g. a uniform weighting, or the weighting for a different posture. In fact, Fig. 7 shows that uniform weighting accounts for the data just as well as the noise-based model in which the weighting varies substantially across directions. A more comprehensive analysis comparing the proposed noise-based weightings to alternative weightings would be helpful to more convincingly argue for the specificity of the noise-based predictions being necessary. The analysis in the appendix was not that clearly described, but seemed to compare various potential fitted mixtures of coordinate frames, but did not compare these to the noise-based model predictions.

      We agree with the reviewer that fitted global weights, that is, an optimal weighted average of the three coordinate systems, should outperform most of the models that are based on prediction instead of fitting the data. As we showed in Figure 7 of the submitted version of the manuscript, we used the optimal fitted model to show that our noise-based model is indeed not optimal but can predict the behavioral results and not fall too short of a fitted model. When trying to fit a model across all the reported experiments, we indeed found a set of values that gives equal weights for the joints and object coordinate systems (0.27 for both), and a lower value for the Cartesian coordinate system (0.12). Considering these values, we indeed see how the reviewer can suggest a model that is based on equal weights across all coordinate systems. While this model will not perform as well as the fitted model, it can still generate satisfactory results.

      To better understand if a model based on global weights can explain the combination between coordinate systems, we perform an additional experiment. In this experiment, a model that is based on global fitted weights can only predict one out of two possible generalization patterns while models that are based on individual direction-predicted weights can predict a variety of generalization patterns. We show that global weights, although fitted to the data, cannot explain participants' behavior. We report these new results in Appendix 2.

      “To better understand if a model based on global weights can explain the combination between coordinate systems, we perform an additional experiment. We used the idea of experiment 3 in which participants generalize learned dynamics using a tool. That is, the arm posture does not change between the training and test areas. In such a case, the Cartesian and joint coordinate systems do not predict a shift in generalized force pattern while the object coordinate system predicts a shift that depends on the orientation of the tool. In this additional experiment, we set a test workspace in which the orientation of the tool is 90° (Appendix 2- figure 1A). In this case, for the test workspace, the force compensation pattern of the object based coordinate system is in anti-phase with the Cartesian/joint generalization pattern. Any globally fitted weights (including equal weights) can produce either a non-shifted or 90° shifted force compensation pattern (Appendix 2- figure 1B). Participants in this experiment (n=7) showed similar MPE reduction as in all previous experiments when adapting to the trigonometric scaled force field (Appendix 2- figure 1C). When examining the generalized force compensation patterns, we observed a shift of the pattern in the test workspace of 14.6° (Appendix 2- figure 1D). This cannot be explained by the individual coordinate system force compensation patterns or any combination of them (which will always predict either a 0° or 90° shift, Appendix 2- figure 1E). However, calculating the prediction of the Re-Dyn model we found a predicted force compensation pattern with a shift of 6.4° (Appendix 2- figure 1F). The intermediate shift in the force compensation pattern suggests that any global based weights cannot explain the results.”
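
      The reason a globally weighted sum can only yield a non-shifted or a fully shifted pattern can be written out in a short derivation. This is a sketch under stated assumptions rather than the manuscript's own equations: $F(\theta)$ denotes the pattern shared by the Cartesian and joint predictions, the object-based prediction is taken to be exactly its negative (anti-phase), decay is ignored, and the weights $w_c, w_j, w_o$ are non-negative constants that do not depend on movement direction $\theta$.

```latex
\begin{aligned}
F_{c}(\theta) &= F_{j}(\theta) = F(\theta), \qquad F_{o}(\theta) = -F(\theta) \\
F_{\text{global}}(\theta) &= w_{c}F_{c}(\theta) + w_{j}F_{j}(\theta) + w_{o}F_{o}(\theta)
  = (w_{c} + w_{j} - w_{o})\,F(\theta)
\end{aligned}
```

      Under these assumptions the combined pattern is the original pattern multiplied by a single scalar: a positive scalar leaves the phase unchanged, a negative one flips it into anti-phase, and no choice of direction-independent weights produces an intermediate shift such as the observed 14.6°.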

      With regard to the suggestion that weighting is changed according to arm posture, two of our results lower the possibility that posture governs the weights:

      (1) In experiment 3, we tested generalization while keeping the same arm posture between the training and test workspaces, and we observed different force compensation profiles across the movement directions. If arm posture in the test workspaces affected the weights, we would expect identical weights for both test workspaces. However, any set of weights that can explain the results observed for workspace 1 will fail to explain the results observed in workspace 2. To better understand this point we calculated the global weights for each test workspace for this experiment and we observed an increase in the weight for the object coordinate system (0.41 vs. 0.5) and a reduction in the weights for the Cartesian and joint coordinate systems (0.29 vs. 0.24). This suggests that the arm posture cannot explain the generalization pattern in this case.

      (2) In experiments 2 and 3, we used the same arm posture in the training workspace and either changed the arm posture (experiment 2) or did not change the arm posture (experiment 3) in the test workspaces. While the arm posture for the training workspace was the same, the force generalization patterns were different between the two experiments, suggesting that the arm posture during the training phase (adaptation) does not set the generalization weights.

      Overall, this shows that it is not specifically the arm posture in either the test or the training workspaces that sets the weights. Of course, all coordinate models, including our noise model, will consider posture in the determination of the weights.

      Reviewer #2 (Public Review):

      Leib & Franklin assessed how the adaptation of intersegmental dynamics of the arm generalizes to changes in different factors: areas of extrinsic space, limb configurations, and 'object-based' coordinates. Participants reached in many different directions around 360°, adapting to velocity-dependent curl fields that varied depending on the reach angle. This learning was measured via the pattern of forces expressed upon the channel wall of "error clamps" that were randomly sampled from each of these different directions. The authors employed a clever method to predict how this pattern of forces should change if the set of targets was moved around the workspace. Some sets of locations resulted in a large change in joint angles or object-based coordinates, but Cartesian coordinates were always the same. Across three separate experiments, the observed shifts in the generalized force pattern never corresponded to a change that was made relative to any one reference frame. Instead, the authors found that the observed pattern of forces could be explained by a weighted combination of the change in Cartesian, joint, and object-based coordinates across test and training contexts.

      In general, I believe the authors make a good argument for this specific mixed weighting of different contexts. I have a few questions that I hope are easily addressed.

      Movements show different biases relative to the reach direction. Although very similar across people, this function of biases shifts when the arm is moved around the workspace (Ghilardi, Gordon, and Ghez, 1995). The origin of these biases is thought to arise from several factors that would change across the different test and training workspaces employed here (Vindras & Viviani, 2005). My concern is that the baseline biases in these different contexts are different, and that the observed change in the force pattern across contexts is therefore not a function of generalization but a change in underlying biases. Baseline force channel measurements were taken in the different workspace locations and conditions, so these could be used to show whether such biases are meaningfully affecting the results.

      We agree with the reviewer and we followed their suggested analysis. In the following figure (Author response image 1) we plotted the baseline force compensation profiles in each workspace for each of the four experiments. As can be seen in this figure, the baseline force compensation is very close to zero and differs significantly from the force compensation profiles after adaptation to the scaled force field.

      Author response image 1.

      Baseline force compensation levels for experiments 1-4. For each experiment, we plotted the force compensation for the training, test 1, and test 2 workspaces.

      Experiment 3, Test 1 has data that seems the worst fit with the overall story. I thought this might be an issue, but this is also the test set for a potentially awkwardly long arm. My understanding of the object-based coordinate system is that it's primarily a function of the wrist angle, or perceived angle, so I am a little confused why the length of this stick is also different across the conditions instead of just a different angle. Could the length be why this data looks a little odd?

      Usually, force generalization is tested by physically moving the hand in unexplored areas. In experiment 3 we tested generalization using a tool which, as far as we know, was not tested in the past in a similar way to the present experiment. Indeed, the results look odd compared to the results of the other experiments, which were based on the ‘classic’ generalization idea. While we have some ideas regarding possible reasons for the observed behavior, it is out of the scope of the current work and still needs further examination.

      Based on the reviewer’s comment, we improved the explanation in the introduction regarding the idea behind the object-based coordinate system:

      “we could represent the forces as belonging to the hand or a hand-held object using the orientation vector connecting the shoulder and the object or hand in space (Berniker, Franklin et al. 2014).”

      The reviewer is right in their observation that the predictions of the object-based reference frame will look the same if we change the length of the tool. The object-based generalized forces, specifically the shift in the force pattern, depend only on the object's orientation but not its length (equation 4).

      The manuscript is written and organized in a way that focuses heavily on the noise element of the model. Other than it being reasonable to add noise to a model, it's not clear to me that the noise is adding anything specific. It seems like the model makes predictions based on how many specific components have been rotated in the different test conditions. I fear I'm just being dense, but it would be helpful to clarify whether the noise itself (and inverse variance estimation) are critical to why the model weights each reference frame how it does or whether this is just a method for scaling the weight by how much the joints or whatever have changed. It seems clear that this noise model is better than weighting by energy and smoothness.

      We have now included further details of the noise model and added to Figure 1 to highlight how noise can affect the predicted weights. In short, we agree with the reviewer that there are multiple ways to add noise to the generalized force patterns. We chose a simple option in which we simulate possible distortions to the state variables that set the direction of movement. Once we calculated the variance of the force profile due to this distortion, one possible way is to combine them using an inverse variance estimator. Note that it has been shown that an inverse variance estimator is an ideal way to combine signals (e.g., Shahar, D.J. (2017) https://doi.org/10.4236/ojs.2017.72017). However, as we suggest, we do not claim or try to provide evidence for this specific way of calculating the weights. Instead, we suggest that giving greater weight to the less variable force representation can predict both the current experimental results as well as past results.
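
      For readers unfamiliar with inverse-variance weighting, the combination step can be sketched as below. This is only an illustration of the general estimator: the function names and the array layout are hypothetical, and the variances are taken as given rather than derived from the simulated state-variable noise used in the manuscript.

```python
import numpy as np

def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalised to sum to one."""
    variances = np.asarray(variances, dtype=float)
    precision = 1.0 / variances
    return precision / precision.sum()

def combine_forces(forces, variances):
    """forces: array of shape (n_representations, n_dims), one predicted
    force per coordinate system (hypothetical layout).
    Returns the inverse-variance weighted sum of the predictions."""
    weights = inverse_variance_weights(variances)
    return weights @ np.asarray(forces, dtype=float)
```

      Because the variance of each representation can differ from one movement direction to the next, applying this step separately per direction gives direction-specific weights, in contrast to a single global set of fitted weights.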

      Are there any force profiles for individual directions that are predicted to change shape substantially across some of these assorted changes in training and test locations (rather than merely being scaled)? If so, this might provide another test of the hypotheses.

      In experiments 1-3, in which there is a large shift of the force compensation curve, we found directions in which the generalized force was flipped in direction. That is, clockwise force profiles in the training workspace could change into counter-clockwise profiles in the test workspace. For example, in experiment 2, for movement at 157.5° we can see that the force profile was clockwise for the training workspace (with a force compensation value of 0.43) and movement in the same direction was counterclockwise for test workspace 1 (force compensation equal to -0.48). Importantly, we found that the noise-based model could predict this change.

      Author response image 2.

      Results of experiment 2. Force compensation profiles for the training workspace (grey solid line) and test workspace 1 (dark blue solid line). Examining the force nature for the 157.5° direction, we found a change in the applied force by the participants (change from clockwise to counterclockwise forces). This was supported by a change in force compensation value (0.43 vs. -0.48). The noise based model can predict this change as shown by the predicted force compensation profile (green dashed line).

      I don't believe the decay factor that was used to scale the test functions was specified in the text, although I may have just missed this. It would be a good idea to state what this factor is where relevant in the text.

      We added an equation describing the decay factor (new equation 7 in the Methods section) according to this suggestion and Reviewer 1's comment on the same issue.

      Reviewer #3 (Public Review):

      The author proposed the minimum variance principle in the memory representation in addition to two alternative theories of the minimum energy and the maximum smoothness. The strength of this paper is the matching between the prediction data computed from the explicit equation and the behavioral data taken in different conditions. The idea of the weighting of multiple coordinate systems is novel and is also able to reconcile a debate in previous literature.

      The weakness is that although each model is based on an optimization principle, the derivation process is not written in the method section. The authors did not write about how they can derive these weighting factors from these computational principles. Thus, it is not clear whether these weighting factors are relevant to these theories or just hacking methods. Suppose the author argues that this is the result of the minimum variance principle. In that case, the authors should show a process of how to derive these weighting factors as a result of the optimization process to minimize these cost functions.

      The reviewer brings up a very important point regarding the model. As shown below, it is not trivial to derive these weights using an analytical optimization process. We demonstrate one issue with this optimization process.

      The force representation can be written as (similar to equation 6):

      We formulated the problem as minimizing the variance of the force according to the weights w:

      In this case, the variance of the force is the variance-covariance matrix which can be minimized by minimizing the matrix trace:

      We will start by calculating the variance of the force representation in joints coordinate system:

      Here, the force variance is the result of a complex function which includes the joint angles as random variables. Expanding the last expression, although very complex, is still possible. In the resulting expression, some terms require calculating the variance of nested trigonometric functions of the random joint angles, for example:

      In the vast majority of these cases, analytical solutions do not exist. Similar issues can also arise when calculating the variance of complex products of trigonometric functions, such as in the case of multiplication of Jacobians (and inverse Jacobians).

      To overcome this problem, we turned to numerical solutions which simulate the variance due to the different state variables.

      In addition, I am concerned that the proposed model can cancel the property of the coordinate system by the predicted variance, and it can work for any coordinate system, even one that is not used in the human brain. When the applied force is given in Cartesian coordinates, the directionality in the generalization ability of the memory of the force field is characterized by the kinematic relationship (Jacobian) between the Cartesian coordinate and the coordinate of interest (Cartesian, joint, and object) as shown in Equation 3. At the same time, when a displacement (epsilon) is considered in a space and a corresponding displacement is linked with kinematic equations (e.g., joint displacement and hand displacement in 2 joint arms in this paper), the generated variances in different coordinate systems are linked to each other by the kinematic equation (Jacobian). Thus, how a small noise in a certain coordinate system generates the hand force noise (sigma_x, sigma_j, sigma_o) is also characterized by the kinematics (Jacobian). Thus, when the predicted forcefield (F_c, F_j, F_o) was divided by the variance (F_c/sigma_c^2, F_j/sigma_j^2, F_o/sigma_o^2), the directionality of the generalization force which is characterized by the Jacobian is canceled by the directionality of the sigmas which is characterized by the Jacobian. Thus, as it has been read out from Fig*D and E top, the weight in E-top of each coordinate system is always the inverse of the shift of force from the test force by which the directionality of the generalization is always canceled.

      Once this directionality is canceled, no matter how to compute the weighted sum, it can replicate the memorized force. Thus, this model always works to replicate the test force no matter which coordinate system is assumed. Thus, I am suspicious of the falsifiability of this computational model. This model is always true no matter which coordinate system is assumed. Even though they use, for instance, the robot coordinate system, which is directly linked to the participant's hand with the kinematic equation (Jacobian), they can replicate this result. But in this case, the model would be nonsense. The falsifiability of this model was not explicitly written.

      As explained above, calculating the variability of the generalized forces given the random nature of the state variable is a complex function that is not summarized using a Jacobian. Importantly, the model is unable to reproduce or replicate the test force arbitrarily. In fact, we have already shown this (see Appendix 1- figure 1): when we attempt to explain the data with only a single coordinate system (or a combination of two coordinate systems), we are completely unable to replicate the test data despite using this model. For example, in experiment 4, when we don’t use the joint-based coordinate system, the model predicts zero shift of the force compensation pattern while the behavioral data show a shift due to the contribution of the joint coordinate system. Any arbitrary model (similar to the random model we tested, please see the response to Reviewer 1) would be completely unable to recreate the test data. Our model instead makes very specific predictions about the weighting between the three coordinate systems and therefore completely specified force predictions for every possible test posture. We added this point to the Discussion:

      “The results we present here support the idea that the motor system can use multiple representations during adaptation to novel dynamics. Specifically, we suggested that we combine three types of coordinate systems, where each is independent of the other (see Appendix 1- figure 1 for comparison with other combinations). Other combinations that include one or two coordinate systems can explain some of the results but not all of them, suggesting that force representation relies on all three with specific weights that change between generalization scenarios.”

    1. Author Response

      Reviewer #1:

      This is a very timely paper that addresses an important and difficult-to-address question in the decision-making field - the degree to which information leakage can be strategically adapted to optimise decisions in a task-dependent fashion. The authors apply a sophisticated suite of analyses that are appropriate and yield a range of very interesting observations. The paper centres on analyses of one possible model that hinges on certain assumptions about the nature of the decision process for this task which raises questions about whether leak adjustments are the only possible explanation for the current data. I think the conclusions would be greatly strengthened if they were supported by the application and/or simulation of alternative model structures.

      We thank the reviewer for this positive appraisal of our study. We now entirely agree with their central comment about whether leak adjustments are the only (or even the best) explanation for the current data. We hope that the additional modelling sections that we have discussed in response to main comment 1 above have strengthened the paper. We have responded point-by-point to their public review, as this contained their main recommendations for revision.

      The behavioural trends when comparing blocks with frequent versus rare response periods seem difficult to tally with a change in the leak. […] Are there other models that could reproduce such effects? For example, could a model in which the drift rate varies between Rare and Frequent trials do a similar or better job of explaining the data?

      We can see why the reviewer has advocated for a possible change of drift rate (or ‘gain’ applied to sensory evidence) between conditions to explain our behavioural findings. We found, however, that changes in drift rate could elicit qualitatively similar changes in integration kernels to changes in decision threshold:

      Author response image 1.

      Changes in gain applied to incoming sensory evidence (A parameter in model) have similar effects on recovered integration kernels from Ornstein-Uhlenbeck simulation as changes in decision threshold.

      The likely reason for this is that the overall probability of emitting a response at any point in the continuous decision process is determined by the ratio of accumulated evidence to decision threshold. A similar logic applies to effects on reaction times and detection probability (main figure 2): increasing sensory gain/decreasing decision threshold will lead to faster reaction times and increased detection probability during response periods.
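
      The ratio argument can be illustrated with a minimal sketch of a leaky accumulator. This is a simplified, hypothetical version of an Ornstein-Uhlenbeck process rather than the model fitted in the paper: it omits internal noise, so the only input is the stimulus stream, and the parameter names (`gain`, `leak`, `threshold`, `dt`) are chosen for illustration.

```python
import numpy as np

def first_crossing(stimulus, gain, threshold, leak=0.1, dt=0.01):
    """Index of the first sample at which the leaky accumulator reaches
    +/- threshold, or None if it never does."""
    x = 0.0
    for t, s in enumerate(stimulus):
        # Euler step of dx/dt = -leak * x + gain * s (no internal noise).
        x += dt * (-leak * x + gain * s)
        if abs(x) >= threshold:
            return t
    return None

rng = np.random.default_rng(1)
stimulus = 0.5 + rng.standard_normal(2000)  # noisy evidence stream

# Doubling the gain has the same effect as halving the threshold.
t_high_gain = first_crossing(stimulus, gain=2.0, threshold=1.0)
t_low_threshold = first_crossing(stimulus, gain=1.0, threshold=0.5)
```

      In this noise-free sketch the accumulated evidence scales linearly with the gain, so only the ratio of gain to threshold determines when a response is emitted. With internal noise that is not scaled by the gain, the correspondence would be approximate rather than exact.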

      Both parameters may even have a similar effect on ‘false alarms’, because (as the reviewer notes below) false alarms in our paradigm are primarily being driven by the occurrence of stimulus changes as well as internal noise. In fact, the false alarm findings mean it is difficult to fully reconcile all of our behavioural findings in terms of changes in a single set of model parameters in the O-U process. It is possible that other changes not considered within our model (such as expectations of hazard rates of inter-response intervals leading to dynamic thresholds etc.) may have had a strong impact upon the resulting false alarm rates. A full exploration of different variations of the O-U model (with varying urgency signals, hazard rates, etc.) is beyond the scope of this paper.

      For this reason, we have decided in our new modelling section to focus primarily on a single, well-established model (the O-U process) and explore how changes in leak and threshold affect task performance and the resulting integration kernels. We note that this is in line with the suggestion of reviewer #2, who focussed on similar behavioural findings to reviewer #1 but suggested that we look at decision threshold rather than drift rate as our primary focus.

      This ties in to a related query about the nature of the task employed by the authors. Due to the very significant volatility of the stimulus, it seems likely that the participants are not solely making judgments about the presence/absence of coherent motion but also making judgments about its duration (because strong coherent motion frequently occurs in the inter-target intervals). If that is so, then could the Rare condition equate to less evidence because there is an increased probability that an extended period of coherent motion could be an outlier generated from the noise distribution? Note that a drift rate reduction would also be expected to result in fewer hits and slower reaction times, as observed.

      As mentioned above, the rare and frequent targets are indeed matched in terms of the ease with which they can be distinguished from the intervening noise intervals. To confirm this, we directly calculated the variance (across frames) of the motion coherence presented during baseline periods and response periods (until response) in all four conditions:

      Author response image 2.

      The average empirical standard deviation of the stimulus stream presented during each baseline period (‘baseline’) and response period (‘trial’), separated by each of the four conditions (F = frequent response periods, R = rare, L = long response periods, S = short). Data were averaged across all response/baseline periods within the stimuli presented to each participant (each dot = 1 participant). Note that the standard deviation shown here is the standard deviation of motion coherence across frames of sensory evidence. This is smaller than the standard deviation of the generative distribution of ‘step’-changes in the motion coherence (std = 0.5 for baseline and 0.3 for response periods), because motion coherence remains constant for a period after each ‘step’ occurs.
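
      The last point in the caption above can be checked with a small simulation. This is a sketch under stated assumptions: the hold durations (geometric), the period length and the helper name are hypothetical, and any clipping of coherence values is omitted. It only shows that the across-frame standard deviation of a piecewise-constant stream tends to fall below the standard deviation of the distribution that generated the steps.

```python
import numpy as np

def within_period_std(rng, generative_std, n_frames, mean_hold_frames):
    """Std across frames of one piecewise-constant coherence stream."""
    frames = np.empty(n_frames)
    t = 0
    while t < n_frames:
        # Number of frames for which the current coherence level is held.
        hold = rng.geometric(1.0 / mean_hold_frames)
        # Draw a new level and hold it (slicing past the end is truncated).
        frames[t:t + hold] = rng.normal(0.0, generative_std)
        t += hold
    return frames.std()

rng = np.random.default_rng(1)
generative_std = 0.5   # value quoted for baseline periods
stds = [within_period_std(rng, generative_std, n_frames=500, mean_hold_frames=50)
        for _ in range(2000)]
mean_empirical_std = float(np.mean(stds))

print(f"generative std = {generative_std}, mean across-frame std = {mean_empirical_std:.3f}")
```

      Because each period contains only a handful of independent levels, part of the generative variance is absorbed into the period mean, so the across-frame estimate is biased downwards.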

      Some adjustment of the language used when discussing FAs seems merited. If I have understood correctly, the sensory samples encountered by the participants during the inter-response intervals can at times favour a particular alternative just as strongly (or more strongly) than that encountered during the response interval itself. In that sense, the responses are not necessarily real false alarms because the physical evidence itself does not distinguish the target from the non-target. I don't think this invalidates the authors' approach but I think it should be acknowledged and considered in light of the comment above regarding the nature of the decision process employed on this task.

      This is a good point. We hope that the reviewer will allow us to keep the term ‘false alarms’ in the paper, as it does conveniently distinguish responses during baseline periods from those during response periods, but we have sought to clarify the point that the reviewer makes when we first introduce the term.

      “Indeed, participants would occasionally make ‘false alarms’ during baseline periods in which the structure of the preceding noise stream mistakenly convinced them they were in a response period (see Figure 4, below). Indeed, this means that a ‘false alarm’ in our paradigm has a slightly different meaning than in most psychophysics experiments; rather than it referring to participants responding when a stimulus was not present, we use the term to refer to participants responding when there was no shift in the mean signal from baseline.”

      And:

      “The fact that evidence integration kernels naturally arise from false alarms, in the same manner as from correct responses, demonstrates that false alarms were not due to motor noise or other spurious causes. Instead, false alarms were driven by participants treating noise fluctuations during baseline periods as sensory evidence to be integrated across time, and the physical evidence preceding ‘false alarms’ need not even distinguish targets from non-targets.”

      The authors report that preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods. It is not clear what identifies this signal as reflecting motor preparation. Did the authors consider using other effector-selective EEG signatures of motor preparation such as beta-band activity which has been used elsewhere to make inferences about decision bounds? Assuming that this central ERP signal does reflect the decision bounds, the observation that it has a larger amplitude at the response on Rare trials appears to directly contradict the kernel analyses which suggest no difference in the cumulative evidence required to trigger commitment.

      Thanks for this comment. First, we should note that this finding emerged from an agnostic time-domain analysis of the data time-locked to button presses, in which we simply observed that the negative-going potential was greater (more negative) in RARE vs. FREQUENT trials. So it is simply the fact that it precedes each button press that leads us to relate it to motor preparation; nonetheless, we note that Kelly and O’Connell (2013) found similar negative-going potentials at central sensors without applying a CSD transform (as in this study). Like them, we would relate this potential to either the well-established Bereitschaftspotential or the contingent negative variation (CNV).

      We agree that many other studies have focussed on beta-band activity as another measure of motor preparation, and to make inferences about decision bounds. To investigate this, we used a Morlet wavelet transform to examine the time-varying power estimate at a central frequency of 20 Hz (wavelet factor 7). We repeated the convolutional GLM analysis on this time-varying power estimate.
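
      For readers unfamiliar with the approach, a time-varying power estimate of this kind can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes that "wavelet factor 7" means a 7-cycle complex Morlet wavelet, and the sampling rate, function name and test signals are hypothetical.

```python
import numpy as np

def morlet_power(signal, sfreq, freq=20.0, n_cycles=7.0):
    """Time-varying power at `freq` via convolution with a complex Morlet wavelet."""
    # Temporal s.d. of the Gaussian envelope for an n_cycles wavelet (assumed convention).
    sigma_t = n_cycles / (2.0 * np.pi * freq)
    t = np.arange(-4.0 * sigma_t, 4.0 * sigma_t, 1.0 / sfreq)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
    wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))   # unit-energy normalisation
    analytic = np.convolve(signal, wavelet, mode="same")
    return np.abs(analytic) ** 2

sfreq = 250.0                                   # hypothetical sampling rate in Hz
time = np.arange(0.0, 2.0, 1.0 / sfreq)
beta_power = morlet_power(np.sin(2 * np.pi * 20.0 * time), sfreq)   # 20 Hz test signal
alpha_power = morlet_power(np.sin(2 * np.pi * 10.0 * time), sfreq)  # 10 Hz test signal

print(beta_power.mean() > alpha_power.mean())
```

      The resulting power time course has the same length as the input, so it can be entered into a regression in place of the raw time-domain signal.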

      We first examined average beta desynchronisation at a central cluster of electrodes (CPz, CP1, CP2, C1, Cz, C2) in the run-up to correct button presses during response periods. We found that a reliable beta desynchronisation occurred, and, just as in the time-domain signal, this reached a greater threshold in the RARE trials than in the FREQUENT trials:

      Author response image 3.

      Beta desynchronisation prior to a correct response is greater over central electrodes in the RARE condition than in the FREQUENT condition.

      We agree with the reviewer that this is likely indicative of a change in decision threshold between rare and frequent trials. We also note that our new computational modelling of the O-U process suggests that this in fact reconciles well with the behavioural findings (changes in integration kernels). We now mention this at the relevant point in the results section:

      “As large changes in mean evidence are less frequent in the RARE condition, the increased neural response to |Δevidence| may reflect the increased statistical surprise associated with the same magnitude of change in evidence in this condition. In addition, when making a correct response, preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods (Figure 7b; p=0.041, cluster-based permutation test). We found similar effects in beta-band desynchronisation prior, averaged over the same electrodes; beta desynchronisation was greater in RARE than FREQUENT response periods. As discussed in the computational modelling section above, this is consistent with the changes in integration kernels between these conditions as it may reflect a change in decision threshold (figure 2d, 3c/d). It is also consistent with the lower detection rates and slower reaction times when response periods are RARE (figure 2 b/c).”

      We did also investigate the lateralised response (left minus right beta-desynchronisation, contrasted on left minus right responses). We found, however, that we were simply unable to detect a reliable lateralised signal in either condition using these lateralised responses. We suspect that this is because we have far fewer response periods than conventional trial-based EEG experiments of decision making, and so we did not have sufficient SNR to reliably detect this signal. This is consistent with standard findings in the literature, which report that the magnitude of the lateralised signal is far smaller than the magnitude of the overall beta desynchronisation (e.g. Doyle et al., 2005).

      P11, the "absolute sensory evidence" regressor elicited a triphasic potential over centroparietal electrodes. The first two phases of this component look to have an occipital focus. The third phase has a more centroparietal focus but appears markedly more posterior than the change in evidence component. This raises the question of whether it is safe to assume that they reflect the same process.

      We agree. We have now referred to this as a ‘triphasic component over occipito-parietal cortex’ rather than centroparietal electrodes.

      Reviewer #2:

      Overall, the authors use a clever experimental design and approach to tackle an important set of questions in the field of decision-making. The manuscript is easy to follow with clear writing. The analyses are well thought-out and generally appropriate for the questions at hand. From these analyses, the authors have a number of intriguing results. So, there is considerable potential and merit in this work. That said, I have a number of important questions and concerns that largely revolve around putting all the pieces together. I describe these below.

      Thanks to the reviewer for their positive appraisal of the manuscript; we are obviously pleased that they found our work to have considerable potential and merit. We seek to address the main comments from their public review and recommendations below.

      1) It is unclear to what extent the decision threshold is changing between subjects and conditions, how that might affect the empirical integration kernel, and how well these two factors can together explain the overall changes in behavior.

      I would expect that less decay in RARE would have led to more false alarms, higher detection rates, and faster RTs unless the decision threshold also increased (or there was some other additional change to the decision process). The CPP for motor preparatory activity reported in Fig. 5 is also potentially consistent with a change in the decision threshold between RARE and FREQUENT. If the decision threshold is changing, how would that affect the empirical integration kernel? These are important questions on their own and also for interpreting the EEG changes.

      This important comment, alongside the comments of reviewer 1 above, made us carefully consider the effects of changes in decision threshold on the evidence integration kernel via simulation. As discussed above (in response to ‘essential revisions for the authors’), we now include an entirely new section on how changes in decision threshold and leak may affect the evidence integration kernel, and be used to optimise performance across the different sensory environments. In particular, we agree with the reviewer that the motor preparatory activity that differs between RARE and FREQUENT is consistent with a change in decision threshold, and our simulations have suggested that our behavioural findings on evidence integration are also consistent with this change as well. These are detailed on pp.1-4 of the rebuttal, above.

      2) The authors find an interesting difference in the CPP for the FREQUENT vs RARE conditions where they also show differences in the decay time constant from the empirical integration kernel. As mentioned above, I'm wondering what else may be different between these conditions. Do the authors have any leverage in addressing whether the decision threshold differs? What about other factors that could be important for explaining the CPP difference between conditions? Big picture, the change in CPP becomes increasingly interesting the more tightly it can be tied to a particular change in the decision process.

      We fully agree with the spirit of this comment, and we’ve tried much more carefully to consider what the influences of decision threshold and leak would be on our behavioural analyses. As discussed in the response to reviewer 1, we think that the negative-going potential at the time of responses (which is greater in RARE vs. FREQUENT, main figure 7b) and the equivalent changes in beta desynchronisation (see Reviewer Response Figure 5 above) are both reflective of a change in decision threshold between RARE and FREQUENT conditions. We have tried to make this link explicit in the revised results section:

      “As large changes in mean evidence are less frequent in the RARE condition, the increased neural response to |Δevidence| may reflect the increased statistical surprise associated with the same magnitude of change in evidence in this condition. In addition, when making a correct response, preparatory motor activity over central electrodes reached a larger decision threshold for RARE vs. FREQUENT response periods (Figure 7b; p=0.041, cluster-based permutation test). We found similar effects in beta-band desynchronisation prior, averaged over the same electrodes; beta desynchronisation was greater in RARE than FREQUENT response periods. As discussed in the computational modelling section above, this is consistent with the changes in integration kernels between these conditions as it may reflect a change in decision threshold (figure 2d, 3c/d). It is also consistent with the lower detection rates and slower reaction times when response periods are RARE (figure 2 b/c).”

      I'll note that I'm also somewhat skeptical of the statements by the authors that large shifts in evidence are less frequent in the RARE compared to FREQUENT conditions (despite the names) - a central part of their interpretation of the associated CPP change. The FREQUENT condition obviously has more frequent deviations from the baseline, but this is countered to some extent by the experimental design that has reduced the standard deviation of the coherence for these response periods. I think a calculation of overall across-time standard deviation of motion coherence between the RARE and FREQUENT conditions is needed to support these statements, and I couldn't find that calculation reported. The authors could easily do this, so I encourage them to check and report it.

      See Author response image 2.

      3) The wide range of decay time constants between subjects and the correlation of this with another component of the CPP is also interesting. However, in trying to interpret this change in CPP, I'm wondering what else might be changing in the inter-subject behavior. For instance, it looks like there could be up to 4 fold changes in false alarm rates. Are there other changes as well? Do these correlate with the CPP? Similar to my point above, the changes in CPP across subjects become increasingly interesting the more tightly it can be tied to a particular difference in subject behavior. So, I would encourage the authors to examine this in more depth.

      Thanks for the interesting suggestion. We explored whether there might be any interindividual correlation in this measure with the false alarm rate across participants, but found that there was no such correlation. (See Author response image 4; plotting conventions are as in main figure 9).

      Author response image 4.

      No evidence of between-subject correlations in CPP responses and false alarm rates, in any of the four conditions.

      We hope instead that the extended discussion of how the integration kernel should be interpreted (in light of computational modelling) provides at least some increased interpretability of the between-subject effects that we report in figure 9.

      Reviewer #3 (Public Review):

      The main strength is in the task design which is novel and provides an interesting approach to studying continuous evidence accumulation. Because of the continuous nature of the task, the authors design new ways to look at behavioral and neural traces of evidence. The reverse-correlation method looking at the average of past coherence signals enables us to characterize the changes in signal leading to a decision bound and its neural correlate. By varying the frequency and length of the so-called response period, that the participants have to identify, the method potentially offers rich opportunities to the wider community to look at various aspects of decision-making under sensory uncertainty.

      We are pleased that the reviewer agrees with our general approach as a novel way of characterising various aspects of decision-making under uncertainty.

      The main weaknesses that I see lie within the description and rigor of the method. The authors refer multiple times to the time constant of the exponential fit to the signal before the decision but do not provide a rigorous method for its calculation and neither a description of the goodness of the fit. The variable names seem to change throughout the text which makes the argumentation confusing to the reader. The figure captions are incomplete and lack clarity.

      We apologise that some of our original submission was difficult to follow in places, and we are very grateful to the reviewer for their thorough suggestions for how this could be improved. We address these in turn below, and we hope that this answers their questions, and has also led to a significant improvement in the description and rigour of the methodology.

    1. Author Response

      Reviewer #2 (Public Review):

      I am not a specialist in cryo-EM, so cannot comment on the technicalities of the structure reconstruction or methods used. I thus focus on the conclusions and observations that the authors provide in the manuscript and their relevance to functional photosynthesis.

      The authors attempt to resolve the structure of PSII from Dunaliella and noticed that three types of PSII could be identified: two conformational states, and a stacked configuration. There is no doubt that these structures add to our current knowledge of PSII and that they exist in abundance upon solubilisation of the sample. My main issue however is the relevance to in vivo conditions, and the efforts to exclude the possibility that pigment loss and conformational states and stacking are a reflection of ex-vivo manipulations.

      Our compact model contains 202 Chl molecules, while the stretched conformation contains 206 Chls. All of the differences in Chl binding are attributed to CP29. We have compiled a table enumerating the different CP29 structures currently available from plants and green algae at similar resolution to our work (Supplementary table 2). In the larger plant complexes (C2S2M2) CP29 contains 14 chls, while CP29 in smaller C2S2 complexes contains 10-13 chls, so it appears that some chl loss from CP29 is associated with the release of LHCIIM. In the green algal structures, CP29 contains fewer chls in general and shows a similar trend. The currently published structure most relevant to our work contains 8 chls (6KAC), a somewhat lower number than both the compact and stretched models (9 and 11 chls, respectively). The stretched orientation, which is the closest match to the known PSII core arrangement, therefore contains more chls than comparable models. While the in-vivo configuration is not known in the sense that it could contain more chls, the current structure is apparently the closest representation of it.

      The presence of CP29 with lower chl content in the Chlamydomonas C2S2 (6KAC, which is in a stretched orientation) supports the conclusion that pigment loss from CP29 alone is not sufficient to trigger the stretched-to-compact transition, although it is associated with it. In general, the precise orientation of CP29 is variable and seems to depend on the binding of additional LHCII; it is possible that some chl loss accompanies these changes in vivo.

      I see a number of questions pertaining to this work. Starting from the two conformations of PSII, compact and stretched, the authors say that both are highly active based on oxygen measurements at a saturating light intensity. In the meantime, they report large variations in the chl content and positions of the chlorophyll molecules in these structures (also compared to other known PSIIs). This gives the impression that one can lose two chlorophylls, and freely modify the distance between others without losing efficiency, certainly a risky conclusion. Are the samples highly active also in light-limiting conditions? It is thought that even tiny movements and alterations in chl-chl distances alter their coupling and spectral properties, how come the variations in this report are so huge? In other words, the assay tests the charge separation activity of the PSII RC in the preps, but not the light-harvesting efficiency.

      The chl content differences reported in this work amount to 2%. In our opinion this represents quite a low variation in pigment content, of the kind that exists in virtually any experiment involving large complexes. We agree that measurements of activity in limiting light conditions are interesting; however, this goes beyond the scope of the current work. Light harvesting efficiency in PSII is known to vary substantially as a result of additional mechanisms (NPQ in some of its forms), not associated with chl loss or gain. While the formation of quenching centers is attributed to small structural changes within specific pigment protein complexes, what we are showing in this work are structural changes between pigment protein complexes. These can affect transfer rates between the different complexes but are distinct from the structural changes thought to accompany the formation of quenching centers within specific pigment protein complexes.

      How does one ascertain that the lost chlorophyll molecules in CP29 are not a preparation error? Does slightly increasing the detergent concentration impact the proportion of stretched:compact forms?

      The effect of detergent concentration on the proportion of the different forms was not tested directly. However, we do not detect many differences in the content of lipids or bound detergent molecules between the two conformations, suggesting that for these “ligands” the differences are not substantial. We can only distinguish these two forms at the very last stages of data processing; at the present state of cryoEM cost and time availability, mapping the effect of detergent concentration on the different orientations is outside our reach.

      On a similar note, how do the authors exclude that a certain interaction with this type of grid impacts the distribution of these complexes? Is it identical to a biologically separate preparation of algae? In case of discoveries of this type, it is of high importance to exclude as many possibilities of non-native conditions or influences on the structure.

      It’s hard to completely exclude grid and sample preparation issues. However, we employed relatively standard grids and vitrification conditions. The observed complexes are embedded in vitrified ice and do not interact with the grid directly. The differences we observed are mainly in the orientations of the PSII cores; all the interactions between PSII subunits within each core are preserved and agree with previously published structures. Since the interactions within the core and between cores involve the same physical principles, we think it’s fairly conservative to conclude that the observed core orientations are not an artefact of sample preparation.

      I would further like to encourage the authors to elaborate on the CP29 phosphorylation. What is the proportion of PSIIcomp that are phosphorylated? I assume it is not 100%, as in this case, the authors would propose that this is the effect that modulates between compact and stretched architectures.

      It’s difficult to estimate the proportion of observed phosphorylation/sulfinylation. To be detected in maps, most of the residues (above 50%) are probably modified. We attempted to estimate this by refining the atom occupancies of the Pi molecule on Ser84 and the oxygens attached to Cys218; both values suggested that about 70% of the complexes are modified. With regard to the possibility that these modifications can promote the formation of the compact state, we think that this is certainly a possibility, since these modifications were detected in this state and are in close proximity to each other. However, this can also result from the resolution differences of the maps, and the structural implications of both modifications are hard to predict. At this point we prefer to note their existence without further interpretation.

      In line 290, the authors highlight the structural heterogeneity within the two groups' PSII conformations. I would like to see how does the distribution look like for all the structures together: are the two (stretched and compact) specifically forming two heterogenous distributions? Or is it possible that the distribution between the two is quasi-continuous? In other words, if the structures are not perfectly defined, how do the authors decide that two- and not more or less subtypes exist?

      We went back and refined the initial particle group (containing both compact and stretched orientations) using multibody refinement with masks defining the two PSII monomers. This analysis showed the expected two peaks only in the first principal component, which accounted for ~38% of the variance in the dataset.

      Multibody refinement carried out on the combined particle dataset shows one very large PC accounting for about 38% of the variance and the presence of two distinct peaks in the particle distribution of the first PC.

      From this analysis it’s clear that there are two distinct classes in this particle set (as expected). As none of the other PCs shows any sign of multiple peaks, this analysis suggests that two distinct models are the best representation of this eukaryotic PSII. Whether these are quasi-continuous or distinct is more complex. There is continuity in this representation (particle distributions along the PC); a different picture may appear if characters such as CP29 state are considered, but the size of CP29 and the remaining heterogeneity do not provide enough signal to carry out this classification at the moment.
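
      The logic of looking for multiple peaks along a principal component can be sketched on synthetic data. This is only an illustration: the per-particle parameters, their dimensionality, the separation between the two states and the "compact"/"stretched" labels are hypothetical stand-ins and are not derived from the cryo-EM dataset or from any refinement software.

```python
import numpy as np

rng = np.random.default_rng(2)
n_particles = 2000

# Hypothetical per-particle rigid-body parameters (12 values per particle).
params = rng.normal(0.0, 1.0, size=(n_particles, 12))

# Two hypothetical states that differ along a single direction of relative motion.
state = rng.integers(0, 2, size=n_particles)      # 0 = "compact", 1 = "stretched"
params[:, 0] += np.where(state == 1, 2.5, -2.5)

# PCA via singular value decomposition of the centred parameter matrix.
centered = params - params.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Distribution of particles along the first principal component.
pc1_scores = centered @ vt[0]
counts, _ = np.histogram(pc1_scores, bins=21)

print(f"PC1 explains {explained[0]:.0%} of the variance")
print("histogram along PC1:", counts)
```

      In this toy case the histogram along the first component shows two peaks separated by a sparsely populated centre, while the remaining components are unimodal. Two peaks along one component indicate two preferred states but do not by themselves rule out a continuum of intermediates between them.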

      Considering the stacked PSII, I also have a few concerns. Contrary to previous studies the authors do not assign a functional role to the stacking beyond the structural aspect. This could be better backed by a discussion about the closest chlorophyll a molecules across the stacked PSII, which given the rather large distance shown in fig. 4L seems to be too large for any EET across the stromal gap.

      The closest chl-chl distance that we can measure in the stacked PSII dimer is ~54 Å, with most distances in the ~70 Å range, making EET between stacked complexes very slow. We have added a statement clarifying this to our manuscript. In our opinion a structural role for the stacked PSII dimer is more likely.
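
      As a rough order-of-magnitude illustration of why such distances imply slow transfer, the sketch below assumes Förster-type transfer, in which the rate scales as the inverse sixth power of the donor-acceptor distance, with all other factors (orientation, spectral overlap) held equal. The 10 Å reference distance is a hypothetical value chosen only for scale and is not taken from the structure.

```python
def forster_rate_ratio(r_near, r_far):
    """Ratio k(r_near) / k(r_far) for a transfer rate that scales as 1 / R**6."""
    return (r_far / r_near) ** 6

# 54 Å is the closest cross-stack chl-chl distance quoted above;
# 10 Å is a hypothetical short-range reference distance.
slowdown = forster_rate_ratio(r_near=10.0, r_far=54.0)
print(f"Transfer over 54 Å is roughly {slowdown:.0f} times slower than over 10 Å")
```

      Under these assumptions the slowdown is about four orders of magnitude, and it would be larger still for the more typical ~70 Å distances.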

      There is a report that suggests the presence of some density between the stacked PSII - could the authors comment on the differences between it and their work? Are the angles and positions conserved between these types of stacks? https://doi.org/10.1038/s41598-017-10700-8

      We referred to Albanese et al. in our manuscript. We isolated the C2S2 complex from a green alga, whereas the analysis in Albanese et al. was done on C2S2M1 complexes from pea, and this can account for some of the differences. At any rate, our conclusion that we don’t find any evidence for protein linkers in the stacked complex is stated clearly. The angles described in Albanese et al. are consistent with our analysis.

      Line 387, the authors state that due to the transient nature of the interactions across the stromal gap, the stacks could be "under-detected" in cryo-ET data. This statement is in my opinion misformulated. For once, the transient interaction argument would apply the same (if not more due to changing conditions induced by the purification process) to the single particle analysis performed in this paper. Second, tomographic volumes detect hundreds of PSII in a suspended state. Any transient interaction that adds up to 25% of particle population in a steady state cell should be clearly visible, while the in situ data suggests not more than random cross-stromal-gap orientations. Of course, this can be a specificity of Chlamydomonas or a particular growth condition. The statement used by the authors could be indeed converted into: the PSII stacks are over-detected in vitro, and it is certainly a simpler explanation for their presence. It is also important to mention that PSII stacking alone is not the only reason for grana architecture - stacking with the antenna of larger complexes, absent in the authors' preparation could also contribute to grana maintenance; and auxiliary proteins such as CURT help with this issue as well. Here a recent demonstration of the importance of minor antenna should probably be also cited: https://doi.org/10.1101/2021.12.31.474624

      We used the term “flexible” rather than “transient” to describe the interactions within the stacked PSII dimer. Our data (and tomographic data) do not contain any temporal component. By “under-detected” we refer to the fact that PSII is mainly detected by the luminal extrinsic subunits. The flexibility detected in our analysis may affect the concurrent visibility of these features in the PSII complexes making up an individual PSII stack. Specifically, Wietrzynski et al. mainly analyze C2S2M2L2 complexes, while our analysis only contained C2S2 complexes. It is likely that the different amount of bound LHCII affects PSII stacking as well. For example, Wietrzynski et al. show some overlap between LHCII complexes and little overlap between cores in the larger complexes they analyzed. We observe mainly core-to-core overlap with little LHCII overlap in the smaller C2S2, although we did not observe any states where LHCs were not included in what appears to be the binding interface. We agree with the reviewer on the relevance of Lhcb and CURT contributions to stacking but prefer to focus on what was directly demonstrated in our data. We clearly note that we are discussing in-vitro results.

      Taking these last thoughts, I would like to finish by mentioning one more thing - almost philosophical. The authors are certainly at the forefront of the booming cryoEM revolution in biology which is profoundly changing the way we understand the living. There is absolutely zero doubt that this powerful technique is of the highest interest. But a growing number of structures of photosynthetic complexes remain puzzling, in particular with regard to their abundance in vivo (such as the PSII stacks) and functional relevance. How do we ascertain that these interactions are not due to in vitro preparation (isolation from cells, solubilisation)? Which ways can we use to try to exclude this (simple) hypothesis? I suggest that at least a small extent of biological replicas - experiments performed on separate batches, in different technical conditions, with slightly altered solubilization conditions, and so on - could shed light on the nature of these structures and their occurrence in vivo. Technical reps of the freezing+analysis pipeline could also be tried to see the variability. This would strongly reinforce this manuscript and its conclusions, and while not completely unequivocal (the stacked PSII, for example, could form upon each purification), a quantification of the effects would be of high interest.

      We certainly share the reviewer's hope of being able to conduct cause-and-effect cryoEM experiments covering a complete set of experimental parameters. This is still beyond reach in terms of time and cost. Within each cryoEM experiment, however, all the analysis is consistent and, more importantly, transparent with regard to image analysis, which is the most important factor in our opinion. Preparation artefacts are always a possibility but, in our opinion, cryoEM is not affected by them differentially compared to other techniques. As we mentioned above, the particles are observed suspended in vitreous ice. This is no different from, and arguably better than, numerous low-temperature spectroscopic observations on samples suspended in a glass state, or crystals obtained in the presence of high concentrations of various agents. One thing that validates structural studies is the chemical detail (bond lengths, angles, etc.) underlying every model, which is consistent with known values to close tolerances.

      Reviewer #3 (Public Review):

      In this manuscript, Caspy et al. present a detailed structural analysis of eukaryotic photosystem II (PSII) isolated from the green alga Dunaliella salina. By combining single-particle cryo-EM with multibody refinement, the authors not only reveal a high-resolution (2.4Å) structure of the eukaryotic PSII, but also demonstrate alternate conformations and intrinsic flexibility of the overall complex. Stretched and compact conformations of the PSII dimer were readily identified within the single-particle dataset. From this structural analysis, the authors propose that excitation energy transfer properties may be modulated by changes in transfer distance between key chlorophyll molecules observed in different conformational states of the PSII dimer. Due to the high resolution of the maps obtained, the authors identify post-translational modifications and a sodium binding site based on the observed cryo-EM maps. Additionally, the authors analyze PSII complexes in stacked and unstacked configurations, and find that compact and stretched states also exist within the stacked PSII complexes. From their cryo-EM maps, the authors demonstrate that there is no direct protein-protein interaction between stacked PSII complexes, and rather propose a model wherein long-range electrostatic interactions mediated by divalent cations such as magnesium, can facilitate PSII stacking.

      The conclusions and models presented in the manuscript are mostly well justified by the data. The cryo-EM maps are high quality and the models appear generally well refined. However, some aspects of data processing and analysis, as well as the resultant conclusions need to be clarified.

      1) In general, it is not clear from the cryo-EM processing workflow (suppl. Fig 1) or the methods section when exactly symmetry was applied during 3D classification and refinement. In the case of C2S2 unstacked particles, when was symmetry first applied in the overall processing workflow? To identify the compact and stretched configurations of C2S2, did the 3D classification without alignment (and/or the refinement preceding this classification) have C2 symmetry applied? If so, have you considered the possibility that some particles may actually be asymmetric in some regions?

      We modified Figure S1 to clearly indicate the use of symmetry and particle expansion. In general, we refined most of the particle sets without symmetry (C1). At the final processing stage of the unstacked PSII sets, after we separated both conformations, we used C2 symmetry to expand the data; this was followed by multibody refinement. No symmetry or symmetry expansion was used for the stacked PSII particle sets.

      2) Following multibody refinement in Relion individual maps and half-maps for each body will be generated. There is no mention in the methods of how these individual maps for each C2S2 "monomer" were combined to produce an overall map of the dimer following multibody refinement. There are several methods currently used to combine such maps, including taking the maximum or average of the two maps or using a model-based approach in phenix. The authors should be explicit about the method they used, any potential artifacts that may develop from this map combination process, and/or the interface between masks used in multibody refinement.

      We used phenix.combine_focused_maps to combine the maps. This is now indicated in the Methods section.

      3) In addition to the point raised above, following multibody refinement there will be an individual FSC curve and resolution for each body. However, in supplemental figure 2 and supplemental table 1, only a single FSC curve and resolution are reported. Are these FSC curves/resolutions only reported for the better of the two bodies? If not, how was a single resolution calculated for the overall map of combined bodies?

      Both FSC curves were calculated and were highly similar, as expected following C2 expansion. This can also be evaluated from the local resolution maps, which are highly similar between the two bodies. The reported resolutions are all taken from the displayed FSC curves generated through RELION PostProcess.

      4) One of the major conclusions from the 3D classification and multibody refinement is that conformational changes and inherent flexibility of the PSII dimers have the potential to change distances between cofactors in the complex, ultimately leading to altered excitation energy transfer. However, it is unclear whether or not the authors believe one conformation over another may more readily support the evolution of oxygen. It would be nice if the authors could elaborate slightly upon this topic in the discussion.

      As discussed above, the structural changes associated with the formation of quenching centers are not expected to be detected in the current work. The changes we observe can, however, affect the transfer to such centers and by doing so can play an important part in PSII biology. We do not detect any changes around the OEC, and we find no reason to think the two conformations differ with respect to their ETC.

      5) Along the lines of point 4 above, on line 95 the authors claim that "the high specific activity of 816 umol O2/ (mg Chl * hr) suggest that" both the C2S2 compact and stretched conformation are highly active. However, it is not clear to me why this measure of specific activity would suggest that both PSII conformations should have "high" activity. Maybe a reference here would help guide readers to previous measures of specific activity?

      Looking at specific activity from previously published structural studies on eukaryotic PSII, we find that Sheng et al., 2019 reported a specific activity of 272 μmol O2/(mg Chl * hr). This difference can stem partially from the presence of larger complexes in their preparation, and the value is comparable to the activity that we measured in our As fraction (276 μmol O2/(mg Chl * hr), Figure 1-figure supplement 9). Reported specific activity values from plants (Pisum sativum) are also similar: Su et al. reported a maximal value of 288 μmol O2/(mg Chl * hr), again for larger complexes, which can explain some of the difference. However, the specific activity measured for the C2S2 PSII isolated in the current study is 2.8 times this value, more than the difference in Chl content, which ranges between 1.5-fold and 2-fold in favor of the larger complexes. If either one of the conformations were less active, the other conformation would have to display an even higher specific activity, which seems less likely. In addition, we find no difference around the oxygen evolution center or in the peripheral luminal subunits in either shape or map strength, so both conformations show highly similar structures around the regions that determine oxygen evolution activity.
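
      The comparison above is simple arithmetic, and a short sketch can make it explicit. The activity values are the ones quoted in this response and in the reviewer's comment (816, 272, 276 and 288, all in μmol O2/(mg Chl * hr)); the variable and function names are illustrative only and do not come from the manuscript.

```python
# Specific activities quoted in the text, in umol O2 / (mg Chl * hr).
c2s2_this_study = 816.0  # C2S2 PSII, value quoted by the reviewer from the manuscript

reported = {
    "Sheng et al. 2019": 272.0,
    "As fraction (this study)": 276.0,
    "Su et al. (Pisum sativum)": 288.0,
}

def fold_change(value, reference):
    """Ratio of one specific activity to a reference activity."""
    return value / reference

ratios = {name: fold_change(c2s2_this_study, ref) for name, ref in reported.items()}

# Upper bound of the Chl-content difference cited in the text (1.5x to 2x).
max_chl_content_ratio = 2.0

for name, ratio in ratios.items():
    exceeds = ratio > max_chl_content_ratio
    print(f"{name}: {ratio:.2f}x (exceeds Chl-content difference: {exceeds})")
```

      Under these quoted values, every fold change is larger than the maximal Chl-content difference, which is the point the argument relies on.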

      6) It is claimed that "more than 2100 water molecules were detected in the C2S2 compressed model", and the water distribution is shown in Figure 3. Obtaining resolutions capable of visualizing waters with cryo-EM is still a significant challenge. Upon visual inspection of the map supplied, it appears that several of the waters that were built into the atomic model simply do not have supporting peaks in the coulomb potential map above the level of noise. While some of the modeled waters are certainly supported by the map, in my opinion, there are many waters that simply are not, or at best are questionable. What method or tool was originally used to build waters into the model, and how were these waters subsequently validated during structure refinement?

      We followed standard methods for water placement and refinement in the preparation of the model, in addition to manually curating the water structure. However, in light of the reviewer's comment we undertook additional rounds of refinement and inspection of the water molecules in the model. We removed a few hundred water molecules, so the total number of water molecules is now around 1700. All the water molecules in the present model should be well supported at map values higher than 2.5 sigma, and in our opinion the current water model should be regarded as conservative and as underestimating the number of bound water molecules. This also led to some improvements in additional validation statistics of the model, which are listed in Table 1. The new model has been deposited in the PDB and the new PDB validation report is included in our resubmission.

      7) The authors claim to identify several unique map densities during model building. One of these is a sodium ion close to the OEC, which is coordinated by D1-His337, several backbone carbonyls, and a water molecule. When looking closely at the cryo-EM map supplied, it appears that the coulomb potential map is quite weak for this sodium, and is only visible at quite low contour levels. In fact, the features for the coordinating water, and chloride ions located ~7-9A away are much stronger than the sodium. Do the authors have any explanation for why the cryo-EM map is significantly weaker for the sodium compared to the coordinating water or chloride ions in the same general vicinity? Similar to what they did for the other post-translational modifications, the authors should consider showing the actual cryo-EM map for the bound sodium in supplemental Figure 10 a,b.

      Our main support for the placement of a Na+ ion in this location stems from the analysis of Wang et al. Our maps show a density which is discernible at 4 σ, with an elongated shape suggesting the presence of multiple atoms/waters. Although in principle positive ions should have very strong densities in cryoEM maps due to their interactions with electrons, other factors such as occupancy, coordination and B-factor also play a role, making the distinction between water and sodium complicated and case specific. The sodium peak is not observed in unsharpened maps (as is the case for most of the water molecules which occupy conserved positions).

        We collected a few examples from comparable cases (cryo-EM maps of similar resolution ranges) where the presence of sodium ions is highly probable based on additional evidence. These map densities highlight the factors we discussed above. In cases ‘a’ (dual oxidase 1 prepared in high sodium conditions) and ‘b’ (human voltage-gated sodium channel), Na+ is observed in a highly coordinated state and, especially in ‘a’, shows the expected increase in density values compared to water molecules. However, cases ‘d’ (human Na+/K+ P-type ATPase) and ‘e’ (voltage-gated sodium channel) appear very similar to the proposed Na+ assignment in PSII. We conclude that map density alone is not enough to distinguish between Na+ and water molecules, and we rely on the additional experiments described by Wang et al., which show increased PSII activity at elevated Na+ levels in basic conditions.

      8) The cryo-EM maps showing CP29-Ser84 phosphorylation and CP47-Cys218 sulfinylation are quite convincing. However, it is interesting that these modifications are only observed in the compact conformation, and not in the stretched conformation. Can the authors elaborate on whether or not they believe the compact and stretched conformations could be a result of these posttranslational modifications, or vice versa?

      This is an interesting suggestion. In our opinion it is less likely that the modifications themselves trigger the transition between compact and stretched states. It is not clear how these modifications would stabilize the compact versus the stretched state. It is equally likely that these modifications are somehow triggered by the structural change. We cannot be certain that these modifications are not present in the stretched conformation as well but remain unobserved due to resolution differences. The correlation between the states and the post-translational modifications should be verified before any discussion of their possible roles in the transitions.

      9) Do the authors believe that PSII dimers in the solution can readily interconvert between compact and stretched conformations? Or is the relative ratio of these conformations fixed at the time of membrane solubilization with decyl-maltoside?

      We think that it is more probable that the transition between these states occurs in the membrane phase. The main reason is that pigment loss and structural transitions in CP29 are more likely to occur in the membrane than in aqueous/micelle environments.

      10) The model proposed for divalent cation-mediated stacking of PSII dimers is compelling, and seems to be in agreement with previous investigations that observed a lack of stacked dimers in cryo-EM preparations lacking calcium/magnesium. However, my understanding from reading the methods section is that the observed lack of density between the stacked PSII dimers was inferred from maps obtained after multibody refinement. Based on the way the masks to define bodies were created for multibody refinement (Fig. 4A), the region between stacked dimers would be highly prone to map artifacts following multibody refinement. Have the authors looked closely at the interfacial region between stacked dimers following conventional 3D classification/refinement to ensure that there are indeed no features observed in the interfacial region even at low contour levels?

      We have made several attempts to resolve differences in the space between the stacked PSII dimers. These include focused classification with masks containing selected volumes from this region, and masks that include only one of the stacked PSII dimers to avoid signal subtraction in this region. None of these revealed any discernible features in this region. In addition, any stable binding of a bridging protein across the stacked dimer would probably be at least partially visible as additional density over the unstacked PSII. We searched for such features and found none.

    1. Author Response:

      Reviewer #1:

      This manuscript by Gabor Tamas' group defines features of ionotropic and metabotropic output from a specific cortical GABAergic cell type, so-called neurogliaform cells (NGFCs), by using electrophysiology, anatomy, calcium imaging and modelling. Experimental data suggest that NGFCs converge onto postsynaptic neurons with sublinear summation of ionotropic GABAA potentials and linear summation of metabotropic GABAB potentials. The modelling results suggest a preferential spatial distribution of GABA-B receptor-GIRK clusters on the dendritic spines of postsynaptic neurons. The data provide the first experimental quantitative analysis of the distinct integration mechanisms of GABA-A and GABA-B receptor activation by the presynaptic NGFCs, and especially give insights into the logic of the volume transmission and the subcellular distribution of postsynaptic GABA-B receptors. Therefore, the manuscript provides novel and important information on the role of the GABAergic system within cortical microcircuits.

      We have made all changes humanly possible under the current circumstances and we are open to further suggestions deemed necessary.

      Reviewer #2:

      The authors present a compelling study that aims to resolve the extent to which synaptic responses mediated by metabotropic GABA receptors (i.e. GABA-B receptors) summate. The authors address this question by evaluating the synaptic responses evoked by GABA released from cortical (L1) neurogliaform cells (NGFCs), an inhibitory neuron subtype associated with volume neurotransmission, onto Layer 2/3 pyramidal neurons. While response summation mediated by ionotropic receptors is well-described, metabotropic receptor response summation is not, thereby making the authors' exploration of the phenomenon novel and impactful. By carrying out a series of elegant and challenging experiments that are coupled with computational analyses, the authors conclude that summation of synaptic GABA-B responses is linear, unlike the sublinear summation observed with ionotropic, GABA-A receptor-mediated responses.

      The study is generally straightforward, even if the presentation is often dense. Three primary issues worth considering include:

      1) The rather strong conclusion that GABA-B responses linearly summate, despite evidence to the contrary presented in Figure 5C.

      2) Additional analyses of data presented in Figure 3 to support the contention that NGFCs co-activate.

      3) How the MCell model informs the mechanisms contributing to linear response summation.

      These and other issues are described further below. Despite these comments, this reviewer is generally enthusiastic about the study. Through a set of very challenging experiments and sophisticated modeling approaches, the authors provide important observations on both (1) NGFC-PC interactions, and (2) GABA-B receptor mediated synaptic response dynamics.

      The differences between the sublinear, ionotropic responses and the linear, metabotropic responses are small. Understandably, these experiments are difficult – indeed, a real tour de force – from which the authors are attempting to derive meaningful observations. Therefore, asking for more triple recordings seems unreasonable. That said, the authors may want to consider showing all control and gabazine recordings corresponding to these experiments in a supplemental figure. Also, why are sublinear GABA-B responses observed when driven by three or more action potentials (Figure 5C)? It is not clear why the authors do not address this observation considering that it seems inconsistent with the study's overall message. Finally, the final readout – GIRK channel activation – in the MCell model appears to summate (mostly) linearly across the first four action potentials. Is this true and, if so, is the result inconsistent with Figure 5C?

      GABAB responses elicited by three and four presynaptic NGFC action potentials were investigated to better understand the limits of the NGFC-PC connection. Although our spatial model suggests that, at a single volumetric point in L1, one or two NGFCs could provide a GABAB response through their respective volume transmission, it remains important that in a minority of cases three or more NGFCs could converge their output. The experiments in Fig. 5 not only offer the mechanistic insight that possible HCN channel activation and GABA reuptake do not significantly influence the summation of metabotropic receptor-mediated responses, but also provide additional information about extensive GABAB signaling from more than two NGFC outputs. Interestingly, in this experiment the summation up to two action potentials shows linear integration very similar to that seen in the triplet recordings. This result suggests that temporal and spatial summation are identical when a limited number of inputs arrive at the postsynaptic target cell. A similar summation can be seen in our model up to two consecutive GABA releases. Three or four consecutive GABA releases still produce linear summation in our model, whereas our experiments show moderate sublinearity. One possible explanation for this inconsistency is vesicle depletion in NGFCs after multiple rapid releases of GABA, which was not taken into account in our model.

      Presumably, the motivation for Figure 3 is that it provides physiological context for when NGFCs might be coactive, thereby providing the context for when downstream, PC responses might summate. This is a nice, technically impressive addition to the study. However, it seems that a relevant quantification/evaluation is missing from the figure. That is, the authors nicely show that hind limb stimulation evokes responses in the majority of NGFCs. But how many of these neurons are co-active, and what are their spatial relationships? Figure 3D appears to begin to address this point, but it is not clear if this plot comes from a single animal, or multiple? Also, it seems that such a plot would be most relevant for the study if it only showed alpha-actin 2-positive cells. In short, can one conclude that nearby, presumptive NGFCs co-activate, and is this conclusion derived from multiple animals?

      The aim of Fig. 3D was to indicate that the active cells, presumably NGFCs, are located close to each other. The figure comes from a single animal. We agree with the reviewer and have therefore replaced the scatter plot in Fig. 3D with one that provides information about the molecular profiles of the active/inactive cells. We made an effort to further analyze our in vivo data and the spatial localization of the monitored interneurons (see Author response image 3). The results are from 4 different animals; in these experiments numerous L1 interneurons are active during the sensory stimulus, as shown in the scatter plot. We calculated the shortest distance between all active cells and all ɑ-actinin2+ cells that were active in the experiments. The data suggest that in the case of identified active ɑ-actinin2+ cells, the interneuron somata were on average 182.69 ± 60.54 or 305.135 ± 34.324 μm from each other. Data from Fig. 2D indicate that the average axonal arborization of the NGFCs reaches ~200-250 μm away. Taking these two results together, it is in theory probable that the spatial arrangement would allow neighboring NGFCs to interact directly at the same point in space.
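
      A shortest-distance analysis of this kind can be sketched as a nearest-neighbor calculation on soma coordinates. The sketch below is only an illustration under stated assumptions: the coordinates are hypothetical (not measured data), the function names are ours, and the exact procedure used for the in vivo data may differ.

```python
import math

def nearest_neighbor_distances(points):
    """For each point, return the distance to its closest other point."""
    distances = []
    for i, p in enumerate(points):
        others = (math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(min(others))
    return distances

def mean_and_sd(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical (x, y) soma positions of active cells, in micrometres.
active_somata = [(0.0, 0.0), (150.0, 0.0), (150.0, 200.0), (400.0, 200.0)]

nn = nearest_neighbor_distances(active_somata)
mean_nn, sd_nn = mean_and_sd(nn)

# Assumed axonal reach, taken from the ~200-250 um range quoted in the text.
axonal_reach_um = 250.0
within_reach = sum(d <= axonal_reach_um for d in nn)

print(f"nearest-neighbor distances: {nn}")
print(f"mean = {mean_nn:.1f} um, sd = {sd_nn:.1f} um, within reach: {within_reach}/{len(nn)}")
```

      Comparing each nearest-neighbor distance with the axonal reach gives a direct count of cells whose closest active neighbor lies within range, which is the comparison the paragraph above makes with averages.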

      The inclusion of the diffusion-based model (MCell) is commendable and enhances the study. Also, the description of GABA-B receptor/GIRK channel activation is highly quantitative, a strength of the study. However, a general summary/synthesis of the observations would be helpful. Moreover, relating the simulation results back to the original motivation for generating the MCell model would be very helpful (i.e. the authors asked whether "linear summation was potentially a result of the locally constrained GABAB receptor - GIRK channel interaction when several presynaptic inputs converge"). Do the model results answer this question? It seems as if performing "experiments" on the model wherein local constraints are manipulated would begin to address this question. Why not use the model to provide some data – albeit theoretical – that begins to address their question?

      We re-formulated the problem to be addressed in this Results section. We acknowledge in the Discussion that our model has several limitations and, consequently, we restricted its application to a limited set of quantitative comparisons paired to our experimental dataset or directly related to pioneering studies on GABAB efficacy on spines vs shafts. We believe that a proper answer to the reviewer’s suggestion would be worth a separate and dedicated study with an extended set of parameters and an elaborated model.

      In sum, the authors present an important study that synthesizes many experimental (in vitro and in vivo) and computational approaches. Moreover, the authors address the important question of how synaptic responses mediated by metabotropic receptors summate. Additional insights are gleaned from the function of neurogliaform cells. Altogether, the authors should be congratulated for a sophisticated and important study.

      Reviewer #3:

      The authors of this manuscript combine electrophysiological recordings, anatomical reconstructions and simulations to characterize synapses between neurogliaform interneurons (NGFCs) and pyramidal cells in somatosensory cortex. The main novel finding is a difference in summation of GABAA versus GABAB receptor-mediated IPSPs, with a linear summation of metabotropic IPSPs in contrast to the expected sublinear summation of ionotropic GABAA IPSPs. The authors also provide a number of structural and functional details about the parameters of GABAergic transmission from NGFCs to support a simulation suggesting that sublinear summation of GABAB IPSPs results from recruitment of dendritic shaft GABAB receptors that are efficiently coupled to GIRK channels.

      I appreciate the topic and the quality of the approach, but there are underlying assumptions that leave room to question some conclusions. I also have a general concern that the authors have not experimentally addressed mechanisms underlying the linear summation of GABAB IPSPs, reducing the significance of this most interesting finding.

      1) The main novel result of broad interest is supported by nice triple recording data showing linear summation of GABAB IPSPs (Figure 4), but I was surprised this result was not explored in more depth.

      We have chosen the approach of studying GABAB-GABAB interactions through the scope of neurogliaform cells and explored how neurogliaform cells as a population might give rise to the summation properties studied with triple recordings. This was a purposeful choice admittedly neglecting other possible sources of GABAB-GABAB interactions which possibly take place during high frequency coactivation of homogeneous or heterogeneous populations of interneurons innervating the same postsynaptic cell. We agree with the reviewer that the topic of summation of GABAB IPSPs is important and in-depth mechanistic understanding requires further separate studies.

      2) To assess the effective radius of NGFC volume transmission, the authors apply quantal analysis to determine the number of functional release sites to compare with structural analysis of presynaptic boutons at various distances from PC dendrites. This is a powerful approach for analyzing the structure-function relationship of conventional synapses but I am concerned about the robustness of the results (used in subsequent simulations) when applied here because it is unclear whether volume transmission satisfies the assumptions required for quantal analysis. For example, if volume transmission is similar to spillover transmission in that it involves pooling of neurotransmitter between release sites, then the quantal amplitude may not be independent of release probability. Many relevant issues are mentioned in the discussion but some relevant assumptions about QA are not justified.

      Indeed, pooling of neurotransmitter between release sites may affect quantal amplitude. We therefore examined quantal amplitude under low release probability conditions, using 0.7-1.5 mM [Ca]o to detect postsynaptic uniquantal events initiated by neurogliaform cell activation (Author response image 7). In this way we measured quantal current amplitudes similar to those obtained with the BQA method, with no significant difference (4.46±0.83 pA, n=4, P=0.8, Mann-Whitney test).

      3) The authors might re-think the lack of GABA transporters in the model since the presence and characteristics of GATs will have a large effect on the spread of GABA in the extracellular space.

      We agree that the presence of GATs could effectively shape the GABA exposure, e.g. (Scimemi 2014). During the development of the model, we took into consideration different possibilities and solutions to create the model’s environment. To our knowledge, there is no detailed electron microscopic study that provides ultrastructural measurements of structural elements around the NGFC release sites and postsynaptic pyramidal cell dendrites in layer 1 while preserving the extracellular space. Moreover, quantitative information is scarce about the exact localization and density of the GATs along the membrane surface of glial processes around confirmed NGFC release sites. We felt that developing a functional environment containing GABA transporters without such information would be speculative. Furthermore, during the development of the model it became clear that incorporating thousands of differentially located GABA transporters would massively increase the processing time of single simulations. Each interaction between GATs and GABA molecules would have to be monitored, and computational power would be spent calculating the diffusion of GABA molecules in the extracellular space even when those molecules are far from the postsynaptic dendritic site and do not interact with it.

      As an admittedly simple and constrained alternative, we decided to set a decay half-life for the GABA molecules released. This approach allows us to mimic the GABA exposure time of 20-200 ms, based on experimental data (Karayannis et al 2010). In the model the GABA exposure time was 114.87 ± 2.1 ms with decay time constants of 11.52 ± 0.14 ms. After ~200 ms all the released GABA molecules disappeared from the simulation environment.
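
      The relationship between a decay time constant and a half-life can be illustrated with a minimal sketch. It assumes a pure single-exponential (first-order) decay of free GABA, which is a simplification and not a description of the MCell implementation; only the 11.52 ms time constant and the ~200 ms time point are taken from the text, and the function names are hypothetical.

```python
import math

def half_life_from_tau(tau_ms):
    """Half-life of a first-order decay with time constant tau."""
    return tau_ms * math.log(2)

def fraction_remaining(t_ms, tau_ms):
    """Fraction of molecules left after t_ms for a single-exponential decay."""
    return math.exp(-t_ms / tau_ms)

tau_ms = 11.52  # decay time constant quoted in the text (ms)

t_half_ms = half_life_from_tau(tau_ms)
left_at_200_ms = fraction_remaining(200.0, tau_ms)

print(f"half-life = {t_half_ms:.2f} ms")
print(f"fraction remaining at 200 ms = {left_at_200_ms:.2e}")
```

      Under this assumption only a negligible fraction of the released molecules would remain at 200 ms, in line with the statement that essentially all GABA has left the simulation environment by then.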

      A detailed treatment of extracellular diffusion was outside the scope of our model; we were interested in investigating how the subcellular localization of receptors and channels determines the summation properties.

      4) I'm not convinced that the repetitive stimulation protocol of a single presynaptic cell shown (Figure 5) is relevant for understanding summation of converging inputs (Figure 4), particularly in light of the strong use-dependent depression of GABA release from NGFCs. It is also likely that shunting inhibition contributes to sublinear summation to a greater extent during repetitive stimulation than summation from presynaptic cells that may target different dendritic domains. The authors claim that HCN channels do not affect integration of GABAB IPSPs but one would not expect HCN channel activation from the small hyperpolarization from a relatively depolarized holding potential.

      Use-dependent synaptic depression of NGFC-induced postsynaptic responses was nicely documented by Karayannis and coworkers (2010), although they investigated the GABAA component of the responses and found that the depression is caused by the desensitization of postsynaptic GABAA receptors. We are not aware of published experiments on the short-term plasticity of GABAB responses. In our experiments represented in Fig. 5 we found linearity in the summation of GABAB responses up to two action potentials and sublinearity for 3 and 6 action potentials. In fact, our results show that no synaptic depression is detectable in response to paired pulses, since the amplitudes of the voltage responses were doubled compared to a single pulse, which means that the paired-pulse ratio is around 1. To verify our result, we repeated our dual recording measurements with one, two, three and four spikes initiated in the presynaptic neurogliaform cell (Author response image 6). Measuring both the amplitude and the overall charge of GABAB responses, we again found a linear relationship between the one-spike and two-spike protocols.
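
      The reasoning that a doubled response implies a paired-pulse ratio of about 1 can be written out as a small sketch. The amplitudes below are hypothetical normalized values, not recorded data, and both helper functions are illustrative definitions that assume the individual responses add linearly.

```python
def paired_pulse_ratio(single_amp, paired_amp):
    """Amplitude contributed by the second pulse relative to the first,
    assuming the two responses superimpose linearly."""
    second = paired_amp - single_amp
    return second / single_amp

def linearity_index(single_amp, burst_amp, n_spikes):
    """Measured burst response divided by the linear prediction n * single.
    1.0 means linear summation; values below 1.0 mean sublinear summation."""
    return burst_amp / (n_spikes * single_amp)

# Hypothetical normalized amplitudes (single-spike response = 1.0).
single = 1.0
paired = 2.0        # response doubled by the second spike
three_spikes = 2.4  # illustrative sublinear response to three spikes

ppr = paired_pulse_ratio(single, paired)
index_two = linearity_index(single, paired, 2)
index_three = linearity_index(single, three_spikes, 3)

print(f"paired-pulse ratio = {ppr:.2f}")
print(f"linearity index, 2 spikes = {index_two:.2f}; 3 spikes = {index_three:.2f}")
```

      With these illustrative numbers the two-spike response sits exactly on the linear prediction, while the three-spike response falls below it, mirroring the pattern described above.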

      Author response image 6 - Integration of GABAB receptor-mediated synaptic currents. (A) Representative recording of neurogliaform synaptic inhibition on a voltage-clamped pyramidal cell. Bursts of up to four action potentials were elicited in NGFCs at 100 Hz in the presence of 1 μM gabazine and 10 μM NBQX. (B) Summary of normalized IPSC peak amplitudes (left) and charge (right). (C) Pharmacological separation of neurogliaform-initiated inhibitory current.

    1. Author Response:

      Reviewer #1:

      The paper uses a microfluidic-based method of cell volume measurement to examine single cell volume dynamics during cell spreading and osmotic shocks. The paper successfully shows that the cell volume is largely maintained during cell spreading, but small volume changes depend on the rate of cell deformation during spreading, and cell ionic homeostasis. Specifically, the major conclusion that there is a mechano-osmotic coupling between cell shape and cell osmotic regulation, I think, is correct. Moreover, the observation that fast deforming cell has a larger volume change is informative.

      The authors examined a large number of conditions and variables. It's a paper rich in data and general insights. The detailed mathematical model, and specific conclusions regarding the roles of ion channels and cytoskeleton, I believe, could be improved with further considerations.

      We thank the referee for the nice comment on our work and for the detailed suggestions for improving it.

      Major points of consideration are below.

      1) It would be very helpful if there is a discussion or validation of the FXm method accuracy. During spreading, the cell volume change is at most 10%. Is the method sufficiently accurate to consider 5-10% change? Some discussion about this would be useful for the reader.

      This is an important point and we are sorry if it was not made clear in our initial manuscript. We have now made it more clear in the text (p. 4 and Figure S1E and S1F).

      The important point is that the absolute accuracy of the volume measure is indeed in the 5 to 10% range, but the relative precision (repeated measures on the same cell) is much higher, rather in the 1% range, as detailed below based on experimental measures.

      1) Accuracy of absolute volume measurements. The accuracy of the absolute measure of the volume depends on several parameters which can vary from one experiment to the other: the exact height of the chamber, and the biological variability from one batch of cells to another (we found that the distribution of volumes in a population of cultured cells depends strongly on the details of the culture, e.g. seeding density and substrate, which we normalized as much as possible to reduce this variability, as described in previous articles, e.g. ref. 2). To estimate this variability overall, the simplest approach is to compare the average volume of the cell population in different experiments, carried out in different chambers and on different days.

      Graph showing the initial average volume of cells +/- STD for 7 spreading experiments and 27 osmotic shock experiments, expressed as a % deviation from the average volume over all the experiments.

      The average deviation is 10.9 +/- 8%.

      2) Precision of relative volume measurements. When the same cell is imaged several times in a time-lapse experiment, as it is spreading on a substrate, or as it is swelling or shrinking during an osmotic shock, most of the variability occurring from one experiment to another does not apply. To experimentally assess the precision of the measure, we performed high time resolution (one image every 30 ms) volume measurements of 44 spread cells during 9 s. During this period of time, the volume of the cell should not change significantly, thus giving the precision of the measure.

      Graph showing the coefficient of variation of the volume (STD/mean) for each individual cell (n=44) across the almost 300 frames of the movie. This shows that on average the precision of volume measurements for the same cell is 0.97±0.21%. In addition, if more precision was needed, averaging several consecutive measures can further reduce the noise, a method which is very commonly used but that we did not have to apply to our dataset.
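      The precision estimate described above (STD/mean per cell across the frames of a movie) can be sketched as follows. This is a minimal illustration, not the analysis code used for the manuscript: the function names, the use of the population standard deviation, and the short trace of volumes are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def coefficient_of_variation(volumes):
    """Return STD/mean, in percent, for repeated volume measures of one cell.

    Assumption: the population standard deviation is used; the response
    does not state which estimator was applied.
    """
    return 100.0 * pstdev(volumes) / mean(volumes)


def block_average(volumes, block):
    """Average consecutive measures in non-overlapping blocks of `block` frames."""
    return [
        mean(volumes[i:i + block])
        for i in range(0, len(volumes) - block + 1, block)
    ]


# Hypothetical volume trace for a single cell (arbitrary units), NOT real data.
trace = [2000.0, 2020.0, 1980.0, 2010.0, 1990.0, 2000.0, 2030.0, 1970.0]

cv_raw = coefficient_of_variation(trace)
cv_averaged = coefficient_of_variation(block_average(trace, 2))

print(f"CV of raw trace: {cv_raw:.2f}%")
print(f"CV after averaging pairs of frames: {cv_averaged:.2f}%")
```

      Averaging consecutive frames trades time resolution for lower noise, which is why it is only useful when the volume is not expected to change over the averaging window.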

      We have included these results in the revised manuscript, since they might help the reader to estimate what can be obtained from this method of volume measurement. We also point the reviewer to previous research articles using this method and showing both population averages and time-lapse data (refs. 2–8). Another validation of our volume measurement method comes from the relative volume changes in response to osmotic shock (Ponder’s relation) measured with FXm, which gave results very similar to the numbers of previously published studies. We actually performed these experiments to validate our method, since the results are not novel.

      2) The role of cell active contraction (myosin dynamics) is completely neglected. The membrane tether tension results, LatA and Y-compound results all indicate that there is a large influence of myosin contraction during cell spreading. I think most would not be surprised by this. But the model has no contribution from cortical/cytoskeletal active stress. The authors are correct that the osmotic pressure is much larger than hydraulic pressure, which is related to active contraction. But near steady state volume, the osmotic pressure difference must be equal to hydraulic pressure difference, as demanded by thermodynamics. Therefore, near equilibrium they must be close to each other in magnitude. During cell spreading, water dynamics is near equilibrium (given the magnitude of volume change), and therefore is it conceptually correct to neglect myosin active contraction? BTW, 1 solute model does not imply equal osmolarity between cytoplasm and external media. 1 solute model with active contraction was considered before, e.g., ref. 17 and Tao, et al, Biophys. J. 2015, and the steady state solution gives hydraulic pressure difference equal to osmotic pressure difference.

      This is an excellent point raised by the referee. We have two types of answers for this. First an answer from an experimental point of view, which shows that acto-myosin contractility does not seem to play a direct role in the control of the cell volume, at least in the cells we used here. Based on these results we then propose a theoretical reason why this is the case. It contrasts with the view proposed in the articles mentioned by the referee for a reason which is not coming from the physical principles, with which we fully agree, but from the actual numbers, available in the literature, of the amount of the various types of osmolytes inside the cell. We give these points in more details below and we hope they will convince the referee. We also now mention them explicitly in the main text of the article (p. 6-7, Figure S3F) and in the Supplementary file with the model.

      A. Experimental results

      To test the effect of acto-myosin contraction on cell volume, we performed two experiments:

      1) We measured the volume of the same cell before and after treatment with the Rho kinase (ROCK) inhibitor Y-27632, which decreases cortical contractility. The experiment was performed on cells plated on poly-L-lysine (PLL), as in the osmotic shock experiments. On this substrate cells adhere, allowing the change of solution, but do not spread and remain rounded. This allowed us to evaluate the effect of the drug. Cells were plated on PLL-coated glass. The change of medium itself (with control medium) induced a change of volume of less than 2%, similar to control osmotic shock experiments (maybe due to shear stress). When the cells were treated with Y-27, the change of volume was similar to the change with the control medium (now commented in the text p. 6-7, Figure S3F). To make the analysis more complete, we distinguished the cells that remained round throughout the experiment from the cells which slightly spread, since spreading could have an effect on volume. Indeed, we observed that treatment with Y-27 induced more cells to spread (Figure S3F), probably because the cortex was under less tension, allowing the adhesive forces on PLL to induce more spreading (ref. 9). Nevertheless, the spreading remained rather slow and the volume change of cells treated or not with Y-27 was not significantly different. This shows that, in the absence of fast spreading induced by Y-27, the reduction of contractility per se does not have any effect on the cell volume.

      Graphs showing proportion of cells that spread during the experiments (left); average relative volume of round (middle) and spread (right) control (N=3, n=77) and Y-27 treated cells (N=4, n=297).

      2) To evaluate the impact of a reduction of contractility in the total absence of adhesion, we measured the average volume of control cells versus cells which have been pretreated with Y-27, plated on a non-adhesive substrate (PLL-PEG treatment). This experiment showed that the volume of the cells evolved similarly in time for both conditions, proving that contractility per se has no effect on the cell volume or cell growth, in the absence of spreading.

      Graphs showing average relative volume of control (N=5, n=354) and Y-27 (N=3, n=292) treated cells plated on PLL-PEG (left); distributions of initial volume for control (middle) and Y-27 treated cells (right) represented on the left graph.

      Taken together these results show that inhibition of contractility per se does not significantly affect cell volume. It thus confirms our interpretation of our results on cell spreading that reduction of contractility has an effect on cell volume, specifically in the context of cell spreading, primarily because it affects the spreading speed.

      B. Theoretical interpretation

      In accordance with our experiments, in our model, the effect of contractility is implicitly included in the model because it modulates the spreading dynamics, which is an input to the model, i.e. through the parameters tau_a and A_0.

      We do not include the effect of contractility directly in the water transport equation because our quantitative estimates support that the contribution of the hydrostatic pressure to the volume (or the volume change) is negligible in comparison to the osmotic pressure, even for small variations near the steady-state volume. The main point is that the concentration of ions inside the cell is actually much lower than outside of the cell (refs. 10, 11). The difference is about 100 mM and corresponds mostly to nonionic small trapped osmolytes, such as metabolites (ref. 12). The osmotic pressure corresponding to this is about 10^5 Pa. Taking the cortical tension to be of order 1 mN/m and the cell size to be about ten microns, we get a hydrostatic pressure difference of about 100 Pa due to cortical tension. A significant change in cell volume, of the order observed during cell spreading (let’s consider a ten percent decrease), will increase the osmotic pressure of the trapped nonionic osmolytes by 10^4 Pa (their number in the cell remaining identical). For this osmotic pressure to be balanced by an increase in the hydrostatic pressure, the cortical tension would need to increase by a factor of 100, which we consider to be unrealistic. Therefore, we find it reasonable to ignore the contribution of the hydrostatic pressure difference in the water flux equation. It is also consistent with the novel experiments presented above, which show that inhibition of cortical contractility changes the cell's volume by less than what can be detected by our measures (thus likely at maximum in the 1% range). This is now explained in the main text and Supplementary file.
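      The order-of-magnitude argument above can be checked with a few lines of arithmetic. The sketch below assumes the van 't Hoff relation for the osmotic pressure and Laplace's law for a sphere; the temperature of 310 K and the reading of "cell size about ten microns" as a radius of 10 µm are assumptions made for the illustration, and the function names are hypothetical.

```python
R_GAS = 8.314        # gas constant, J/(mol K)
TEMPERATURE = 310.0  # K; assumed (about 37 degrees C), not stated in the text


def osmotic_pressure(conc_mM, temperature=TEMPERATURE):
    """van 't Hoff osmotic pressure in Pa; 1 mM equals 1 mol/m^3."""
    return conc_mM * R_GAS * temperature


def laplace_pressure(tension, radius):
    """Hydrostatic pressure difference in Pa for a sphere: 2 * tension / radius."""
    return 2.0 * tension / radius


# Trapped nonionic osmolytes: about 100 mM (value quoted in the text).
trapped = osmotic_pressure(100.0)

# Cortical tension about 1 mN/m; radius of 10 microns is an assumed reading.
hydrostatic = laplace_pressure(1e-3, 10e-6)

# A 10% volume decrease at fixed osmolyte number concentrates them by 1/0.9.
delta_osmotic = trapped * (1.0 / 0.9 - 1.0)

# Fold increase in tension needed for hydrostatic pressure to balance it.
fold_increase = delta_osmotic / hydrostatic

print(f"osmotic pressure of trapped osmolytes: {trapped:.2e} Pa")
print(f"hydrostatic pressure from tension:     {hydrostatic:.0f} Pa")
print(f"osmotic change for 10% volume loss:    {delta_osmotic:.2e} Pa")
print(f"required fold increase in tension:     {fold_increase:.0f}")
```

      With these assumed inputs the numbers land on the same orders of magnitude as quoted in the text (10^5 Pa, 10^2 Pa, 10^4 Pa, and a roughly 100-fold tension increase); choosing a 5 µm radius instead would change the prefactors but not the conclusion.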

      Regarding our minimal model required to define cell volume, the reason why we believe a one-solute model is not sufficient is fundamentally the same as above: the concentration of trapped osmolytes is comparable to the total osmolarity, which means that their contribution to the total osmotic pressure cannot be discarded. Secondly, within the simplest one-solute model, the pump and leak dynamics fixes the inner osmolyte concentration but does not involve the actual cell size. The most natural term that depends on the size is the Laplace pressure (inversely proportional to the cell size in a spherical cell model). But as discussed above, this term may only permit osmotic pressure differences of the order of 100 Pa, corresponding to an osmolyte concentration difference of the order of 0.1 mM. That is only a tiny fraction of the external medium osmolarity, which is about 300 mM. Such a model could thus only work for extremely fine tuning of the pump and leak rates to values with less than about 1% variation. Furthermore, such a model could not explain finite volume changes upon osmotic shocks without involving huge (100-fold) cell surface tension variations, as discussed above. For these reasons, we believe that the one-solute model is not appropriate to describe our experiments, and we feel that a trapped population of nonionic osmolytes is needed to balance the osmolarity difference created by the solute pump and leak.
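      The conversion between a 100 Pa pressure difference and the equivalent osmolyte concentration difference can be sketched in the same way. As before, this assumes the van 't Hoff relation and a temperature of 310 K, neither of which is stated explicitly in the response; the function name is hypothetical.

```python
R_GAS = 8.314        # gas constant, J/(mol K)
TEMPERATURE = 310.0  # K; assumed


def concentration_from_pressure(pressure_pa, temperature=TEMPERATURE):
    """Concentration difference in mM (mol/m^3) that balances `pressure_pa`."""
    return pressure_pa / (R_GAS * temperature)


# Laplace pressure of order 100 Pa (value quoted in the text).
delta_c = concentration_from_pressure(100.0)

# Fraction of the external osmolarity, about 300 mM (value quoted in the text).
fraction = delta_c / 300.0

print(f"equivalent concentration difference: {delta_c:.3f} mM")
print(f"fraction of external osmolarity:     {fraction:.1e}")
```

      The result is a few hundredths of a millimolar, i.e. of the order of 0.1 mM or below, which is why the pump and leak rates would have to be tuned so finely in a one-solute model.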

      In the revised version of the manuscript, we have now added a section in Supplementary file and in the main text, explaining in more detail this approximation.

      3) The authors considered the role of Na, K, and Cl in the model, and used pharmacological inhibitors of NHE exchanger. I think this part of the experiments and model are somewhat weak. I am not sure the conclusions drawn are robust. First there are many ion channels/pumps in regulating Na, K and Cl. The most important of which is NaK exchanger. NHE also involves H, and this is not in the model. The ion flux expressions in the model are also problematic. The authors correctly includes voltage and concentration dependences, but used a constant active term S_i in SM eq. 3 for active pumping. I am not sure this is correct. Ion pump fluxes have been studied and proposed expressions based on experimental data exist. A study of Na, K, Cl dynamics, and membrane voltage on cell volume dynamics was published in Yellen et al, Biophys. J. 2018. In that paper, they used different expressions based on previously proposed flux expressions. It might be correct that in small concentration differences, their expressions can be linearized or approximated to achieve similar expressions as here. But this point should be considered more carefully.

      We thank the reviewer for this comment. Indeed, we had not sufficiently justified our use of the NHE inhibitor EIPA. Our aim was not to directly affect the major ion pumps involved in volume regulation (which would indeed rather be the Na+/K+ exchanger), because that would likely strongly impact the initial volume of the cell and not only the volume response to spreading, making the interpretation more difficult. We based our choice on previous publications (e.g. ref. 13) showing that EIPA inhibited the main fast volume changes previously reported for cultured cells: it was shown to inhibit volume loss in spreading cells, as well as mitotic cell swelling (refs. 14, 15). Using EIPA, we also found that, while the initial volume was only slightly affected, the volume loss was completely abolished even in fast spreading cells (Y-27 and EIPA combined treatment, Figure S5H). This clearly proves that the volume loss behavior can be abolished without changing the speed of spreading, which was our main aim with this experiment.

      The most direct effect of inhibiting NHE exchangers is to change the cell pH (refs. 16, 17), which, given the low number of protons in the cell (negligible contribution to the cell's osmotic pressure), cannot affect the cell volume directly. Proton transport can have an indirect effect on cell volume through the effect of pH on ion transporters or through the coupling between NHE and the HCO3/Cl exchanger. The latter case is well studied in the literature (ref. 18). In brief, the flux of protons out of the cell through the NHE, driven by the Na gradient, leads to an outflux of HCO3 and an influx of Cl. The change in Cl concentration will have an effect on the osmolarity and cell volume.

      We thus performed hyperosmotic shocks with this drug and we found that, as expected, it had no effect on the immediate volume change (the Ponder’s relation), but affected the rate of volume recovery (combined with cell growth). Overall, the cells treated with EIPA showed a faster volume increase, which is what is expected if the active pumping rate is reduced. This is in contrast with the above-mentioned mechanism of volume regulation, which would lead to a reduced volume recovery of EIPA-treated cells. This leads us to conclude that there is potentially another effect of NHE perturbation. Changing the pH will have a large impact on the functioning of many other processes; in particular, it can have an effect on ion transport (ref. 16).

      On the model side, the referee correctly points out that there are many ion transporters that are known to play a role in volume regulation which are not included in Eq. 3. In the revised manuscript we now start with a more general ion transport equation. We show that the main equation (Eq.1 - or Supplementary file Eq.13) relating volume change to tension is not affected by this generalization. This is because we consider only the linear relation between the small changes in volume and tension. We note that the generic description of the PLM (Supplementary file Eqs.1-6) can be seen as general and does not require the pump and channel rates to be constant; both \Lambda_i and S_i can be a function of potential and ion concentration along with membrane tension. It is only later in the analysis that we do make the assumption that these parameters only depend on tension. This point is now made clear in the Supplementary file.

      There is a huge body of work both theoretical and experimental in which the effect of different ion transporters on cell volume is analyzed. The aim of this work is not to provide an analysis of cell volume and the effect of various co-transporters but is rather limited to understanding the coupling between cell spreading, surface tension and cell volume.

      To analytically estimate the sign of the mechano-osmotic coupling parameter alpha, we use a minimal model. For this we indeed take the pumps and channels to be constant. As it is again a perturbative expansion around the steady state concentration, electric potential, and volume, the expression of alpha can be easily computed for a model with more general ion transporters. This generalization will come at the cost of additional parameters in the alpha expression. We decided to keep the simpler transport model because the goal of this estimate is merely to show that the sign of alpha is not a given and depends on the relative values of parameters. Even for the simple model we present, the sign of alpha could be changed by varying parameters within reasonable ranges.

      Given these points, and the clarification of the reasons to use EIPA in our experiments, a full mechanistic explanation of the effect of this drug is beyond the scope of this work. Because of this we are not analyzing the effect of EIPA on the model parameter alpha in detail. We now clarified our interpretation of these results in the main text of the article.

      Reviewer #2:

      The work by Venkova et al. addresses the role of plasma membrane tension in cell volume regulation. The authors study how different processes that exert mechanical stress on cells affect cell volume regulation, including cell spreading, cell confinement and osmotic shock experiments. They use live cell imaging, FXm (cell volume) and AFM measurements and perform a comparative approach using different cell lines. As a key result the authors find that volume regulation is associated with cell spreading rate rather than absolute spreading area. Pharmacological assays further identified Arp2/3 and NHE1 as molecular regulators of volume loss during cell spreading. The authors present a modified mechano-osmotic pump and leak model (PLM) based on the assumption of a mechanosensitive regulation of ion flux that controls cell volume.

      This work presents interesting data and theoretical modelling that contribute new insight into the mechanisms of cell volume regulation.

      We thank the referee for the nice comments on our work. We really appreciate the effort (s)he made to help us improve our article, including the careful inspection of the figures. We think our work is much improved thanks to his/her input.

      Reviewer #3:

      The study by Venkova and co-workers studies the coupling between cell volume and the osmotic balance of the cell. Of course, a lot of work has already been done on this subject, but the main specific contribution of this work is to study the fast dynamics of volume changes after several types of perturbations (osmotic shocks, cell spreading, and cell compression). The combination of volume dynamics at very high time resolution, and the robust fits obtained from an adapted Pump and Leak Model (PLM) makes the article a step forward in our understanding of how cell volume is regulated during cell deformations. The authors clearly show that:

      -The rate at which cell deforms directly impacts the volume change

      -Below a certain deformation rate (either by cell spreading or external compression), the cells adapt fast enough not to change their volume. The plot dV/dt vs dA/dt shows a clear proportionality relation.

      -The theoretical description of volume change dynamics with the extended PLM makes the overall conclusions very solid.

      Overall the paper is very well written, contains an impressive amount of quantitative data, comparing several cell types and physiological and artificial conditions.

      We thank the referee for the positive comment on our work.

      My main concern about this study is related to the role of membrane tension. In the PLM model, the coupling of cell osmosis to cell deformation is made through the membrane-tension dependent activity of ion channels. While the role of ion channels is extensively tested, it brings some surprising results. Moreover, the tension is measured only at fixed time points, and the comparison to theoretical predictions is not always as convincing as expected: when comparing fig 6I and 6J, I see that predictions shows that EIPA (+ or - Y27), CK-666 (+ or - Y27) and Y27 alone should have lower tension than in the control conditions, and this is clearly not the case in fig 6J. But I would not like to emphasize too much on those discrepancies, as the drugs in the real case must have broad effects that may not be directly comparable to the theory.

      We apologize for the mislabeling of the Figure 6I (now Figure 5I). This plot shows the theoretical estimate for the difference in tension (in the units of homeostatic tension) between the case when the cell loses its volume upon spreading (as observed in experiments) compared to the hypothetical situation when the cell does not lose volume upon spreading (alpha = 0). The positive value of the tension difference predicts that the cell tension would have been higher if the cell were not losing volume upon spreading, which is the case for the treatments with EIPA and CK-666 (+ Y27) and corresponds to what we found experimentally.

      It thus matches our experimental observations for drug treatments which reduce or abolish the volume loss during spreading and correspond to higher tether force only at short time.

      We have corrected the figure and figure legend and explained it better in the text.

      But I wonder if the authors would have a better time showing that the dynamics of tension are as predicted by theory in the first place, as comparing theoretical predictions with experiments using drugs with pleiotropic effects may be hazardous.

      Actually, a recent publication (https://doi.org/10.1101/2021.01.22.427801) shows that tension follows volume changes during osmotic shocks, and overall find the same dynamics of volume changes than in this manuscript. I am thus wondering if the authors could use the same technique than describe in this paper (FLIM of flipper probe) in order to study the dynamics of tension in their system, or at least refer to this paper in order to support their claim that tension is the coupling factor between volume and deformation.

      As was suggested by the referee, we tried to use the FLIPPER probe. We first tried to reproduce osmotic shock experiments by adding to the HeLa cells 4% of PEG400 (+~200 mOsm) or 50% of H2O (-~170 mOsm) and measuring the average probe lifetime before and after the shock. We found a significantly lower probe lifetime for the hyperosmotic condition compared with control, and a non-significant, but slightly higher, lifetime for the hypoosmotic shock. The magnitude of lifetime changes was comparable with the study cited by the reviewer, but the quality of our measures did not allow us to have a better resolution. Next we measured the average lifetime for control and CK-666+Y-27 treated cells 30 min and 3 h after plating, because we have the highest tether force values for CK-666+Y-27 at 30 min. We did not see a change in lifetime in control cells between 30 min and 3 h (which we also did not see with tether pulling). Cells treated with CK-666+Y-27 showed slightly lower lifetime values than control cells, but at both 30 min and 3 h after plating, which means that it did not correspond to the transient effect of fast spreading but probably rather to the effect of the drugs on the measure.

      Graph showing FLIPPER lifetime before and after osmotic shock for HeLa cells plated on PLL-coated substrate. Left: control (N=3, n=119) and hyperosmotic shock (N=3, n=115); Right: control (N=3, n=101) and hypoosmotic shock (N=3, n=80). p-values were obtained by t-test.

      Graph showing FLIPPER lifetime for control just after plating on PLL-coated glass (the same control data as shown in the previous graph), 30 min (control: N=3, n=88; Y-27+CK-666: N=3, n=130) and 3 h (control: N=3, n=78; Y-27+CK-666: N=3, n=142) after plating on fibronectin-coated glass. p-values were obtained by t-test.

      Because the cell to cell variability might mask the trend of single cell changes in lifetime during spreading, we also tried to follow the lifetime of individual cells every 5 min along the spreading. Most illuminated cells did not spread, while cells in non-illuminated fields of view spread well, suggesting that even with an image every 5 minutes and the lowest possible illumination, the imaging was too toxic to follow cell spreading in time. We could obtain measures for a few cells, which did not show any particular trend, but their spreading was not normal. So we cannot really conclude much from these experiments.

      Graph showing FLIPPER lifetime changes for 3 individual cells plated on fibronectin-coated glass (shown in blue, magenta and green) and average lifetime of cells from non-illuminated field (cyan, n=7)

      Our conclusions are the following:

      1) We are able to visualize some change in the lifetime of the probe for osmotic shock experiments, similar to the published results, but with a rather large cell to cell variability.

      2) The spreading experiments comparing 30 minutes and 3 hours, in control or drug treated cells did not reproduce the results we observed with tether pulling, with a global effect of the drugs on the measures at both 30 min and 3 hours.

      3) Following single cells in time led to too much toxicity and prevented normal spreading.

      We think that this technology, which is still in its early developments, especially in terms of the microscope setting that has to be used (and we do not have it in our Institute, so we had to go on a platform in another institute with limited time to experiment), cannot be implemented in the frame of the revision of this article to provide reliable results. We thus consider that these experiments are for further development of the work and are out of the scope of this study. It would be very interesting to study in details the comparison between the oldest and more established method of tether pulling and the novel method of the FLIPPER probe, during cell spreading and in other contexts. To our knowledge this has never been done so far, so it is not in the frame of this study that we can do it. It is not clear from the literature that the two methods would measure the same thing in all conditions even if they might match in some.

    1. Author Response

      Reviewer #1 (Public Review):

      “A sample size of 3 idiopathic seems underpowered relative to the many types of genetic changes that can occur in ASD. Since the authors carried out WGS, it would be useful to know what potential causative variants were found in these 3 individuals and even if not overlapping if they might expect to be in a similar biological pathway.

      If the authors randomly selected 3 more idiopathic cell lines from individuals with autism, would these cell lines also have altered mTOR signaling? And could a line have the same cell biology defects without a change in mTOR signaling? The authors argue that the sample size could be the reason for lack of overlap of the proteomic changes (unlike the phosphor-proteomic overlaps), which makes the overlapping cell biology findings even more remarkable. Or is the phenotyping simply too crude to know if the phenotypes truly are the same?”

      We appreciate these thoughtful comments and also agree that of several models, our studies indicate the possibility of mTOR alteration in multiple forms of ASD. As above, we are currently pursuing this hypothesis with newly acquired DOD support. With regard to the I-ASD population, we agree that there are a large variety of genetic changes that can occur in genetically undefined ASDs. Indeed, this is precisely why we expected to see “personalized” phenotypes in each I-ASD individual when we embarked on this study. At that time, several years ago, we had planned to expand the analyses to more I-ASD individuals to assess for additional personalized phenotypes. However, as our studies progressed, we were surprised to find convergence in our I-ASD population in terms of neurite outgrowth and migration and later proteomic results showing convergence in mTOR. We found it particularly remarkable that despite a sample size of 3 that this convergence was noted. When we had the opportunity to extend our studies to the 16p11.2 deletion population, we were thrilled to conduct the first comparison between I-ASD and a genetically defined ASD and, as such, the scope of the paper turned towards this comparison. We do agree that analyses of the other I-ASD individuals would be a beneficial endeavor, both to understand how pervasive NPC migration and neurite deficits are in autism and to assess the presence of mTOR dysregulation. Furthermore, it would be important to see whether alterations in other pathways could also lead to similar cell biological deficits, though we know that other studies of neurodevelopmental disorders have found such cellular dysregulations without reporting concurrent mTOR dysregulation. Given our current grant funding to extend these analyses, such experiments within this manuscript would not be feasible.

      Regarding the phenotyping methods used, we decided to assess neurite outgrowth and migration as they are both cytoskeleton dependent processes that are critical for neurodevelopment and are often regulated by the same genes. Furthermore, similar analyses have been applied to Fragile-X Syndrome, 22q11.2 deletion syndrome, and schizophrenia NPCs (Shcheglovitov A. et al., 2013; Mor-Shaked H. et al., 2016; Urbach A. et al., 2010; Kelley D. J. et al., 2008; Doers M. E. et al., 2014; Brennand K. et al., 2015; Lee I. S. et al., 2015; Marchetto M. C. et al., 2011). As such, it seems that multiple underlying etiologies can lead to similar dysregulated cellular phenotypes that can contribute to a variety of neurodevelopmental disorders. On a more global level, there are only a few different cellular functions a developing neuron can undergo, and these include processes such as proliferation, survival, migration, and differentiation. Thus, to understand neurodevelopmental disorders, it is important to study the more “crude” or “global” cellular functions occurring during neurodevelopment to determine whether they are disrupted in disorders such as ASD. In our studies we find that there are indeed dysregulations in many of these basic developmental processes, indicating that the typical steps that occur for normal brain cytoarchitecture may be disrupted in ASD. To understand why, we then further utilized molecular studies to “zoom” in on potential mechanisms which implicated common dysregulation in mTOR signaling as one driver for these common cellular phenotypes. As suggested, we did complete WGS on all the I-ASD individuals and did not see any overlapping genetic variants between the three I-ASD individuals as mentioned in our manuscript. The genetic data was published in a larger manuscript incorporating the data (Zhou A. et al., 2023). 
However, there were variants that were unique to each I-ASD individual which were not seen in their unaffected family members, and it is possible these variants could be contributing to the I-ASD phenotypes. We also utilized IPA to conduct pathway analysis on the WGS data utilizing the same approach we did in analysis of p-proteome and proteome data. From WGS data, we selected high read-quality variants that were found only in I-ASD individuals and had a functional impact on protein (i.e., excluding synonymous variants). The enriched pathways obtained from this data were strikingly different from the pathways we found in the p-proteome analysis and are now included in supplemental Figure 6 in the manuscript. Briefly, the top 5 enriched pathways were: O-linked glycosylation, MHC class 1 signaling, Interleukin signaling, Antigen presentation, and regulation of transcription.

      Reviewer #2 (Public Review):

      1) I found that interpreting how differential EF sensitivity is connected to the rest of the story was difficult at times. First, it is unclear why these extracellular factors were picked. These are seemingly different in nature (a neuropeptide, a growth factor and a neuromodulator) targeting largely different pathways. This limits the interpretation of the ASD subtype-specific rescue results. One way of reframing that could help is that these are pro-migratory factors instead of EFs broadly defined that fail to promote migration in I-ASD lines due to a shared malfunctioning of the intracellular migration machinery or cell-cell interactions (possibly through tight junction signaling, Fig S2A). Yet, this doesn't explain the migration/neurite phenotypes in 16p11 lines where EF sensitivity is not altered, overall implying that divergent EF sensitivity is independent of underlying mTOR state. What is the proposed model that connects all three findings (divergent EF sensitivity based on ASD subtypes, 2 mTOR classes, convergent cellular phenotypes)?

      We thank you for the kind assessment of our manuscript and for the thought-provoking questions posed. In terms of extracellular factors, for our study, we defined an extracellular factor as any growth factor, amino acid, neurotransmitter, or neuropeptide found in the extracellular environment of the developing cells. The EFs utilized were selected due to their well-established role in regulation of early neurodevelopmental phenotypes, their expression during the “critical window” of mid-fetal development (as determined by Allen Brain Atlas), and in the case of 5-HT, its association with ASD (Abdulamir H. A. et al., 2018; Adamsen D. et al., 2014; Bonnin A. et al., 2011; Bonnin A. et al., 2007; Chen X. et al., 2015; El Marroun H. et al., 2014; Hammock E. et al., 2012; Yang C. J. et al., 2014; Dicicco-Bloom E. et al., 1998; Lu N. et al., 1998; Suh J. et al., 2001; Watanabe J. et al., 2016; Gilmore J. H. et al., 2003; Maisonpierre P. C. et al., 1990; Dincel N. et al., 2013; Levi-Montalcini R., 1987). Lastly, prior experiments in our lab with a mouse model of neurodevelopmental disorders had shown atypical responses to EFs (IGF-1, FGF, PACAP). As such, when we first chose to use EFs in human NPCs we wanted to know 1) whether human NPCs even responded to these EFs, 2) whether EFs regulated neurite outgrowth and migration and 3) whether there would be a differential response in NPCs derived from those with ASD. Our studies were initiated on the I-ASD cohort and given the heterogeneity of ASD we had hypothesized we would get “personalized” neurite and migration phenotypes. For this reason, we also wanted to select multiple types of EFs that worked on different signaling pathways. Ultimately, instead of personalized phenotypes we found that none of the I-ASD NPCs responded to any of the EFs tested whereas the 16p11.2 deletion NPCs did – this was therefore the only difference we found between these two “forms” of ASD. 
As noted, in I-ASD the lack of response to EFs can be ameliorated by modulating mTOR. However, in the 16p11.2 deletion, despite similar mTOR dysregulation as seen in I-ASD, there is no EF impairment. We do not have a cohesive model to explain why the 16pDel individuals differ from the I-ASD model other than to point to the p-proteomes which do show that the 16pDel NPCs are distinct from the I-ASD NPCs. It seems that mTOR alteration can contribute to impaired EF responsiveness in some NPCs but perhaps there is an additional defect that needs to be present in order for this defect to manifest, or that 16p11.2 deletion NPCs have specific compensatory features. For example, as noted in the thoughtful comment, the p-proteome canonical pathway analysis shows tight junction malfunction in I-ASD which is not present in the 16pDel NPCs and it could be the combination of mTOR dysregulation + dysregulated tight junction signaling that has led to lack of response to EFs in I-ASD. Regardless, we do not think the differences between two genetically distinct ASDs diminish the convergent mTOR results we have uncovered. That is, regardless of whatever defects are present in the ASD NPCs, we are able to rescue them with mTOR modulation, which has fascinating implications for treatment and conceptualization of ASD. Lastly, we see our EF studies as an important inclusion as they show that in some subtypes of ASD, lack of response to appropriate EFs could be contributing to neurodevelopmental abnormalities. Moreover, lack of response to these EFs could have implications for treatment of individuals with ASD (for example, SSRIs are commonly used to treat co-morbid conditions in ASD but if an individual is unresponsive to 5-HT, perhaps this treatment is less effective). We have edited the manuscript to include an additional discussion section to address the EFs more thoroughly and have included a few extra sentences in the introduction as well!

      2) A similar bidirectional migration phenotype has been described in hiPSC-derived human cortical interneurons generated from individuals with Timothy Syndrome (Birey et al 2022, Cell Stem Cell). Here, authors show that the intracellular calcium influx that is excessive in Timothy Syndrome or pharmacologically dampened in controls results in similar migration phenotypes. Authors can consider referring to this report in support of the idea that bimodal perturbations of cardinal signaling pathways can converge upon common cellular migration deficits.

      We thank you for pointing out the similar migration phenotype in the Timothy Syndrome paper and have now cited it in our manuscript. We have also expanded on the concept of “too much or too little” of a particular signaling mechanism leading to common outcomes.

      3) Given that authors have access to 8 I-ASD hiPSC lines, it'd be very informative to assay the mTOR state (e.g. pS6 westerns) in NPCs derived from all 8 lines instead of the 3 presented, even without assessing any additional cellular phenotypes, which authors have shown to be robust and consistent. This can help the readers better get a sense of the proportion of high-mTOR vs low-mTOR classes in a larger cohort.

      We have already addressed this in response to reviewer 1 and the essential revisions section, providing our reasoning for not expanding the study to all 8 I-ASD individuals.

      4) Does the mTOR modulation rescue EF-specific responses to migration as well (Figure 7)?

      We did not conduct sufficient replicates of the rescue of EF-specific migration responses due to the time-consuming and resource-intensive nature of the neurosphere experiments. Unlike the neurite experiments, the neurosphere experiments require significantly more cells, more time, selection of neurospheres based on a size criterion, and then manual trace measurements. We did one experiment in Family-1 where we utilized MK-2206 to abolish the response of Sib NPCs to PACAP. Likewise, adding SC-79 to I-ASD-1 neurospheres allowed for response to PACAP.

      Author response image 1.

      Author response image 2.

      Reviewer #3: Public Review

      We appreciate the kind, detailed and very thorough review you provided for us!

      The results on the mTOR signaling pathway as a point of convergence in these particular ASD subtypes is interesting, but the discussion should address that this has been demonstrated for other autism syndromes, and in the present manuscript, there should be some recognition that other signaling pathways are also implicated as common factors between the ASD subtypes.

      With regard to the mTOR pathway, we had included the other ASD syndromes in which mTOR dysregulation has been seen, including tuberous sclerosis, Cowden Syndrome, NF-1, as well as Fragile-X, Angelman, Rett and Phelan McDermid, in the final paragraph of the discussion section “mTOR Signaling as a Point of Convergence in ASD”. We have now expanded our discussion to note that other signaling pathways, such as MAPK, cyclins, WNT, and reelin, have also been implicated as common factors between the ASD subtypes.

      The conclusions of this paper are mostly well supported by data, but for the cell migration assay, it is not clear if the authors control for initial differences in the inner cell mass area of the neurospheres in control vs ASD samples, which would affect the measurement of migration.

      Thank you for this thoughtful comment! When we first started collecting our migration data, inner cell mass size was indeed a major concern for which we controlled in our methods. First, when plating the neurospheres, we would only collect spheres when a majority of spheres were approximately 100 µm in diameter. Very large spheres often could not be imaged due to being out of focus and very small spheres would often disperse when plated. Thus, there were some constraints on the variability of inner cell mass size.

      Furthermore, when we initially collected data, we conducted a proof-of-principle test to see whether initial inner cell mass area (henceforth referred to as initial sphere size, or ISS) influenced migration data. To do so, we obtained migration and ISS data from each diagnosis (Sib, NIH, I-ASD, 16pASD). We then used RStudio to test for a relationship between Migration and ISS within each diagnosis category using the call lm(Migration ~ ISS, data = bydiagnosis). In this call, lm fits a linear model, the tilde (~) specifies that Migration is modeled as a function of ISS, and data = bydiagnosis supplies the data organized by diagnosis.

      The results were expressed as R-squared values indicating the correlation between ISS and Migration for each diagnosis and the p-value showing statistical significance for each comparison. As shown in Author response table 1, there is minimal correlation between Migration and ISS in each data set. Moreover, there are no statistically significant relationships between Migration and ISS, indicating that initial sphere size does not influence migration data in any of our data sets.
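      For readers who want to see the shape of this check, the following is a minimal sketch of a per-diagnosis regression of Migration on ISS, written in Python rather than R. It is not the analysis code used for the table: the function names and every number in `rows` are made up for illustration, and it reports only R-squared, not the p-values described above.

```python
from statistics import mean


def r_squared(x, y):
    """R-squared of the simple least-squares regression y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # For one predictor, R-squared equals the squared Pearson correlation.
    return sxy ** 2 / (sxx * syy)


def r_squared_by_group(rows):
    """rows: iterable of (diagnosis, iss, migration) tuples."""
    groups = {}
    for diagnosis, iss, migration in rows:
        xs, ys = groups.setdefault(diagnosis, ([], []))
        xs.append(iss)
        ys.append(migration)
    return {d: r_squared(xs, ys) for d, (xs, ys) in groups.items()}


# Hypothetical values, for illustration only (not data from the study).
rows = [
    ("Sib", 7000, 40000), ("Sib", 7500, 43000),
    ("Sib", 8000, 39000), ("Sib", 8500, 42000),
    ("I-ASD", 7100, 31000), ("I-ASD", 7600, 29000),
    ("I-ASD", 8100, 32000), ("I-ASD", 8400, 30000),
]

for diagnosis, r2 in r_squared_by_group(rows).items():
    print(f"{diagnosis}: R^2 = {r2:.3f}")
```

      An R-squared near zero within a group is what one would expect if initial sphere size carries little information about migration in that group.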

      Author response table 1.

      Lastly, utilizing R, we modeled what predicted migration would be like for Sib, NIH, I-ASD, and 16pASD if we accounted for ISS in each group. Raw migration data was then plotted against the predicted data as in Author response image 3.

      Author response image 3.

      As shown in the graph, there are no statistical differences between the raw migration data (the data that we actually measured in the dish) and the modeled data in which ISS is accounted for as a variable. As such, we chose not to normalize to or account for ISS in our other experiments. We have now included the above RStudio analyses in our supplemental figures (Figure S1) as well.

      Also, in Fig 5 and 6, panels I and J omit the effects of drug on mTOR phosphorylation as shown for other conditions.

      Both SC-79 and MK-2206 were selected in our experiments after thorough analysis of their effects on human epithelial cells and other cultured cells (citations in manuscript). However, initially, we did not know whether either of these drugs would modulate the mTOR pathway in human NPCs; thus, in Figures 5A, 5D, 6A and 6D we chose to focus on two of our data-sets to establish the effect of these drugs in human NPCs. Our experiments in Family-1 and Family-2 showed us that SC-79 increases PS6 in human NPCs while MK-2206 downregulates it. Once this was established, we expected the drugs to have similar effects in the NPCs from the other families. Thus, we only conducted a proof of principle test to confirm the drug does indeed have the intended effect in I-ASD-3 and 16pDel. We have included these proof of principle westerns in Figure 5I, 5K, 6I and 6K to show that the effects of these drugs are reproducible across all our NPC lines. We did not include quantification since the data is only from our single proof of principle western.

    1. Author Response

      Reviewer #1 (Public Review):

      Zhu et al. found that human participants could plan routes almost optimally in virtual mazes with varying complexity. They further used eye movements as a window to reveal the cognitive computations that may underlie such close-to-optimal performance. Participants’ eye movement patterns included: (1) Gazes were attracted to the most task-relevant transitions (effectively the bottleneck transitions) as well as to the goal, with the share of the former increasing with maze complexity; (2) Backward sweeps (gazes moving from goal to start) and forward sweeps (gazes from start to goal) respectively dominated the pre-movement and movement periods, especially in more complex mazes. The authors explained the first pattern as the consequence of efficient strategies of information collection (i.e., active sensing) and connected the second pattern to neural replays that relate to planning.

      The authors have provided a comprehensive analysis of the eye movement patterns associated with efficient navigation and route planning, which offers novel insights for the area through both their findings and methodology. Overall, the technical quality of the study is high. The "toggling" analysis, the characterization of forward and backward sweeps, and the modeling of observers with different gaze strategies are beautiful. The writing of the manuscript is also elegant.

      I do not see any weaknesses that cannot be addressed by extended data analysis or modeling. The following are two major concerns that I hope could be addressed.

      We thank the reviewer for their positive assessment of our work!

      First, the current eye movement analysis does not seem to have touched the core of planning: evaluating alternative trajectories to the goal. Instead, planning-focused analyses such as forward and backward sweeps were all about the actually executed trajectory. What may participants’ eye movements tell us about their evaluation of alternative trajectories?

      This is an important point that we previously overlooked because our experimental design did not incorporate mutually exclusive alternative trajectories. Nonetheless, there are many trials in which participants had access to several possible trajectories to the goal. Some of those alternatives may be trivially suboptimal (e.g. highly convoluted trajectory, taking a slightly curved instead of straight trajectory, or setting out on the wrong path and then turning back). Using two simple constraints described in the Methods (no cyclic paths, limited amount of overlap between alternatives), we algorithmically identified the number of non-trivial alternative trajectories (or options) on each trial that were comparable in length to the chosen trajectory (within about 1 standard deviation). A few examples are shown below for the reviewer.

      The more plausible trajectory options there were, the more time participants spent gazing upon these alternatives during both pre-movement and movement (Figure 4 – figure supplement 1D – left). This is not a trivial effect resulting from the increase in surface area comprising the alternative paths because the time spent looking at the chosen trajectory also increased with the number of alternatives (Figure S8D – middle). Instead, this suggests that participants might be deliberating between comparable options.

      Consistent with this, the likelihood of gazing at alternative trajectories peaked early on during pre-movement and well before performing sweeping eye movements (Figure 5D). During movement, the probability of gazing upon alternatives increased immediately before participants made a turn, suggesting that certain aspects of deliberation may also be carried out on the fly just before approaching choice points. Critically, during both pre-movement and movement epochs, the fraction of time spent looking at the goal location decreased with the number of alternatives (Figure 4 – figure supplement 1D – right), revealing a potential trade-off between deliberative processing and looking at the reward location. Future studies with more structured arena designs are needed to better understand the factors that lead to the selection of a particular trajectory among alternatives, and we mention this in the discussion (line 445):

      "Value-based decisions are known to involve lengthy deliberation between similar alternatives. Participants exhibited a greater tendency to deliberate between viable alternative trajectories at the expense of looking at the reward location. Likelihood of deliberation was especially high when approaching a turn, suggesting that some aspects of path planning could also be performed on the fly. More structured arena designs with carefully incorporated trajectory options could help shed light on how participants discover a near-optimal path among alternatives. However, we emphasize that deliberative processing accounted for less than one-fifth of the spatial variability in eye movements, such that planning largely involved searching for a viable trajectory."

      Second, what cognitive computations may underlie the observed patterns of eye movements has not received a thorough theoretical treatment. In particular, to explain why participants tended to fixate the bottleneck transitions, the authors hypothesized active sensing, that is, participants were collecting extra visual information to correct their internal model about the maze. Though active sensing is a possible explanation (as demonstrated by the authors’ modeling of "smart" observers), it is not necessarily the only or most parsimonious explanation. It is possible that their peripheral vision allowed participants to form a good-enough model about the maze and their eye movements solely reflect planning. In fact, that replays occur more often at bottleneck states is an emergent property of Mattar & Daw’s (2018) normative theory of neural replay. Forward and backward replays are also emergent properties of their theory. It might be possible to explain all the eye movement patterns (fixating the goal and the bottleneck transitions, and the forward and backward replays) based on Mattar & Daw’s theory in the framework of reinforcement learning. Of course, some additional assumptions that specify eye movements and their functional roles in reinforcement learning (e.g., fixating a location is similar to staying at the corresponding state) would be needed, analogous to those in the authors’ "smart" observer models. This unifying explanation may not only be more parsimonious than the authors’ active sensing plus planning account, but also be more consistent with the data than the latter. After all, if participants had used fixations to correct their internal model of the maze, they should not have shown so little improvement across trials in the same maze.

      We thank the reviewer for this reference. We note the strong parallels between our eye movement results and that study in the discussion, in addition to proposing experimental variations that will help crystallize the link. Below, we included our response that was incorporated into the Discussion section (beginning at line 462).

      "In [a] highly relevant theoretical work, Mattar and Daw proposed that path planning and structure learning are variants of the same operation, namely the spatiotemporal propagation of memory. The authors show that prioritization of reactivating memories about reward encounters and imminent choices depends upon its utility for future task performance. Through this formulation, the authors provided a normative explanation for the idiosyncrasies of forward and backward replay, the overrepresentation of reward locations and turning points in replayed trajectories, and many other experimental findings in the hippocampus literature. Given the parallels between eye movements and patterns of hippocampal activity, it is conceivable that gaze patterns can be parsimoniously explained as an outcome of such a prioritization scheme. But interpreting eye movements observed in our task in the context of the prioritization theory requires a few assumptions. First, we must assume that traversing a state space using vision yields information that has the same effect on the computation of utility as does information acquired through physical navigation. Second, peripheral vision allows participants to form a good model of the arena such that there is little need for active sensing. In other words, eye movements merely reflect memory access and have no computational role. Finally, long-term statistics of sweeps gradually evolve with exposure, similar to hippocampal replays. These assumptions can be tested in future studies by titrating the precise amount of visual information available to the participants, and by titrating their experience and characterizing gaze over longer exposures. We suspect that a pure prioritization-based account might be sufficient to explain eye movements in relatively uncluttered environments, whereas navigation in complex environments would engage mechanisms involving active inference. 
Developing an integrative model that features both prioritized memory-access as well as active sensing to refine the contents of memory, would facilitate further understanding of computations underlying sequential decision-making in the presence of uncertainty."

      In the original manuscript, we referred to active sensing and planning in order to ground our interpretation in terminology that has been established in previous works by other groups, which had investigated them in isolation. Although the role of active sensing could be limited, we are unable to conclude that eye movements solely reflect planning. Even if peripheral vision is sufficient to obtain a good-enough model of the environment, eye movements can further reduce uncertainty about the environment structure, especially in cluttered environments such as the complex arena used in this study. This reduction in uncertainty is not inconsistent with a lack of performance improvement across trials. This is because the lack of improvement could be explained by a failure to consolidate the information gathered by eye movements and propagate it across trials, an interpretation that would also explain why planning duration is stable across trials (Figure 2 – figure supplement 2B). Furthermore, participants gaze at alternative trajectories more frequently when more options are presented to them. However, we acknowledge that this is a fundamental question, and we have identified it as an important topic for follow-up studies and outlined experiments to delineate the precise extent to which eye movements reflect prioritized memory access vs active sensing. Briefly, we can reduce the contribution of active sensing by manipulating the amount of visual information – ranging from no information (navigating in the dark) to partial information (foveated rendering in VR headset). Likewise, we can increase the contribution of memory by manipulating the length of the experiment to ensure participants become fully familiar with the arena. Yet another manipulation is to use a fixed reward location for all trials such that experimental conditions would closely match the simulations of the prioritization model. We are excited about performing these follow-up experiments.

      Reviewer #2 (Public Review):

      In this study the authors sought to understand how the patterns of eye-movements that occur during navigation relate to the cognitive demands of navigating the current environment. To achieve this the authors developed a set of mazes with visible layouts that varied in complexity. Participants navigated these environments seated on a chair by moving in immersive virtual reality.

      The question of how eye-movements relate to cognitive demands during navigation is a central and often overlooked aspect of navigating an environment. Studying eye-movements in dynamic scenarios that enable systematic analysis is technically challenging, which is why so few studies have tackled this issue.

      The major strengths of this study are the technical development of the set up for studying, recording and analysing the eye-movements. The analysis is extensive and allows greater insight than most studies exploring eye-movements would provide. The manuscript is also well written and argued.

      A current weakness of the manuscript is that several other factors have not been considered that may relate to the eye-movements. More consideration of these would be important.

      We thank the reviewer for their positive assessment of the innovative aspects of this study. We have tried to address the weaknesses by performing additional analyses described below.

      1. In the experimental design it appears possible to separate the length of the optimal path from the complexity of the maze. But that appears not to have been done in this design. It would be useful for the authors to comment on this, as these two parameters seem critically important to the interpretation of the role of eye-movements - e.g. a lot of scanning might be required for an obvious, but long path, or a lot of scanning might be required to uncover a short path through a complex maze.

      This is a great point. We added a comment to the Discussion at line 489 to address this:

      "Future work could focus on designing more structured arenas to experimentally separate the effects of path length, number of subgoals, and environmental complexity on participants’ eye movement patterns."

      To make the most of our current design, we performed two analyses. First, we regressed trial-specific variables simultaneously against path length and arena complexity. This analysis revealed that the effect of complexity on behavior persists even after accounting for path length differences across arenas (Figure 4 – figure supplement 3). Second, path length is but one of many variables that collectively determine the complexity of the maze. Therefore, we also analyzed the effects of multiple trial-specific variables (number of turns, length of the optimal path, and the degree to which participants are expected to turn back the initial direction of heading to reach the goal, regardless of arena complexity) on eye movements. This revealed fine-grained insights on which task demands most influenced each eye movement quality that was described. More complex arenas posed, on average, greater challenges in terms of longer and more winding trajectories, such that eye movement qualities which increased with arena complexity also generally increased with specific measures of trial difficulty, albeit to varying degrees. We added additional plots to the main/supplementary figures and described these analyses under a new heading (“Linear mixed effects models”) in the Methods section.

      2. Similarly, it was not clear how the number of alternative plausible paths was considered in the analysis. It seems possible to have a very complex maze with no actual required choices that would involve a lot of scanning to determine this, or a very simple maze with just two very similar choices but which would involve significant scanning to weigh up which was indeed the shortest.

      Thank you for the suggestion. In conjunction with our response to the first comment from Reviewer #1, we used some constraints to identify non-trivial alternative trajectories – trajectories that pass through different locations in the arena but are roughly similar in length (within about 1 SD of the chosen trajectory). In alignment with your intuition, the most complex maze, as well as the completely open arena, did not have non-trivial alternative trajectories. For the three arenas of medium complexity, the more open arenas had more non-trivial alternative trajectories.
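      The constraints named above (no cyclic paths, limited overlap between alternatives, and a length comparable to the chosen trajectory) can be illustrated with a short sketch. This is only an illustration under stated assumptions and not the procedure from the Methods: the arena is reduced to a hypothetical graph, path length is counted in nodes rather than physical distance, and `length_tol` and `max_overlap` are placeholder parameters standing in for the "within about 1 SD" and "limited overlap" criteria.

```python
def simple_paths(graph, start, goal):
    """Yield every cycle-free path from start to goal (depth-first search)."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        for nxt in graph[node]:
            if nxt not in path:  # never revisit a node: no cyclic paths
                stack.append((nxt, path + [nxt]))


def edges(path):
    """Set of undirected edges traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def overlap(a, b):
    """Shared edges as a fraction of the shorter path's edge count."""
    ea, eb = edges(a), edges(b)
    return len(ea & eb) / min(len(ea), len(eb))


def alternatives(graph, chosen, length_tol, max_overlap):
    """Paths of comparable length that overlap little with the chosen
    path and with each other. Assumes chosen has at least two nodes."""
    kept = []
    candidates = sorted(simple_paths(graph, chosen[0], chosen[-1]),
                        key=lambda p: (len(p), p))
    for path in candidates:
        if path == chosen:
            continue
        if abs(len(path) - len(chosen)) > length_tol:
            continue
        if all(overlap(path, other) <= max_overlap
               for other in [chosen] + kept):
            kept.append(path)
    return kept


# Hypothetical toy arena: two short routes and one long detour from A to C.
graph = {
    "A": ["B", "D", "E"], "B": ["A", "C"], "C": ["B", "D", "G"],
    "D": ["A", "C"], "E": ["A", "F"], "F": ["E", "G"], "G": ["F", "C"],
}
chosen = ["A", "B", "C"]
print(alternatives(graph, chosen, length_tol=1, max_overlap=0.5))
```

      With these placeholder settings the long detour is excluded because its length differs too much from the chosen route, which mirrors the idea of discarding trivially suboptimal options.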

      When we analyzed the relative effect of the number of alternative trajectories on eye movement, we found that both possibilities you suggested are true. On trials with many comparable alternatives, participants indeed spend more time scanning the alternatives and less time looking at the goal (Figure S8D). Likewise, in the most complex maze where there are no alternatives, participants still spent much more time (than simpler mazes) learning about the arena structure at the expense of looking at the goal (Figure 3E-F). This analysis yielded interesting new insights into how participants solved the task and opens the door for investigating this trade-off in future work. More generally, because both deliberation and structure learning appear to drive eye movements, they must be factored into studies of human planning.

      3. Can the affordances linked to turning biases and momentum explain the error patterns? For example, paths that require turning back on the current trajectory direction to reach the goal will be more likely to cause errors, and patterns of eye-movements that might be related to such errors.

      Thank you for this question. In conjunction with the trial-specific analyses on the effect of the length of the trajectory (Point #1) on errors and eye movement patterns, we also looked into how the number of turns and the relative bearing (angle between the direction of initial heading and the direction of target approach) affected participants’ behavior. Turns and momentum do not affect the relative error (distance of the stopping location to the target) as much as the trajectory length does, which was unexpected (Figure 1 – figure supplement 1F). This supports that errors were primarily caused by forgetting the target location, and this memory leak gets worse with distance (or time). However, turns have an influence on eye movements in general. For example, more turns generally result in an increase in the fraction of time that participants spend gazing upon the trajectory (Figure 4 – figure supplement 1A) and sweeping (Figure 4D). Furthermore, the number of turns decreased the fraction of time participants spent gazing at the target during movement (Figure 2D).
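      As a concrete reading of the two quantities defined in this response, the sketch below computes a relative bearing (the angle between the initial heading and the direction of target approach) and a stopping error (the distance from the stopping location to the target). It assumes 2D coordinates and an unsigned angle in degrees; the function names and these conventions are illustrative assumptions, not taken from the manuscript.

```python
import math


def relative_bearing(initial_heading, approach_direction):
    """Unsigned angle in degrees (0 to 180) between two 2D direction vectors."""
    hx, hy = initial_heading
    ax, ay = approach_direction
    dot = hx * ax + hy * ay
    cross = hx * ay - hy * ax
    # atan2 of cross and dot is numerically stable and needs no normalization.
    return abs(math.degrees(math.atan2(cross, dot)))


def stopping_error(stop, target):
    """Euclidean distance between the stopping location and the target."""
    return math.hypot(stop[0] - target[0], stop[1] - target[1])


# Hypothetical trial: participant sets out heading east, approaches the
# target heading north, and stops short of it.
print(relative_bearing((1.0, 0.0), (0.0, 1.0)))
print(stopping_error((0.0, 0.0), (3.0, 4.0)))
```

      Under this convention, a bearing near 180 degrees corresponds to trials in which participants must turn back on their initial heading to reach the goal.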

      4. Why were half the obstacle transitions misremembered for the blind agent? This seems a rather arbitrary choice. More information to justify this would be useful.

      We tested out different percentages and found qualitatively similar results. The objective was to determine the patterns of eye movements that would be most beneficial when participants have an intermediate level of knowledge about the arena configuration (rather than near-zero or near-perfect), because during most trials, participants can also use peripheral vision to assess the rough layout, but they do not precisely remember the location of the obstacles. We added this explanation to Appendix 1, where the simulation details have been made in response to a suggestion by another reviewer.

      5. The description of some of the results could usefully be explained in more simple terms at various points to aid readers not so familiar with the RL formulation of the task. For example, a key result reported is that participants skew looking at the transition function in complex environments rather than the reward function. It would be useful to relate this to everyday scenarios, in this case broadly to looking more at the junctions in the maze than at the goal, or near the goal, when the maze is complex.

      This is a great suggestion. We added an everyday analogy when describing the trade-off on line 258.

      "The trade-off reported here is roughly analogous to the trade-off between looking ahead towards where you’re going and having to pay attention to signposts or traffic lights. One could get away with the former strategy while driving on rural highways whereas city streets would warrant paying attention to many other aspects of the environment to get to the destination."

      1. The authors should comment on their low participant sample size. The sample seems reasonable given the reproducibility of the patterns, but it is much lower than most comparable virtual navigation tasks.

      Thank you for the recommendation. We had some difficulties recruiting human participants who were willing to wear a headset which had been worn by other participants during COVID-19, and some participants dropped out of the study due to feeling motion sickness. To ameliorate the low sample size, we collected data on four more participants and performed analyses to confirm that the major findings may be observed in most individual participants. Participant-specific effects are included in the new plots made in response to Points # 1-3, and the number of participants with a significant result for each figure/panel has been included as Appendix 2 – table 3.

      Reviewer #3 (Public Review):

      In this article, Zhu and colleagues studied the role of eye movements in planning in complex environments using virtual reality technology. The main findings are that humans can 1) near optimally navigate in complex environments; 2) gaze data revealed that humans tend to look at the goal location in simple environments, but spend more time on task relevant structures in more complex tasks; 3) human participants show backward and forward sweeping mostly during planning (pre-movement) and execution (movement), respectively.

      I think this is a very interesting study with a timely question and is relevant to many areas within cognitive neuroscience, notably decision making, navigation. The virtual reality technology is also quite new for studying planning. The manuscript has been written clearly. This study helps with understanding computational principles of planning. I enjoyed reading this work. I have only one major comment about statistical analyses that I hope authors can address.

      We thank the reviewer for the accurate description and positive assessment of our work.

      Number of subjects included in analyses in the study is only nine. This is a very small sample size for most human studies. What was the motivation behind it? I believe that most findings are quite robust, but still 9 subjects seems too low. Perhaps authors can replicate their finding in another sample? Alternatively, they might be able to provide statistics per individual and only report those that are significant in all subjects (of course, this only works if reported effects are super robust. But only in such a case 9 subjects are sufficient.)

      Thank you for the suggested alternatives. Due to the pandemic, we had some difficulties recruiting human participants who were willing to wear a headset which had been worn by other participants. We collected data on four more participants and included them in the analyses, and also confirmed that the major findings are observed in most individuals. The number of participants with a significant result for each analysis has been included in Figure 1 – figure supplement 3 and Appendix 2 – table 3.

      Somewhat related to the previous point, it seems to me that authors have pooled data from all subjects (basically treating them as 1 super-subject?) I am saying this based on the sentence written on page 5, line 130: "Because we are interested in principles that are conserved across subjects, we pooled subjects for all subsequent analyses." If this is not the case, please clarify that (and also add a section on "statistical analyses" in Methods.) But if this is the case, it is very problematic, because it means that statistical analyses are all done based on a fixed-effect approach. The fixed effect approach is infamous for inflated type I error.

      Your interpretation is correct and we acknowledge your concern about pooling participants. We had done this after observing that our results were consistent across participants but this was not demonstrated. We have now performed analyses sensitive to participant-specific effects and find that all major results hold for most participants, and we included additional main and supplementary bar plots (and tables in Appendix 2) showing per-participant data. The new plots/table show the effect of independent variables (mainly trial/arena difficulty) on dependent variables for each participant, as well as general effects conserved across participants. A new paragraph was added to the Methods section to describe the “Linear mixed effects models” which we used.
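
      The contrast between pooling all trials and treating participants as the unit of analysis can be illustrated with a minimal sketch. This is not the linear mixed effects model described in the response; it is a simpler two-stage (summary-statistics) stand-in, assuming one slope is fitted per participant and the slopes are then tested against zero. All data values and names (`ols_slope`, `group_t_from_slopes`) are hypothetical.

```python
import math
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope of y on x for a single participant."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def group_t_from_slopes(slopes):
    """One-sample t statistic (against 0) on per-participant slopes.

    Degrees of freedom are n - 1, where n is the number of participants,
    not the number of trials.
    """
    n = len(slopes)
    t = mean(slopes) / (stdev(slopes) / math.sqrt(n))
    return t, n - 1


# Hypothetical data: difficulty level per trial and a made-up dependent variable.
data = {
    "p1": ([1, 2, 3, 4], [0.10, 0.18, 0.33, 0.41]),
    "p2": ([1, 2, 3, 4], [0.05, 0.12, 0.15, 0.26]),
    "p3": ([1, 2, 3, 4], [0.20, 0.22, 0.31, 0.35]),
}

slopes = [ols_slope(x, y) for x, y in data.values()]
t_stat, df = group_t_from_slopes(slopes)
n_positive = sum(s > 0 for s in slopes)  # participants showing the effect's sign
print(f"t({df}) = {t_stat:.2f}; {n_positive}/{len(slopes)} participants with positive slope")
```

      Because the test statistic is computed across participants, its degrees of freedom scale with the sample of people rather than the number of trials, which is the concern raised about the pooled analysis.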

      Again, quite related to the last two points: please include degrees of freedom for every statistical test (i.e. every reported p-value).

      Degrees of freedom (df) are now included along with each p-value.

    1. Author Response

      Reviewer #1 (Public Review):

      Using fMRI-based univariate and multivariate analyses, Root, Muret, et al. investigated the topography of face representation in the somatosensory cortex of typically developed two-handed individuals and individuals with a congenital and acquired missing hand. They provide clear evidence for an upright face topography in the somatosensory cortex in all three groups. Moreover, they find that one-handers, but not amputees, show shorter distances from lip representations to the hand area, suggesting a remapping of the lips. They also find a shift away of the upper face from the deprived hand area in one-handers, and significantly greater dissimilarity between face part representations in amputees and one-handers. The authors argue that this pattern of remapping is different to that of cortical neighborhood theories and points toward a remapping of face parts which have the ability to compensate for hand function, e.g., using the lips/mouth to manipulate an object.

      These findings provide interesting insights into the topographic organization of face parts and the principles of cortical (re)organization. The authors use several analytical approaches, including distance measures between hand- and face-part-responsive regions and representational similarity analysis (RSA). Particularly commendable is the rigorous statistical analysis, such as the use of Bayesian comparisons, and careful interpretation of absent group differences.

      We thank the reviewer for their positive and constructive feedback.

      Reviewer #2 (Public Review):

      After amputation, the deafferented limb representation in the somatosensory cortex is activated by stimulation of other body parts. A common belief is that the lower face, including the lips, preferentially "invades" deafferented cortex due to its proximity to cortex. In the present study, this hypothesis is tested by mapping the somatosensory cortex using fMRI as amputees, congenital one-handers, and controls moved their forehead, nose, lips or tongue. First, they found that, unlike its counterpart in monkeys, the representation of the face in the somatosensory cortex is right-side up, with the forehead most medial (and abutting the hand) and the lips most lateral. Second, there was little evidence of "reorganization" of the deafferented cortex in amputees, even when tested with movements across the entire face rather than only the lips. Third, congenital one-handers showed significant reorganization of deafferented cortex, characterized principally by the invasion of the lower face, in contrast to predictions from the hypothesis that proximity was the driving factor. Fourth, there was no relationship between phantom limb pain reports and reorganization.

      As a non-expert in fMRI, I cannot evaluate the methodology. That being said, I am not convinced that the current consensus is that the representation of the face in humans is flipped compared to that of monkeys. Indeed, the overwhelming majority of somatosensory homunculi I have seen for humans has the face right side up. My sense is that the fMRI studies that found an inverted (monkey-like) face representation contradict the consensus.

      Thank you for pointing this out. As we tried to emphasise in the introduction, very few neuroimaging studies actually investigated face somatotopy in humans, with inconsistent results. We agree the default consensus tends to be dominated by the up-right depiction of Penfield’s homunculus (recently replicated by Roux et al, 2018). However, due to methodological and practical constraints, alignment across subjects in the case of intracortical recordings is usually difficult to achieve, and thus makes it difficult to assess the consistency in topographical organisation. Moreover, previous imaging studies did not manage to convincingly support Penfield’s homunculus. For these two key reasons, the spatial orientation of the human facial homunculus is still debated. A further limiting factor of previous studies is that the vast majority of studies investigating face (re)mapping in humans focused solely on the lip representation, using the cortical proximity hypothesis to interpret their results. Consequently, as we highlight above in our response to the Editor, there is a wide-spread and false representation in the human literature of the lips neighbouring the hand area.

      To account for the reviewer’s critique and convey some of this context, we changed our title from: Reassessing face topography in primary somatosensory cortex and remapping following hand loss; to: Complex pattern of facial remapping in somatosensory cortex following congenital but not acquired hand loss. This was done to de-emphasise the novelty of face topography relative to our other findings.

      We also rewrote our introduction (lines 79-94) as follows:

      “The research focus on lip cortical remapping in amputees is based on the assumption that the lips neighbour the hand representation. However, this assumption goes against the classical upright orientation of the face in S1 [26–30], as first depicted in Penfield’s Homunculus and in later intracortical recordings and stimulation studies [26–29], with the upper-face (i.e., forehead) bordering the hand area. In contrast, neuroimaging studies in humans studying face topography provided contradictory evidence for the past 30 years. While a few neuroimaging studies provided partial evidence in support of the traditional upright face organisation [31], other studies supported the inverted (or ‘upside-down’) somatotopic organisation of the face, similar to that of non-human primates [32,33]. Other studies suggested a segmental organisation [34], or even a lack of somatotopic organisation [35–37], whereas some studies provided inconclusive or incomplete results [38–41]. Together, the available evidence does not successfully converge on face topography in humans. In line with the upright organisation originally suggested by Penfield, recent work reported that the shift in the lip representation towards the missing hand in amputees was minimal [42,43], and likely to reside within the face area itself. Surprisingly, there is currently no research that considers the representation of other facial parts, in particular the upper-face (e.g., the forehead), in relation to plasticity or PLP.”

      We also updated the discussion accordingly (lines 457, 469-477, 490-492).

      Similarly, it is not clear to me how the observations (1) of limited reorganization in amputees, (2) of significant reorganization in congenital one-handers, and (3) of the lack of relationship between PLP and reorganization is novel given the previous work by this group. Perhaps the authors could more clearly articulate the novelty of these results compared to their previous findings.

      Thank you for giving us the opportunity to clarify this important point. The novelty of these results can be summarised as follows:

      (1) Conceptually, it is crucial for us to understand if deprivation-triggered plasticity is constrained by the local neighbourhood, because this can give us clues regarding the mechanisms driving the remapping. We provide strong topographic evidence about the face orientation in controls, amputees and one-handers.

      (2) The vast majority of previous research on brain plasticity following hand loss (both congenital and acquired) in humans has exclusively focused on the lower face, and lips in particular. We provide systematic evidence for stable organisation and remapping of the neighbouring upper face, as well as the lower face. We also study topographic representation of the tongue (and nose) for the first time.

      (3) The vast majority of previous research on brain remapping following hand loss (both congenital and acquired, neuroimaging and electrophysiological) was focused on univariate activity measures, such as the spatial spread of units showing a similar feature preference, or the average activity level across individual units. We are going beyond remapping by using RSA, which allows us to ask not only if new information is available in the deprived cortex (as well as the native face area), but also whether this new information is structured consistently across individuals and groups. We show that representational content is enhanced in the deprived cortex of one-handers whereas it is stable in amputees relative to controls (and to their intact hand region).

      (4) Based on previous studies, the assumption was that reorganisation in congenital one-handers was relatively unspecific, affecting all tested body parts. Here, we provide evidence for a more complex pattern of remapping, with the forehead representation seemingly moving out of the missing hand region (and the nose representation being tentatively similar to controls). That is, we show not just “invasion” but also a shift of the neighbour away from the hand area which has never been documented (or in fact suggested).

      (5) Using Bayesian analyses we provide definitive evidence against a relationship between PLP and forehead remapping, providing first and conclusive evidence against the remapping hypothesis, based on cortical neighbourhood.

      Our inclination is not to add a summary paragraph of these points in our discussion, as it feels too promotional. Instead, we have re-written large sections of the introduction and discussion to better emphasise each of these points separately throughout the text, where the context is most appropriate. Given the public review strategy taken by eLife, the novelty summary provided above will be available for any interested reader, as part of the public review process. However, should the reviewer feel that a novelty summary paragraph is required (or an emphasis on any of the points summarised above), we will be happy to revise the manuscript accordingly.

      Finally, Jon Kaas and colleagues (notably Niraj Jain) have provided evidence in experiments with monkeys that much of the observed reorganization in the somatosensory cortex is inherited from plasticity in the brain stem. Jain did not find an increased propensity for axons to cross the septum between face and hand representations after (simulated) amputation. From this perspective, the relevant proximity would be that of the cuneate and trigeminal nuclei and it would be critical to map out the somatotopic organization of the trigeminal and cuneate nuclei to test hypotheses about the role of proximity in this remapping.

      Thank you for highlighting this very relevant point, which we are well aware of. We fully agree with the reviewer that this is an important goal for future study, but functional imaging of the brainstem in humans is particularly challenging and would require ultra high field imaging (7T) and specialised equipment. We have encountered much local resistance due to hypothetical issues for MRI safety for scanning amputees in this higher field strength, meaning we are unable to carry out this research ourselves. Our former lab member Sanne Kikkert, who is now running her independent research programme in Zurich, has been working towards this goal for the past 4 years. So we can say with confidence that this aim is well beyond the scope of the current study. In response to your comment, we mentioned this potential mechanism in the introduction (lines 98-101), we ensured that we only referred to “cortical proximity” throughout our manuscript, and we circle back to this important point in the discussion.

      Lines 539-543: “Moreover, even if the remapping we observed here goes against the theory of cortical proximity, it can still arise from representational proximity at the subcortical level, in particular at the brainstem level [44,45]. While challenging in humans, mapping both the cuneate and trigeminal nuclei would be critical to provide a more complete picture regarding the role of proximity in remapping.”

      Reviewer #3 (Public Review):

      In their study, the authors set out to challenge the long-held claim that cortical remapping in the somatosensory cortex in hand deprived cortical territories follows somatotopic proximity (the hand region gets invaded by cortical neighbors) as classically assumed. In contrast to this claim, the authors suggest that remapping may not follow cortical proximity but instead functional rules as to how the effector is used. Their data indeed suggest that the deprived hand area is not invaded by the forefront which is the cortical neighbor but instead by the lips which may compensate for hand loss in manipulating objects. Interestingly the authors suggest this is mostly the case for one-handers but not in amputees for whom the reorganization seems more limited in general (but see my comments below on this last point).

      This is a remarkably ambitious study that has been skilfully executed on a strong number of participants in each group. The complementarity of state-of-the-art uni- and multi-variate analyses are in the service of the research question, and the paper is clearly written. The main contribution of this paper, relative to previous studies including those of the same group, resides in the mapping of multiple face parts all at once in the three groups.

      We are grateful to the reviewer for appreciating the immense effort that this study involved.

      In the winner takes all approach, the authors only include 3 face parts but exclude from the analyses the nose and the thumb. I am not fully convinced by the rationale for not including nose in univariate analyses - because it does not trigger reliable activity - while keeping it for representational similarity analyses. I think it would be better to include the nose in all analyses or demonstrate this condition is indeed "noisy" and then remove it from all the analyses. Indeed, if the activity triggered by nose movement is unreliable, it should also affect multivariate.

      Following this comment, we re-ran all univariate analyses to include the nose, and updated throughout the main text and supplemental results and related figures. In short, adding the nose did not change the univariate results, apart from a now significant group x hemisphere interaction for the CoG of the tongue when comparing amputees and controls, matching better the trends for greater surface coverage in the deprived hand ROI of amputees. Full details are provided in our response to Reviewer 1 above.

      The rationale for not including the hand is maybe more convincing as it seems to induce activity in both controls and amputees but not in one-handers. First, it would be great to visualize this effect, at least as supplemental material to support the decision. Then, this brings the interesting possibility that enhanced invasion of hand territory by lips in one-handers might link to the possibility to observe hand-related activity in the presupposed hand region in this population. Maybe the authors may consider linking these.

      Thank you for this comment. As we explain in our response to Reviewer 1 above, we did not intend the thumb condition in one-handers for analysis, as the task given to one-handers (imagine moving a body part you never had before) is inherently different to that given to the other groups (move - or at least attempt to move - your (phantom) hand). As such, we could not pursue the analysis suggested by the reviewer here. To reduce the discrepancy and following Reviewer 1’s advice, we decided to remove the hand-face dissimilarity analysis which we included in our original manuscript, and might have sparked some of this interest. Upon reflection we agreed that this specific analysis does not directly relate to the question of remapping (but rather of shared representation), in addition to making the paper unbalanced. We will now feature this analysis in another paper that appears more appropriate in the context of referred sensations in amputees (Amoruso et al, 2022 MedRxiv).

      The use of the geodesic distance between the center of gravity in the Winner Take All (WTA) maps between each movement and a predefined cortical anchor is clever. More details about how the Center Of Gravity (COG) was computed on spatially disparate regions might deserve more explanations, however.

      We are happy to provide more detail on this analysis, which weights the CoG based on cluster size (using the workbench command -metric-weighted-stats). Let’s consider the example shown here (Figure 1) for a single control participant, where each CoG is measured either without weighting (yellow vertices) or with cluster weighting (forehead CoG=red, lip CoG=dark blue, tongue CoG=dark red). When the movement produces a single cluster of activity (the lips in the non-dominant hemisphere, shown in blue), the CoG’s location was identical for both weighted (red) and unweighted (yellow) calculations. But other movements, such as the tongue (green), produced one large cluster (at the lateral end), with a few more disparate smaller clusters more medially. In this case, the larger cluster of maximal activity is weighted to a greater extent than the smaller clusters in the CoG calculation, meaning the CoG is slightly skewed towards it (dark red), relative to the smaller clusters.

      Figure 1. Centre-of-gravity calculation, weighted and unweighted by cluster size, in an example control participant. Here the winner-takes-all output for each facial movement (forehead=red, lips=blue, tongue=green) was used to calculate the centre-of-gravity (CoG) at the individual-level in both the dominant (left-hand side) and non-dominant (right-hand side) hemisphere, weighted by cluster size (forehead CoG=red, lip CoG=dark blue, tongue CoG=dark red), compared to an unweighted calculation (denoted by yellow dots within each movement’s winner-takes-all output).

      This is now explained in the methods (lines 760-765) as follows:

      “To assess possible shifts in facial representations towards the hand area, the centre-of-gravity (CoG) of each face-winner map was calculated in each hemisphere. The CoG was weighted by cluster size meaning that in the event of multiple clusters contributing to the calculation of a single CoG for a face-winner map, the voxels in the larger cluster are overweighted relative to those in the smaller clusters. The geodesic cortical distance between each movement’s CoG and a predefined cortical anchor was computed.”
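
      The effect of cluster-size weighting on a CoG can be sketched as follows. This is a simplified illustration and not the workbench implementation: it assumes 2D vertex coordinates and assumes that every vertex is weighted by the number of vertices in its own cluster. The function names and coordinates are hypothetical.

```python
def weighted_cog(clusters):
    """Centre of gravity with each vertex weighted by its cluster's size.

    clusters: list of clusters, each a list of (x, y) vertex coordinates.
    The weighting rule (weight = cluster size) is an assumption for illustration.
    """
    total_w = 0.0
    sx = sy = 0.0
    for cluster in clusters:
        w = len(cluster)  # larger clusters contribute more per vertex
        for x, y in cluster:
            sx += w * x
            sy += w * y
            total_w += w
    return sx / total_w, sy / total_w


def unweighted_cog(clusters):
    """Plain mean of all vertex coordinates, ignoring cluster membership."""
    pts = [p for cluster in clusters for p in cluster]
    n = len(pts)
    return sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n


# Hypothetical map: one large lateral cluster and one small medial cluster.
large = [(10, 0), (10, 1), (11, 0), (11, 1)]
small = [(0, 0)]

print(unweighted_cog([large, small]))  # pulled towards the small cluster
print(weighted_cog([large, small]))    # skewed towards the large cluster
```

      With a single cluster the two calculations coincide, which matches the behaviour described for the lips in the example participant.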

      Moreover, imagine that for some reason the forefront region extends both dorsally and ventrally in a specific population (eg amputees), the COG would stay unaffected but the overlap between hand and forefront would increase. The analyses on the surface area within hand ROI for lips and forehead nicely complement the WTA analyses and suggest higher overlap for lips and lower overlap for forehead but none of the maps or graphs presented clearly show those results - maybe the authors could consider adding a figure clearly highlighting that there is indeed more lip activity IN the hand region.

      We agree with you on this limitation of the CoG and this is why we interpret all cortical distances analyses in tandem with the laterality indices. The laterality indices correspond to the proportion of surface area in the hand region for a given face part in the winner-maps.

      Nevertheless, to further convince the Reviewer, we extracted activity levels (beta values) within the hand region of congenitals and controls, and we ran (as for CoGs) a mixed ANOVA with the factors Hemisphere (deprived x intact) and Group (controls x one-handers).

      As expected from the laterality indices obtained for the Lips, we found a significant group x hemisphere interaction (F(1,41)=4.52, p=0.040, ηp²=0.099), arising from enhanced activity in the deprived hand region in one-handers compared to the non-dominant hand region in controls (t(41)=-2.674, p=0.011) and to the intact hand region in one-handers (t(41)=-3.028, p=0.004).
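
      As a consistency check on the reported effect size, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity ηp² = (F · df1) / (F · df1 + df2). The short sketch below applies it to the interaction reported above; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic: F*df1 / (F*df1 + df2)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Group x hemisphere interaction: F(1,41) = 4.52
eta_p2 = partial_eta_squared(4.52, 1, 41)
print(round(eta_p2, 3))
```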

      Since this kind of analysis was the focus of previous studies (from which we are trying to get away) and since it is redundant with the proportion of face-winner surface coverage in the hand region, we decided not to include it in the paper. But we could add it as a Supplementary result if the Reviewer believes this strengthens our interpretation.

      In addition to overlap analyses between hand and other body parts, the authors may also want to consider doing some Jaccard similarity analyses between the maps of the 3 groups to support the idea that amputees are more alike controls than one-handers in their topographic activity, which again does not appear clear from the figures.

      We thank the reviewer for this clever suggestion. We now include the Jaccard similarity analysis, which quantified the degree of similarity (0=no overlap between maps; 1=fully overlapping) between winner-takes-all maps (which included the nose; akin to the revised univariate results) across groups. For each face part/amputee, the similarity with the 22 controls and 21 one-handers respectively was averaged. We utilised a linear mixed model which included fixed factors of Group (One-handers x Controls), Movement (Forehead x Nose x Lips x Tongue) and Hemisphere (Intact x Deprived) on Jaccard similarity values (similar to what we used for the RSA analysis). A random effect of participant, as well as a covariate of age, were also included in the model.
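
      The similarity measure described above can be sketched in a few lines. This is a minimal illustration, assuming each winner-takes-all map for a given face part is represented as a set of vertex indices; the handling of two empty maps and all names and values are assumptions rather than details from the manuscript.

```python
def jaccard(map_a, map_b):
    """Jaccard index: |A intersect B| / |A union B| (0 = no overlap, 1 = identical)."""
    a, b = set(map_a), set(map_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty maps are scored as no overlap
    return len(a & b) / len(union)


def mean_similarity(target_map, reference_maps):
    """Average Jaccard similarity of one participant's map to a reference group."""
    return sum(jaccard(target_map, ref) for ref in reference_maps) / len(reference_maps)


# Hypothetical vertex sets for one face part in one hemisphere.
amputee_lips = {1, 2, 3, 4}
control_lips = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4}]
one_hander_lips = [{7, 8, 9}, {4, 7, 8}]

print(mean_similarity(amputee_lips, control_lips))
print(mean_similarity(amputee_lips, one_hander_lips))
```

      Averaging the index over every member of a reference group yields one value per face part, hemisphere and reference group for each amputee, which is the kind of quantity entered into the model.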

      Results showed a significant group x hemisphere interaction (F(240.0)=7.70, p=0.006; controlled for age; Fig. 5), indicating that amputees’ maps showed different similarity values to controls’ and one-handers’ depending on the hemisphere. Post-hoc comparisons (corrected alpha=0.025; uncorrected p-values reported) revealed significantly higher similarity to controls’ than to one-handers’ maps in the deprived hemisphere (t(240)=-3.892, p<.001). Amputees’ maps also showed higher similarity to controls’ maps in the deprived relative to the intact hemisphere (t(240)=2.991, p=0.003). Amputees, therefore, displayed greater similarity of facial somatotopy in the deprived hemisphere to controls, suggesting again less evidence for cortical remapping in amputees.

      We added these results at the end of the univariate analyses (lines 335-351) and in the discussion (lines 464-465 and 497-500).

      This brings to another concern I have related to the claim that the change in the cortical organization they observe is mostly observed in one-handers. It seems that most of this conclusion relies on the fact that some effects are observed in one-handers but not in amputees when compared to controls, however, no direct comparisons are done between amputees and one-handers so we may be in an erroneous inference about the interaction when this is actually not tested (Nieuwenhuis, 11). For instance, the shift away from the hand/face border of the forehead is also (mildly) significant in amputees (as observed more strongly in one-handers) so the conclusion (eg from the subtitle of the results section) that it is specific to one-hander might not fully be supported by the data. Similar to the invasion of the hand territory from the lips which is significant in amputees in terms of surface area. All together this calls for toning down the idea that plasticity is restricted to congenital deprivation (eg last sentence of the abstract). Even if numerically stronger, if I am not wrong, there are no stats showing remapping is indeed stronger in one-handers than in amputees and actually, amputees show significant effects when compared to controls along the lines as those shown (even if more strongly) in one-handers.

      Thank you for this very important comment. We fully agree – the RSA across-groups comparison is highly informative but insufficient to support our claims. We did not compare the groups directly to avoid multiple comparisons (both for statistical reasons and to manage the size of the results section). But the reviewer’s suggestion to perform a Jaccard similarity analysis complements very nicely the univariate and multivariate results and allows for a direct (and statistically lean) comparison between groups, to assess whether amputees are more similar to controls or to congenital one-handers, taking into account all aspects of their maps (both spatial location/CoG and surface coverage). We added the Jaccard analysis to the main text, at the end of the univariate results (lines 335-385). The Jaccard analysis suggests that amputees’ maps in the deprived hemisphere were more similar to the maps of controls than to the ones of congenital one-handers. This allowed us to obtain significant statistical results to support the claim that remapping is indeed stronger in one-handers than in amputees (lines 346-351). We also compared both amputees and one-handers to the control group. In line with our univariate results, this revealed that the only face part for which controls were more similar to one-handers than to amputees was the tongue (lines 379-381). And that the forehead remapping observed at the univariate level in amputees (surface area), is likely to arise from differences in the intact hemisphere (lines 381-383).

      Finally, we also added the post-hoc statistics comparing amputees to congenitals in the RSA analysis (lines 425-427): “While facial information in the deprived hand area was increased in one-handers compared with amputees, this effect did not survive our correction for multiple comparisons (t(70.7)=-2.117, p=0.038).”

      Regarding the univariate results mentioned by the reviewer, we would like to emphasise that we had no significant effect for the lips in amputees, though we agree the surface area appears in between controls and one-handers. But this laterality index was not different from zero. This test is now added lines 189-190. Regarding the forehead, we fully agree with the Reviewer, and we adjusted the subtitle accordingly (lines 241-242). For consistency, we also added the t-test vs zero for the forehead surface area (non-significant, lines 251-253).

      Also, maybe the authors could explore whether there is actually a link between the number of years without hand and the remapping effects.

      To address this question, we explored our data using a correlation analysis. The only body part that showed some suggestive remapping effects was the tongue, and so we explored whether we could find a relationship (Pearson’s correlation) between years since amputation and the laterality index of the Tongue in amputees (r = 0.007, p=0.980, 95% CI [-0.475, 0.475]). We also explored amputees’ global Jaccard similarity values to controls in the deprived hemisphere (r = -0.010, p=0.970, 95% CI [-0.488, 0.473]), and could not find any relationship. Considering there was no strong remapping effect to explain, we find this result too exploratory to include in our manuscript.
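
      For readers who wish to reproduce this type of check, a Pearson correlation with a confidence interval can be sketched as below. The interval uses the Fisher z-transform, which is an assumption; the response does not state how the reported intervals were computed. The data and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transform (requires n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical values: years since amputation and a laterality index.
years = [2, 5, 8, 12, 20, 25, 31]
laterality = [0.10, -0.05, 0.02, 0.08, -0.03, 0.04, 0.00]

r = pearson_r(years, laterality)
low, high = fisher_ci(r, len(years))
print(f"r = {r:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```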

      One hypothesis generated by the data is that lips remap in the deprived hand area because lips serve compensatory functions. Actually, also in controls, lips and hands can be used to manipulate objects, in contrast to the forehead. One may thus wonder if the preferential presence of lips in the hand region is not latent even in controls as they both link in functions?

      We agree with the reviewer’s reasoning, and we think that the distributed representational content we recently found in two-handers (Muret et al, 2022) provides a first hint in this direction. It is worth noting that in that previous publication we did not find differences across face parts in the activity levels obtained in the hand region, except for slightly more negative values for the tongue. But we do think that such latent information is likely to provide a “scaffolding” for remapping. While the design of our face task does not allow us to assess information content for each face part (as done for the lips in Muret et al, 2022), this should be further investigated in follow-up studies.

      We added a sentence in the discussion to highlight this interesting notion: Lines 556-559: “Together with the recent evidence that lip information content is already significant in the hand area of two-handed participants (Muret et al, 2022), compensatory behaviour since developmental stages might further uncover (and even potentiate) this underlying latent activity.”

    1. Author Response

      Reviewer #1 (Public Review):

      Point 1: Many of the initial analyses of behavior metrics, for instance predicting reaction times, number of fixations, or fixation duration, use value difference as a regressor. However, given a limited set of values, value differences are highly correlated with the option values themselves, as well as the chosen value. For instance, in this task the only time when there will be a value difference of 4 drops is when the options are 1 and 5 drops, and given the high performance of these monkeys, this means the chosen value will overwhelmingly be 5 drops. Likewise, there are only two combinations that can yield a value difference of 3 (5 vs. 2 and 4 vs 1), and each will have relatively high chosen values. Given that value motivates behavior and attracts attention, it may be that some of the putative effects of choice difficulty are actually driven by value.

      To address this question, we have adapted the methods of Balewski and colleagues (Neuron, 2022) to isolate the unique contributions of chosen value and trial difficulty to reaction time and the number of fixations in a given trial (the two behaviors modulated by difficulty in the original paper). This new analysis reveals a double dissociation in which reaction time decreases as a function of chosen value but not difficulty, while the number of fixations in a trial shows the opposite pattern. Our interpretation is that reaction time largely reflects reward anticipation, whereas the number of fixations largely reflects the amount of information required to render a decision (i.e., choice difficulty). See lines 144-167 and Figure 2.
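
      The confound raised in Point 1 can be made concrete by enumerating the possible offer pairs. The sketch below is only illustrative: it assumes offers of 1 to 5 drops with unequal values in each pair, and it assumes for simplicity that the higher-valued option is always chosen. It is not the analysis adapted from Balewski and colleagues.

```python
from collections import defaultdict
from itertools import combinations

# Assumption: offers range from 1 to 5 drops and the two options differ.
values = range(1, 6)

by_difference = defaultdict(list)
for low, high in combinations(values, 2):
    by_difference[high - low].append((low, high))

for diff in sorted(by_difference):
    pairs = by_difference[diff]
    # Simplification: the higher-valued option is always the chosen one.
    mean_chosen = sum(high for _, high in pairs) / len(pairs)
    print(f"difference {diff}: pairs {pairs}, mean chosen value {mean_chosen}")
```

      Under these assumptions the mean chosen value rises with the value difference, which is why the two variables need to be separated statistically.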

      Point 2: Related to point 1, the study found that duration of first fixations increased with fixated values, and second (middle) fixation durations decreased with fixated value but increased with relative value of the fixated versus other value. Can this effect be more concisely described as an effect of the value of the first fixated option carrying over into behavior during the second fixation?

      This is a valid interpretation of the results. To test this directly, we now include an analysis of middle fixation duration as a function of the not-currently-viewed target. Note that the vast majority of middle fixations are the second fixation in the trial, and therefore the value of the unattended target is typically the one that was viewed first. The analysis showed a negative correlation between middle fixation duration and the value of the unattended target, which is consistent with the first fixated value carrying over to the second fixation. See lines 243-246.

      Point 3: Given that chosen (and therefore anticipated) values can motivate responses, often measured as faster reaction times or more vigorous motor movements, it seems curious that terminal non-decision times were calculated as a single value for all trials. Shouldn't this vary depending at least on chosen values, and perhaps other variables in the trial?

      In all sequential sampling model formulations we are aware of, nondecision time is considered to be fixed across trial types. Examples can be found for perceptual decisions (e.g., Resulaj et al., 2009) and in the “bifurcation point” approach used in the recent value-based decision study by Westbrook et al. (2020).

      To further investigate this issue, we asked whether other post-decision processes were sensitive to chosen value in our paradigm. To do so, we measured the interval between the center lever lift and the left or right lever press, corresponding to the time taken to perform the reach movement in each trial (reach latency). We then fit a mixed effects model explaining reach latency as a function of chosen value. While the results showed significantly faster reach latencies with higher chosen values, the effect size was very small, showing on average a ~3ms decrease per drop of juice. In other words, between the highest and lowest levels of chosen value (5 vs. 1), there is only a difference of approximately 12ms. In contrast, the main RT measure used in the study (the interval between target onset and center lever lift) is an order of magnitude more sensitive to chosen value, decreasing ~40ms per drop of juice. These results are shown in Author response image 1.

      Author response image 1.

      This suggests that post-decision processes (NDT in standard models and the additive stage in the Westbrook paper) vary only minimally as a function of chosen value. We are happy to include this analysis as a supplemental figure upon request.
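
      The comparison of effect sizes in this response can be checked with a few lines of arithmetic. The slopes below are the approximate values quoted in the response (~3ms and ~40ms per drop); the variable names are ours.

```python
# Approximate slopes quoted in the response (ms per drop of juice).
reach_slope_ms = 3   # reach latency: center lever lift to left/right press
rt_slope_ms = 40     # main RT measure: target onset to center lever lift

# Highest vs. lowest chosen value (5 vs. 1 drops).
drop_range = 5 - 1

reach_span_ms = reach_slope_ms * drop_range  # 12 ms
rt_span_ms = rt_slope_ms * drop_range        # 160 ms
ratio = rt_slope_ms / reach_slope_ms         # roughly 13x

print(reach_span_ms, rt_span_ms, round(ratio, 1))
```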

      Point 4: The paper aims to demonstrate similarities between monkey and human gaze behavior in value-based decisions, but focuses mainly on a series of results from one group of collaborators (Krajbich, Rangel and colleagues). Other labs have shown additional nuance that the present data could potentially speak to. First, Cavanaugh et al. (J Exp Psychol Gen, 2014) found that gaze allocation and value differences between options independently influence drift rates on different choices. Second, gaze can correlate with choice because attention to an option amplifies its value (or enhances the accumulation of value evidence) or because chosen options are attended more after the choice is implicitly determined but not yet registered. Westbrook et al. (Science, 2020) found that these effects can be dissociated, with attention influencing choice early in the trial and choice influencing attention later. The NDTs calculated in the present study allot a consistent time to translating a choice into a motor command, but as noted above don't account for potential influences of choice or value on gaze.

      The two-stage model of gaze effects put forth by Westbrook et al. (2020) is consistent with other observations of gaze behavior and choice (e.g., Thomas et al., 2019, Smith et al., 2018, Manohar & Husain, 2013). In this model, gaze effects early in the trial are best described by a multiplicative relationship between gaze and value, whereas gaze effects later in the trial are best described with an additive model term. To test the two-stage hypothesis, Westbrook and colleagues determined a ‘bifurcation point’ for each subject that represented the time at which gaze effects transitioned from multiplicative to additive. In our data, trial durations were typically very short (<1s), making it difficult to divide trials and fit separate models to them. We therefore took a different approach: We reasoned that if gaze effects transition from multiplicative to additive at the end of the trial, then the transition point could be estimated by removing data from the end of each trial and assessing the relative fit of a multiplicative vs. additive model. If the early gaze effects are predominantly multiplicative and late gaze effects are additive, the relative goodness of fit for an additive model should decrease as more data are removed from the end of the trial. To test this idea, we compared the relative fit of additive vs. multiplicative models in the raw data, and for data in which successively larger epochs were removed from the end of the trial (50, 100, 150, 200, 300, and 400ms). The relative fit was assessed by computing the relative probability that each model accurately reflects the data. In addition, to identify significant differences in goodness of fit, we compared the WAIC values and their standard errors for each model (Supplemental File 3).

      As shown in Figure 4, the relative fit probability for both models is nonzero in the raw data (0 truncation), indicating that neither model provides a definitive best fit, potentially reflecting a mixture of the two processes. However, the relative fit of the additive model decreases sharply as data are removed, reaching zero at 100ms truncation. 100ms is also the point at which multiplicative models provide a significantly better fit, indicated by non-overlapping standard error intervals for the two models (Supplemental File 3). Together, this suggested that the transition between early- and late-stage gaze effects likely occurs approximately 100ms before the RT.
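
      One common way to turn information criteria into relative model probabilities is with Akaike-type weights. The sketch below assumes that approach and assumes WAIC is expressed on the deviance scale (lower is better); the response does not specify the exact computation, and the WAIC values and standard errors shown are hypothetical. The function names `waic_weights` and `se_intervals_overlap` are ours.

```python
import math


def waic_weights(waic_by_model):
    """Akaike-type weights from WAIC values (deviance scale, lower is better)."""
    best = min(waic_by_model.values())
    raw = {name: math.exp(-0.5 * (w - best)) for name, w in waic_by_model.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}


def se_intervals_overlap(waic_a, se_a, waic_b, se_b):
    """True if the WAIC +/- 1 SE intervals of two models overlap."""
    return (waic_a - se_a) <= (waic_b + se_b) and (waic_b - se_b) <= (waic_a + se_a)


# Hypothetical WAIC values and standard errors, for illustration only.
waic = {"multiplicative": 1000.0, "additive": 1010.0}
se = {"multiplicative": 4.0, "additive": 4.0}

weights = waic_weights(waic)
overlap = se_intervals_overlap(
    waic["multiplicative"], se["multiplicative"], waic["additive"], se["additive"]
)
print(weights, overlap)
```

      In this hypothetical case the intervals do not overlap and nearly all of the weight falls on the multiplicative model, which is the pattern the response describes at 100ms truncation.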

      To minimize the influence of post-decision gaze effects, the main results use data truncated by 100ms. However, because 100ms is only an estimate, we repeated the main analyses over truncation values between 0 and 400ms, reported in Figure 6 - figure supplement 1 & Figure 7 - figure supplement 1. These show significant gaze duration biases and final gaze biases in data truncated by up to 200ms.
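
      Truncating gaze data before the RT can be sketched as follows. This is an assumption about the bookkeeping, not the authors' code: fixations are represented as hypothetical `(start_ms, end_ms, target)` tuples relative to target onset, and the function name `truncate_fixations` is ours.

```python
def truncate_fixations(fixations, rt_ms, cut_ms):
    """Drop gaze data from the final cut_ms before the reaction time.

    fixations: list of (start_ms, end_ms, target) tuples, in trial order.
    Fixations that begin inside the removed window are discarded, and a
    fixation that straddles the cutoff is clipped to end at the cutoff.
    """
    cutoff = rt_ms - cut_ms
    kept = []
    for start, end, target in fixations:
        if start >= cutoff:
            continue
        kept.append((start, min(end, cutoff), target))
    return kept


# Hypothetical trial with an RT of 600 ms, truncated by 100 ms.
trial = [(0, 250, "left"), (250, 520, "right"), (520, 600, "left")]
print(truncate_fixations(trial, rt_ms=600, cut_ms=100))
```

      Note that truncation can change which fixation counts as final, so final-gaze measures have to be recomputed for each truncation value.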

      Reviewer #2 (Public Review):

      Recommendation 1: The only real issue that I see with the paper is fairly obvious: the authors find that the last fixations are longer than the rest, which is inconsistent with a lot of the human work. They argue that this is due to the reaching required in this task, and they take a somewhat ad-hoc approach to trying to correct for it. Specifically, they take the difference between final and non-final, second fixations, and then choose the 95th percentile of that distribution as the amount of time to subtract from the end of each trial. This amounts to about 200 ms being removed from the end of each trial. There are several issues with this approach. First, it assumes that final and non-final fixations should be the same length, when we know from other work that final fixations are generally shorter. Second, it seems to assume that this 200ms is "the latency between the time that the subject commits to the movement and the time that the movement is actually detected by the experimenter". However, there is a mismatch between that explanation and the details of the task. Those last 200ms are before the monkey releases the middle lever, not before the monkey makes a left/right choice. When the monkey releases the middle lever, the stimuli disappear and they then have 500ms to press the left or right lever. But, the reaction time and fixation data terminate when the monkey releases the middle lever. Consequently, I don't find it very likely that the monkeys are using those last 200ms to plan their hand movement after releasing the middle lever.

      Thanks for the opportunity to clarify these points. There are three related issues:

      First, with regard to fixation durations, in the updated Figure 3 we now show durations as a function of both the absolute order in the trial (first, second, third, fourth, etc.) and the relative order (final/nonfinal). We find that durations decrease as a function of absolute order in the trial, an effect also seen in humans (see Manohar & Husain, 2013). At the same time, while holding absolute order constant, final fixations are longer than non-final fixations. To explain the discrepancy with human final fixation durations, we note that monkeys make many fewer fixations per trial (~2.5) than humans do (~3.7, computed from publicly available data from Krajbich et al., 2010). This means that compared to humans, monkeys’ final fixations occur earlier in the trial (e.g., second or third), and are therefore comparatively longer in duration. Note that studies with humans have not independently measured fixation durations by absolute and relative order, and therefore would not have detected the potential interaction between the two effects.

      Second, the comment suggests that the final 200ms before lever lift is not spent planning the left/right movement, given that the monkeys have time after the lever lift in which to execute the movement (400 or 500ms, depending on the monkey). The presumption appears to be that 400/500ms should be sufficient to plan a left/right reach. However, we think that these two suggestions are unlikely, and that our original interpretation is the most plausible. First, the 400/500ms deadline between lift and left/right press was set to encourage the monkeys to complete the reach as fast as possible, to minimize deliberations or changes of mind after lifting the lever. More specifically, these deadlines were designed so that on ~0.5% of trials, the monkeys actually fail to complete the reach within the deadline and fail to obtain a reward. This manipulation was effective at motivating fast reaches, as the average reach latency (time between lift and press) was 165ms (SEM 20ms) for Monkey K, and 290ms (SEM 100ms) for Monkey C.

      Therefore, given the time pressure imposed by the task, it is very unlikely that significant reach planning occurs after the lever lift. In addition to these empirical considerations, the idea that the final moments before the RT are used for motor planning is a standard assumption in many theoretical models of choice (including sequential sampling models, see Ratcliff & McKoon 2008, for review), and is also well-supported by studies of motor control and motor system neurophysiology. Based on these, we think the assumption of some form of terminal NDT is warranted.

      Third, we have changed our method for estimating the NDT interval. In brief, we sweep through a range of NDT truncation values (0-400ms) and identify the smallest interval (100ms) that minimizes the contribution of “additive” gaze effects, which are thought to reflect late-stage, post-decision gaze processes. See the response to Point 4 for Reviewer 1 above, Figure 4 and lines 267-325 in the main text. In addition, we report all of the major study results over a range of truncation values between 0 and 400ms.
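
      The selection step in this sweep can be sketched as below. The relative-fit numbers are hypothetical placeholders chosen only to mirror the qualitative pattern described in the response (nonzero with no truncation, zero from 100ms onward); the tolerance `tol` and the function name `smallest_sufficient_truncation` are also assumptions.

```python
def smallest_sufficient_truncation(additive_fit_by_cut, tol=0.01):
    """Return the smallest truncation (ms) at which the additive model's
    relative fit is at or below tol, or None if no value qualifies."""
    for cut in sorted(additive_fit_by_cut):
        if additive_fit_by_cut[cut] <= tol:
            return cut
    return None


# Hypothetical relative fit of the additive model at each truncation (ms).
additive_fit = {0: 0.45, 50: 0.20, 100: 0.0, 150: 0.0, 200: 0.0, 300: 0.0, 400: 0.0}
print(smallest_sufficient_truncation(additive_fit))
```

      Choosing the smallest qualifying value keeps as much pre-decision gaze data as possible while still excluding the window in which additive effects contribute.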