5,619 Matching Annotations
  1. Aug 2022
    1. Author Response

      Reviewer #1 (Public Review):

      Tarasov and colleagues provide data that extensively phenotypes TGAC8 mice, which exhibit a cAMP-mediated increase in cardiac workload prior to developing heart failure. The authors confirm data from prior studies, showing increased cardiac output mediated by changes in heart rate with similar ejection fraction. 

      The above is slightly incorrect as stated. Our results section stated that HR and EF were increased in TGAC8, but that stroke volume did not differ by genotype. Thus 30% increase in cardiac output in TGAC8 was attributable to the increased HR.

      The study is overall well-planned and the amount of data presented by the authors is impressive. The work nicely incorporates animal-level physiology (echocardiography data), tests for known canonical markers of hypertrophy, and then delves into an unbiased analysis of the transcriptome and proteome of LV tissue in bulk. The techniques and analyses in the study are adequately executed and within the realm of expertise of the Lakatta laboratory. This study is a necessary and crucial first step to extensively phenotype this mouse line and generate hypotheses for further work. 

      Reviewer #2 (Public Review): 

      Tarasov et al. present an impressive amount of work in their in-depth assessment of a murine model of chronic stress in a transgenic line with constitutively active AC/cAMP/PKA/Ca2+ signaling that spans cardiac structure, function, cellular architecture, gene and protein expression, mitochondrial function, energetics and more. Exploration of multiple cellular pathways throughout the manuscript and as summarized in Figure 16 help characterize this murine model and serves as a first step in using this model to understanding the effect of chronic stress on the heart. The conclusions of the manuscript are well-supported by the data, and I have the following comments: 

      Strengths: 

      1. The authors present echocardiographic, histologic, electrocardiographic, neurohormonal quantification, protein synthesis/degradation, mitochondrial, gene and protein expression profiling, and metabolism data in their assessment of this model. 

      2. The verification of increased transcripts of AC and PKA activation in this transgenic line provided validation for the model. 

      3. The pathway analyses for both gene and protein expression profiling help supports the authors' claim of the importance of differences noted in the various pathways between the transgenic line and controls. 

      4. The investigators posit that there is decreased wall stress and adequate energy production due to a shift in metabolism. 

      As written, this statement does not exactly reflect what we had intended to communicate in the paper. We did not posit, that LV wall stress was reduced in TGAC8, but that it must be reduced compared to WT on the basis of Laplace’s Law because of a substantial reduction of LV cavity volume. We also did not posit that energy production is due to a shift in metabolism, but rather, that adaptations in energy metabolism resulted in adequate energy production to meet, what appeared to us to be a marked increase in energy demand in TGAC8 vs WT, based on our observation that transcriptome and proteome gene ontology (GO) terms that differed in TGAC8 vs WT, covered nearly all biological processes and molecular functions within nearly all compartments of the LV myocardium.

      These findings would suggest that this model would be suitable for that of an athlete's heart, which is characterized by thickened left ventricular walls without a compromise in function. 

      Although the chronic increase in cardiac output in TGAC8 heart simulates that of an athlete’s heart during exercise, LV cavity volume at rest is larger in the endurance trained heart and this is associated with bradycardia. In these aspects, the TGAC8 heart differs from the endurance trained heart (perhaps because it does not have sufficient rest periods between bouts of exercise, as does the endurance trained heart). In the discussion section of the manuscript, we noted several features that differed between the TGAC8 vs the endurance trained heart. 

      However, the mice do develop heart failure after 1 year without a sense of mechanism despite the wealth of data provided. Are the authors able to comment on what changes described in this study of this transgenic line may be deleterious in the long run? 

      Heart failure in the long run, had first been described in the TGAC8 mouse by Mougenot et. al. (ref 10 in our manuscript) who performed numerous biochemical and biophysical measurements in TGAC8 and WT attributed the heart failure to be a manifestation of accelerated heart aging. We are in the midst of conducting a longitudinal study of cardiac structure and function in the TGAC8 vs WT as these mice age, along with additional non-biased multi-omics analyses in order to get an overview about which of adaptive pathways that are activated in TGAC8 heart at 3 months of age become faltered with advancing age and how changes in these pathways relate to the altered cardiac structure and function of the TGAC8 as age advances. Following that, we will focus on each of these pathways employing detailed mechanistic analyses. Our provisional hypothesis is that while AC8 activity will continue to be increased as age advances, its downstream signaling will begin to fail due to age-associated changes in proteostasis and in the expression of proteins, including those involved in energy metabolism.

      Weaknesses: 

      1.  As acknowledged by the investigators, this is a hypothesis-generating rather than hypothesistesting study. 

      Yes, we used a systems approach at first, in order to “open our eyes” so that we could get an overview of numerous changes that might have occurred in the TGAC8 heart in order to generate hypotheses that could later be tested by others and by us.”

      2.  The investigators posit that there is decreased wall stress and adequate energy production due to a shift in metabolism. These findings would suggest that this model would be suitable for that of an athlete's heart, which is characterized by thickened left ventricular walls without a compromise in function. However, the mice do develop heart failure after 1 year without a sense of mechanism despite the wealth of data provided. Are the authors able to comment on what changes described in this study of this transgenic line may be deleterious in the long run? 

      We have addressed these comments above in our response to your comment #4 under strengths.

      3.  Figure 5B is referenced to support the claim regarding beta adrenergic receptor desensitization, but the data show catecholamine levels in tissue. I would have expected receptor expression analysis to suggest up/downregulation of receptors at the membrane to support this claim. 

      Beta adrenergic receptor desensitization can occur due to changes in molecules that inhibit signaling that are at the receptor or at the signaling downstream of the receptor in the absence of changes in receptor number. Here is how we summed this up in our manuscript:  “Numerous molecules that inhibit βAR signaling, (e.g. Grk5 by 2.6 fold in RNASEQ and 30% in proteome; Dab2 by 1.14 fold in RNASEQ and 18% in proteome; and β-arrestin by 1.2 fold in RNASEQ and 14% in proteome) were upregulated in the TGAC8 vs WT LV (Table S.3, S.5 and S.9), suggesting that βAR signaling is downregulated in TGAC8 vs WT, and prior studies indicate that βAR stimulation-induced contractile and HR responses are blunted in TGAC8 vs WT.8,11… A blunted response to βAR stimulation in a prior report was linked to a smaller increase in L-type Ca2+ channel current in response to βAR stimulation in the context of increased PDE activity.13, 14 WB analyses showed that PDE3A and PDE4A expression increased by 94% and 36%, respectively in TGAC8 vs WT, whereas PDE4B and PDE4D did not differ statistically by genotype (Figure 16-supplement 1 A). In addition to mechanisms that limit cAMP signaling, the expression of endogenous PKI-inhibitor protein (PKIA), which limits signaling of downstream of PKA was increased by 93% (p<0.001) in TGAC8 vs WT (Table S.3). Protein phosphatase 1 (PP1) was increased by 50% (Figure 16-supplement 1 A). The DopamineDARPP-32 feedback on cAMP signaling pathway was enriched and also activated in TGAC8 vs WT (Figure 15), the LV and plasma levels of dopamine were increased, and DARPP-32 protein was increased in WB by 269% (Figure 16-supplement 1 A).

      Thus, mechanisms that limit signaling downstream of AC-PKA signaling (βAR desensitization, increased PDEs, PKI inhibitor protein, and phosphoprotein phosphatases, and increased DARPP32, cAMP (dopamine- and cAMP-regulated phosphoprotein)) are crucial components of the cardio-protection circuit that emerge in response to chronic and marked increases in AC and PKA activities (Figure 4 C, F).” 

      4. Changes in ion channel (e.g. KCNQ1 and KCNJ2) gene and protein expression were described but not validated by assessment of change in function. 

      Reviewer #3 (Public Review): 

      Tarasov et al have undertaken a very extensive series of studies in a transgenic mouse model (cardiomyocyte-specific overexpression of adenylyl cyclase type 8) that apparently resists the chronic stress of excessive cAMP signaling for around a year or so without overt heart failure. Based on the extensive analyses, including RNAseq and proteomic screening, the authors have hunted for potential "adaptive" or "protective" pathways. There is a wealth of information in this study and the experiments appear to have been carefully performed from a technical viewpoint. Many interesting pathways are identified and there is plenty of information where additional experiments could be designed. 

      General comments 

      1. Ultimately, this is a descriptive and hypothesis-generating study rather than providing directly proven mechanistic insights.

      As noted in response to Reviewer #2: “Yes, we used a systems approach at first, in order to “open our eyes” so that we could get an overview of numerous changes that might have occurred in the TGAC8 heart in order to generate hypotheses that could later be tested by others and by us.”

      -Given several prior studies reporting a detrimental effect of chronically increased cAMP signaling, what is it that is different in this model? Is it something specific about AC8? Is it to do with when (in life) the stress commences? 

      We believe it is, at least in part, due to something specific about the effects of the marked increased activity of AC8 perse, because adenylyl cyclase singling impacts nearly all aspects of our current knowledge of cell biology. Thus, due to the marked increase of AC and PKA activation in the TGAC8 heart, the transcriptome and proteome gene ontology (GO) terms that differ in TGAC8 vs. WT covered nearly all biological processes and molecular functions within nearly all compartments of the TGAC8 LV myocardium.

      - Is the information herein relevant to stress adaptation in general or is it just something interesting in this specific mouse model?

      In our opinion, AC8 mouse model is very relevant to stress adaptation in general, but this broad view has hardly ever been realized previously in the literature, because of the reductionist nature (by necessity) of mainstream biomedical research. For example, reports on cardiac specific overexpression of AC5 and AC6 never provided broader view on these mice and were focused only on a limited number of traits i.e., arrhythmogenesis, chronic pressure overload, contraction (Am J Physiol Heart Circ Physiol. 2015 Feb 1;308(3):H240-9; Am J Physiol Heart Circ Physiol. 2010 Sep;299(3):H707-12; Clin Transl Sci. 2008 Dec;1(3):221-7; Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9986-90; Am J Physiol Heart Circ Physiol. 2013 Jul 1;305(1):H1-8). 

      None of the pathways that are apparently activated were directly perturbed so their mechanistic role requires further study.

      We agree and have entitled a section of our discussion “Opportunities for Future Scientific Inquiry Afforded by the Present Results” to address this plainly.

      Specific 

      1. The strain of the mice and their sex needs to be stated as well as the exact age at which the various assays were performed.

      All assays were performed on 3-month-old males. This information was inadvertently not directly stated in the original submission.  

      2. The hearts of the Tg mice have more cardiomyocytes but which are smaller. Since there is no observed increase in proliferation of cardiomyocytes, how (or when) did this increase in cell number occur?   

      It is likely that an increase in number of cardiomyocytes may have occurred during the embryonic stage of development (8.5 dpc), when AC8 expression begins. Since submitting our manuscript we have found that the expression level of human AC8 (the type of AC8 employed in this transgenic model) increases markedly during the embryonic period when compared to endogenous AC8 and remains elevated in both the fetal and perinatal periods. 

      3. While the mice do not show an increased mortality up to 12 months of age, HR/CO/EF are poor indices of contractile function. Data on end-systolic elastance or perhaps echo-based LV strain indices which will be relatively load-independent would be useful.

      Numerous comprehensive hemodynamic measurements have been performed previously on this mouse. For example, Mougenot et. al (Ref 10 in our manuscript), based on invasive hemodynamics analysis concluded that contractile function in the TGAC8 heart was increased at both 2 and 12 months of age. But Doppler imaging of the heart in conscience mice, unmasked, myocardial dysfunction, informed by a reduction in systolic strain rate in both old TGAC8 and WT littermates. This is why they attributed the heart failure in TGAC8 at 12 months of age to be a manifestation of accelerated aging.

      We agree with your comment that end-systolic elastance ought to be measured in the TGAC8 but also end-diastolic elastance, and effective arterial elastance should be measured in order to quantify diastolic function and heart energetic coupling in the TGAC8.  

      4.  Quite a lot of conclusions are made relating to metabolism. However, this is entirely based on gene expression or protein levels. Given the substantial role of allosteric regulation in metabolic control, as well as the interconnectedness of metabolic pathways, ultimately any robust conclusions need to be based on an assessment of activity of key pathways. 

      We concur and have described some of the types of metabolic assessments in the last section of our discussion “Opportunities for Future Scientific Inquiry Afforded by the Present Results”: “… precisely defining shifts in metabolism within the cell types that comprise the TGAC8 LV myocardium via metabolomic analyses, including fluxomics.97 It will be also important that future metabolomics studies elucidate post-translational modifications (e.g. phosphorylation, acetylation, ubiquitination and 14-3-3 binding) of specific metabolic enzymes of the TGAC8 LV, and how these modifications affect their enzymatic activity”.

    1. Author Response

      Reviewer #1 (Public Review):

      In their manuscript, these authors present a novel geostatistical framework for modelling the complex animal-environment-human interaction underlying Leptospira infections in a marginalised urban setting in Salvador, Brazil.

      In their work, the authors combine human infection data and the rattiness framework of Eyre et al. (Journal of the Royal Society Interface, 2020) . They use seroconversion defined as an MAT titer increase from negative to over 1:50 or a four-fold increase in titer for either serovar between paired samples from cohort subjects. Whereas this is a commonly used measure of infection; the work would benefit from answering the question about how robust results are related to this definition of seroconversion.

      Thank you for your comment. We have acknowledged this on line 534 in the discussion by adding the following text: “A possible limitation of this study is the titre rise cut-off values used for classifying seroconversion and reinfection in the cohort that determine the sensitivity and specificity of the infection criteria. However, these criteria were used because they are the standard definitions for serological determination of infection that are commonly applied for leptospirosis and a wide range of other infections, and they enable the comparison of results with other previous leptospirosis studies.”

      The model framework relies on the concept of 'rattiness' previously defined by Eyre et al. (JRSI, 2020) and assumes conditional independence within its built up (equation (1)). Whereas this is a reasonable assumption, it would be good to discuss situations in which this assumption is questionable and what the implications are for applying the modelling framework to other settings.

      We have added the following text immediately after “is shown schematically in Figure 2” following equation (1) on line 225: “The conditional independence assumption in (1) is reasonable for a vector-borne disease or one that is transmitted indirectly, in which context the observed rat indices are to be considered as noisy indicators of the unobservable spatial variation in the extent to which the environment is contaminated with rat-derived pathogen. It would be more questionable for applications in which the disease of interest is spread by direct transmission from rat to human.”

      The authors provide an extensive model building exercise and investigate, in different ways, whether the model captures the necessary complexity (GAM smoothers - testing linearity, spatial correlation, etc). I believe the work would benefit from (1) a formal diagnostic investigation, if feasible; (2) providing guidelines on how model building should be performed.

      We have added a new Appendix 7 with diagnostic plots of randomized quantile residuals to check the rattiness-infection model fit with the human infection data and included the following text in Section 2.4 of the main text: “A formal diagnostic investigation of randomized quantile residuals is included in Appendix 7. We found no evidence in the diagnostic plots to suggest that there were issues with our modelling approach.”

      To supplement the R code that is publicly available for repeating all of the steps in this analysis, we have now also included a detailed step-by-step explanation of the model building process in Appendix 8 that outlines the key steps for building the rat and infection components of the model (variable selection and evaluation of residual spatial autocorrelation) and fitting and examining the joint rattiness-infection model. We have added the following text in Section 2.6 of the main text: “We also include a step-by-step explanation of the model building process to guide future users of the rattiness-infection framework in Appendix 8.”

      The authors are to be acknowledged for providing an extensive and thorough discussion of the different aspects of their work. Whereas the discussion is complete, I wonder whether the authors can give a brief example about how this model can be applied in a different setting.

      Thank you. We have added the following text on line 551 in the discussion: “The framework may have important applications beyond the study of zoonotic spillover, with the rattiness component replaced by other exposure measures e.g. mosquito density or ecological indices (such as pollution, where there are multiple, related measures of air or groundwater quality) to model associations with human or animal health outcomes.”

      Reviewer #2 (Public Review):

      Eyre et al. developed and applied a novel geostatistical framework for joint spatial modeling of multiple indices of pathogen (Leptospira) reservoir (rats) abundance and human infection risk. This framework enabled evaluation of infection risk at a fine spatial scale and accounted for uncertainty in the pathogen reservoir abundance estimates. The authors used data collected in two different field projects: (1) a rat ecology study in which three different approaches were used to detect rat presence "rattiness", and (2) a prospective community cohort study in which individuals were sampled during two different time periods to detect recent infections via seroconversion or a four-fold increase in anti-Leptospira antibody MAT titer. Univariable and then multivariable analyses were performed on these data to identify (1) the environmental variables that best predicted "rattiness", and (2) the demographic/social, environmental (household), occupational, and behavioral variables that best predicted human risk of infection. Once identified, the best predictors from (1) and (2) were included in a final, joint model to identify the significant predictors of both 'rattiness' and human infection risk. As a result of this study, the authors were able to detect spatial heterogeneity in leptospiral transmission to humans. They found that infection risk associated with increases in reservoir abundance differed by elevation, and that increases in reservoir abundance at high elevation were associated with a much higher odds ratio for infection than at low elevation. The authors suggest that this has to do with differences in how the infectious leptospires (shed by the rat reservoir) are dispersed in the environment. At high elevations, flooding is less frequent and thus rat shed leptospires are likely to stay where the rat deposited them. Whereas at lower elevations, flooding may play a large role in spreading leptospires more evenly across the landscape, reducing the importance of rat presence at smaller spatial scales. The final best model was then used to generate prediction maps of 'rattiness' as well as human infection risk at all locations within the study area (i.e. including those that lacked rat detection data and human infection data. This work represents an important advance in infection risk modeling as it explicitly incorporates estimates of reservoir abundance and the uncertainty surrounding these estimates into the infection risk assessment, and allows for modeling of infection risk at fine spatial scales. Findings from this study have important management implications at the authors' study site as it suggests that interventions directed at high elevations should be different from those designed to address infection risk at lower elevations. However these are broader implications, as this novel approach may be applied to other systems to enable identification of differences in infection risk for other pathogens at a fine spatial scale, predict infection risk more broadly, and facilitate intervention strategies targeted for the specific epidemiological and ecological conditions experienced by a population.

      This was a well-designed study. The field sampling approach was well balanced, well described and appropriate. Broadly the modeling framework is appropriate for the questions being asked and for the data being used. The variable and model selection approaches were clearly described and appropriate. Evaluation of the more detailed mathematical approach is outside of my area of expertise, so I am unable to comment on the validity of the approach.

      For the most part, the explanatory variables assessed in the different models were well described and justified, however there were some cases for which further explanation would have been helpful. For example, how did the authors determine which occupations to evaluate? Specifically, why traveling salesperson? What is the difference between open sewer within 10 m and unprotected from sewer?

      We have added the following additional text to Section 2.3.2 on line 297 to clarify the definition and reason for inclusion for these variables: “In the household environment domain, two variables were used to capture risk due to sewer flooding close to the household: i) the presence of an open sewer within 10 metres of the household location and ii) a binary `unprotected from open sewer' variable which identified those households within 10 metres of an open sewer that did not have any physical barriers erected to prevent water overflow. Three high-risk occupations were included in the occupational exposures domain as binary variables. Construction workers and refuse collectors have direct contact with potentially contaminated soil, building materials and refuse in areas that provide harbourage and food for rats. Travelling salespeople have regular and high levels of exposure to the environment (particularly during flooding events) as they move from house to house by foot. Two other binary occupational exposure variables were included that measured whether a participant worked in an occupation that involves contact with floodwater or sewer water.”

      I also had some concerns regarding the time-period of the rat ecology study used to determine abundance, potential fluctuations in rat abundance through time, and how this might align with sampling to detect infection in humans. Depending on the time scale of population fluctuation in rats as well as fluctuations in infection prevalence in rats, the abundances calculated from data from the ecology study may not be accurately reflecting true abundance (and therefore shedding and transmission risk) during the time period that a human may have been exposed. However, the authors do a nice job of addressing some of these issues in the discussion. They mention that infection prevalence in rats is consistently around 80% and that there don't appear to be seasonal fluctuations in human exposure risk in the study area.

      Thank you.

      Reviewer #3 (Public Review):

      The goal of the authors was to test how important local rat abundance is as a driver of Leptospira infection in humans.

      The authors approached this using a strong combination of datasets on human infection risk and rat abundance, across a spatial scale that is large enough to allow simultaneous assessment of multiple potentially important drivers of infection risk. This further enables the authors to develop infection prediction maps based on the fitted models.

      This study design is a major advance towards understanding link between rat abundance and human infection risk.

      Based on the top models tested in the study, the authors conclude that local rat abundance is indeed correlated with infection risk, and that this correlation is strongest at higher elevation.

      This is an impactful finding, but in my opinion it is not yet clear how robust and important this is, because of two reasons:

      (1) The infection risk data: while the actual infection risk data are not shown, the map shown in Figure 5B suggests that there is an infection hotspot that happens to be at high elevation. This raises the question of how strongly this single hotspot is driving the observed correlation between rat abundance and infection risk (which the authors find to be much stronger at high elevation than at lower elevations).

      We have added a new figure (Figure 4) earlier on in the article (we decided to add this here rather than to Figure 6 - formerly Figure 5 - to ensure that the map is large enough that points in Figure 4A are easily visible – please note that it is included as a larger and easier to view image in the main eLife template version) with the raw infection data overlaid on contour lines for the three elevation levels to provide the reader with a better overview of the raw data. This new Figure 4 shows that out of a total of 403 participants in the high elevation region there were 16 infections, of which only 5 (31%) were located in the large hotspot in Valley 3 (valleys are numbered 1 to 3 from west to east, see Figure 1A). In addition to the largest hotspot in the north of Valley 3, there are several other areas in the high elevation region with raised predicted infection risk values relative to their surroundings where there were also rattiness hotspots and infected participants in the raw data: fives cases (red and yellow infection risk areas in Figure 5B) on the western side of Valley 2; the two cases on the eastern edge of Valley 2; the two cases on the western edge of Valley 3; and the single case in the southwest of Valley 3. Other variables are also important drivers of infection risk and at several of these locations the contribution of rattiness increases infection risk significantly relative to the low-risk surrounding area (e.g. to 10% in areas where risk is closer to 1% or 2%) without reaching the more obviously visible high infection risk values closer to 20%. We believe that our statistical model provides a better test of whether there is a statistical association between rattiness and infection at high elevations than a visual examination, but that this is supported by the large number of observations in the high elevation area (403) and the distribution of infected and uninfected households, which demonstrates that the observed association is not only driven by the hotspot in Valley 2.

      (2) The statistical models: if I understand correctly, all tested models of infection risk include the variable rat abundance, and while the individual effect estimates for rat abundance are statistically significant (Table 3), the more important question of how the fit of a model without the rat abundance variables compares with those of the other tested models (shown in Supplementary Table S2) has not been addressed.

      These models were considered but were ranked outside of the top five models and for this reason were not reported in Table S2. We agree that showing the AIC of a model without rattiness in this table can more clearly demonstrate the improved fit of the model with rattiness. To do this we have added the highest ranked model without rattiness (M) to Table S2 and added a note to the table explaining the reason for its inclusion (“Model M was ranked outside of the top 5 models but is included here for reference to demonstrate the improvement in model fit when rattiness is included”). The AIC of M* was 532.13. This is substantially higher than the top five models (M1 = 523.14 and M5 = 525.04), justifying its inclusion in this model and in the joint rattiness-infection framework.

      Regardless of whether rat abundance is an important driver of human infection risk, this study is a major step in our understanding of the role of rats in the spread of leptospirosis, due to the strong combination of a unique combination of datasets and a spatial statistical modeling approach.

      Thank you.

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript discusses evolutionary patterns of manipulation of others' allocation of investment in individual reproduction relative to group productivity. Three traits are considered: this investment, manipulation of others' investment, and resistance to this investment. The main result of the manuscript is that the joint evolution of these traits can lead to the maintenance of diversity through, as documented here, cyclic (or noisier) dynamics. Although there are some analytical results, this main conclusion is instead supported by individual-based simulations, which seem correctly performed (but for clonal populations, as emphasized below).

      There could be material for a good paper here but the organization of the manuscript makes it difficult to fully evaluate. The narrative is highly condensed, with the drawbacks that this often entails in terms of accurately conveying the results of a study, as illustrated here by the following issue.

      The population is apparently assumed to be clonal (more than just "haploid"), meaning that there is no recombination between the loci controlling the three traits. In the one case where this assumption is relaxed (quite artificially), the cyclic dynamics disappear (section 4.4 of the appendix). This is crucial information that cannot be appreciated in the main text.

      The paragraph at line 368 offers a simple explanation for the joint dynamics of traits. However, this explanation would hold identically for a sexual population and a clonal population, whereas these two cases seem to have completely different dynamics. Thus, there is something essential to explain these differences, that is missing from the given explanation.

      Yes, our model was asexual with no recombination. To address this comment, we carried additional simulations where recombination was allowed (Appendix 1— 4.8). We found that recombination does not change our results (predictions), and describe this on line 469-475. By assuming additive effects of traits and each traits having the same dispersal property, our haploid asexual model is also equivalent to a diploid sexual model (Taylor 1996; Day & Taylor 1998).

      This is especially important because the finding that the joint evolution of several traits can lead to some form of diversity maintenance is not surprising. As the discussion acknowledges (but the introduction seems to downplay), it is also well understood that manipulation and counter-adaptations to it can occur in many contexts and lead to the maintenance of diversity. For this reason, similar results in the present case are not surprising, and the main outcome of the study should be to provide a deeper understanding of the forces leading to the different outcomes in the current models.

      I do not see clearly what distinguishes "manipulative cheating" from other forms of manipulations that have been previously discussed in the literature (e.g, as cited lines 461). Couldn't this be clarified by some kind of mathematical criterion?

      Thanks for pointing out that there is room to improve the distinction between our model and previous models! We have added more description to explain the conceptual difference on line 187-193, and a new subsection in appendix to show these differences through mathematically examine the fitness formulations in previous models (Appendix 1—1.3).

    1. Author Response

      Reviewer #1 (Public Review):

      This paper addresses an important question: whether the conduction velocity in white matter tracts is related to individual differences in memory performance. The authors use novel MRI techniques to estimate the "g-ratio" in vivo in humans - the ratio of the inner axon relative to the inner axon plus its outer myelin sheath. They find that autobiographical recall is positively related to the g-ratio in a specific white matter tract (the parahippocampal cingulum bundle) in a population of 217 healthy adults. This main finding is extended by showing that better memory is associated with larger inner axon diameters and lower neurite dispersion, which suggests more coherently organised neurites. The authors also argue that their results show that the magnetic resonance (MR) g-ratio can reveal novel insights into individual differences in cognition and how the human brain processes information.

      The study is exploratory in nature and the analyses were not pre-registered. The technique has not been used before to associate cognitive performance with MR estimates of conduction velocity in candidate white matter tracts. It is therefore unknown how strong any associations are likely to be and what sort of sample size might be needed to observe them. Nevertheless, if the technique proves to be reliable, then it certainly offers a valuable new tool to understand individual differences in cognitive abilities. However, brain structure to behavior associations are notoriously variable across studies and have been argued to require very large sample sizes to obtain reproducible results.

      We respectfully disagree that the study was exploratory. We had distinct aims and hypotheses from the outset. Our prime interest is in autobiographical memory, the hippocampus and its connectivity. This motivated our focus on three specific white matter tracts. We also planned from the time of study design to examine the MR g-ratio, and even contributed to refining the pre-processing pipeline for this approach, as reported in a previous paper (Clark et al., 2021, Frontiers in Neuroscience). Moreover, in the current manuscript we outlined well thought through possible outcomes and declared specific predictions.

      Regarding pre-registration, due to the scope of this work, the experiment was planned eight years ago, and data collection commenced seven years ago. At that time, formal pre-registration was not common practice. However, it has been a long-standing feature of our Centre that proposed studies and their analysis plans undergo rigorous internal peer review, including presentation to the whole Centre, before data acquisition can commence. The proposal for the research under consideration here was presented on 26th September, 2014.

      As noted in our response to the Editors’ Public Evaluation Summary above, someone has to be the first to report a novel result, and we believe that the depth and transparency of our approach permits confidence in the findings. Not least, and to reprise, because we employed the most widely-used and best-validated method of testing autobiographical memory recall that is currently available – Levine’s Autobiographical Interview. Our primary analyses were performed using the behavioural outcome measure from this test, the results of which were directly compared to those from a closely-matched control measure to test whether significantly larger effects were observed for our variable of interest. The potential for false positives was further reduced by extracting microstructure data from hypothesised tracts of interest (instead of performing whole brain voxel-wise analyses), with statistical correction performed on all structure-behaviour analyses. Moreover, we performed partial correlations with age, gender, scanner and number of voxels in a region of interest (ROI) as covariates. Complementary investigations were also conducted using other commonly-reported measures, providing supporting evidence. We report all analyses (and provide all the source data), including those finding no relationships. The consistent results throughout were associations between autobiographical memory recall ability and the microstructure of the parahippocampal cingulum bundle only. Moreover, thanks to the excellent suggestions of the Reviewers, the revised version reports additional analyses that allow us to further corroborate and interpret our findings.

      Our sample of 217 participants allowed for sufficient power to identify medium effect sizes when conducting correlation analyses at alpha levels of 0.01 and when comparing correlations at alpha levels of 0.05 (Cohen, 1992, Psychological Bulletin). While it has recently been suggested that thousands of participants are required in order to investigate brain structure-behaviour associations (Marek et al., 2022, Nature), other, more sophisticated, analyses suggest that samples of ~200 participants can be sufficient, in line with our estimates (Cecchetti and Handjaras, https://psyarxiv.com/c8xwe; DeYoung et al., https://psyarxiv.com/sfnmk). Given that our study was principled, well-controlled, analysed appropriately and produced very specific and consistent findings, we are confident that the findings are robust.

      The authors decided to analyse performance on a single task - the Autobiographical Memory Interview - and identified three candidate white matter tracts that connect the hippocampal region with other brain regions. While it is clear why these three tracts were chosen, it is less obvious why the authors chose to investigate associations with the Autobiographical Memory Interview and not other memory tests that were part of the battery of tests administered to the participants. It is reasonable to assume that something as general as the conduction velocity of a white matter tract would have an effect on memory ability across a range of tasks, so to single out one seems an unnecessarily narrow focus.

      Our main interest over many years, and hence the focus of this study, is autobiographical memory recall because it directly relates to how people function in real life. As noted above, autobiographical experiences occur in dynamic, multisensory, multidimensional, non-linear, ever-changing contexts; they involve actively engaging with the environment and other people; they are embodied; they span milliseconds to decades. Many of these features cannot be captured by laboratory-based episodic memory tests. This issue is increasingly being discussed (for example, see recent reviews by Nastase et al., 2020, NeuroImage; Mobbs et al., 2021, Neuron; Miller et al., 2022, Current Biology). It is further laid bare in McDermott et al.’s (2009, Neuropsychologia) meta-analysis of functional MRI studies which showed that laboratory-based and autobiographical memory retrieval tasks differ substantially in terms of their neural substrates. Consequently, we were not surprised to find that when we analysed laboratory-based memory test performance, there were no correlations with the MR g-ratio. Recall of vivid, detailed, multimodal, autobiographical memories may rely on inter-regional connectivity to a greater degree than simpler, more constrained laboratory-based memory tests. Therefore, as well as speaking to conduction velocity, these findings also contribute to wider discussions about real-world compared to laboratory-based memory tests. We thank the Reviewer for making the excellent suggestion to include these additional data, analyses and discussion points.

      The results of the study are interesting and highlight a key role of the parahippocampal cingulum bundle in autobiographical memory recall. The results are corrected for multiple comparisons across the three fiber tracts of interest and the recall of "external details" provides a nice control compared to the "internal details" which are the measure of interest. The main findings are extended to show that it is likely to be an increase in axon diameter and an increase in neurite coherency that characterize those individuals with better autobiographical recall. Despite these positives, it remains unclear whether memory recall, in general, is better in people with higher g-ratios in this tract (as implied in the Abstract), or if this effect is specific to scores on the Autobiographical Memory Interview.

      Our interest is in autobiographical memory, and so we employed the most widely-used and best-validated method of testing autobiographical memory recall that is currently available – Levine’s Autobiographical Interview. Not only does this test include a control measure, external details (as noted by the Reviewer), but we had independent raters score the autobiographical memory descriptions, and found that the inter-class correlation coefficients were very high (see Materials and Methods). Despite using this current, gold standard approach, at the request of the Reviewer we have now analysed data from eight additional laboratory-based memory tests. These are standard memory tests that are often used in neuropsychological studies: testing recall - the immediate and delayed recall of the Logical Memory subtest of the Wechsler Memory Scale IV, the immediate and delayed recall of the Rey Auditory Verbal Learning Test, the delayed recall of the Rey–Osterrieth Complex Figure; testing recognition memory - the Warrington Recognition Memory Tests for Words and Faces; testing semantic memory - the “Dead or Alive Test”. While these tests can assess some aspects of memory recall, they cannot be regarded simply as proxies for autobiographical memory recall, for the reasons we outlined in our response to the previous point. They do not capture key aspects of autobiographical memories. It is therefore all the more interesting that we found no associations between these laboratory-based memory tasks and the MR g-ratio of the parahippocampal cingulum bundle, in contrast to the relationship identified with autobiographical memory recall ability. Recall of vivid, detailed, multimodal, autobiographical memories may rely on inter-regional connectivity to a greater degree than simpler, more constrained laboratory-based memory tests. Therefore, as well as speaking to conduction velocity, these findings also contribute to wider discussions about real-world compared to laboratory-based memory tests. We thank the Reviewer once again for making the excellent suggestion to include these additional data, analyses and discussion points.

      Reviewer #2 (Public Review):

      In this study, Clark and colleagues tackle a very intriguing question: how differences in autobiographical recall abilities reflect in the human brain structure and function? To answer this question, they interviewed a large cohort of subjects and proceeded to acquire MRI data, specifically diffusion-weighted imaging and magnetization transfer data, to estimate the g-ratio, a measure of myelination deeply linked to conduction velocity. Looking at three specific white matter pathways of interest, all interconnecting the hippocampus with other brain structures, they studied the relationship between the g-ratio and the autobiographical recall abilities, together with many more measures from MRI. They found a significant positive association between the g-ratio of the parahippocampal cingulum bundle and the number of inner details from the interviews. These results can provide new potential directions to further study the underlying neural features beyond memory.

      I think that this is a very interesting article, it is well written, the methods are extensively explained, and the appendix provides further details for more expert readers. The authors put an effort into providing a comprehensive context in the introduction and in the discussion, and as a result, the paper seems overall quite suitable for both general and specialistic readerships.

      Thank you.

      The main issue I can currently see in the paper is that the mentioned relationship between g-ratio and recall abilities is then used to infer that better recall abilities are associated with higher conduction velocity and larger axons. The authors' line of reasoning is that given the hypothesized association, the increase in the g-ratio implies increases in myelin and axonal diameter. Despite this scenario being indeed possible given the current result, an increased g-ratio may also not indicate higher conduction velocity. In fact, the first potential inference would be that, without having any information on the axon size, the quantify of myelin can indeed be lower and as result, the conduction velocity would decrease. I understand that the authors expected higher conduction velocity associated with better autobiographical memory recall, but it is hard to see any experimental outcome that could have disproved this hypothesis: from the possible scenarios depicted in the introduction, any change in the g-ratio (and even not any change at all) could indicate higher conduction velocity. What would be then needed to corroborate one of these scenarios is some independent or complementary measure, which unfortunately is missing.

      The mentioned issue does not mean that the paper loses relevance - I think that it should focus on the very practical result, a change in myelination and microstructure, and discuss what are the potential implications, including the one that currently dominates the discussion section.

      Thank you for these comments and the opportunity to provide further clarification.

      First, we have now provided additional background information regarding the relationship between the MR g-ratio and conduction velocity. We explicitly note that while finding a significant relationship between the MR g-ratio and autobiographical memory recall suggests the existence of an association between autobiographical memory recall and parahippocampal cingulum bundle conduction velocity, it cannot speak to the direction of this association.

      Second, we have further noted that interpretation of the parahippocampal cingulum bundle MR g-ratio in relation to the underlying microstructure requires knowledge, or an assumption, about whether the associated change in conduction velocity is faster or slower. Given that faster conduction velocity is thought to promote better cognition (e.g. Brancucci, 2012; Dicke and Roth, 2016; Miller, 1994; Reed and Jensen, 1992), we interpreted our MR g-ratio findings under the assumption of faster conduction velocity, and now explicitly note in several places in the revised manuscript that this is an assumption.

      Third, we thank the Reviewer for the excellent suggestion that a complementary measure could help to further inform the findings. Consequently, we now also include additional analyses examining the relationship between the extent of myelination and autobiographical memory recall ability. This is possible using the magnetisation transfer saturation maps, which are optimised to assess myelination. Given our assumption of faster conduction velocity when interpreting our positive MR g-ratio correlations, then no relationship between parahippocampal cingulum bundle magnetisation transfer saturation and autobiographical memory recall would be expected. On the other hand, if conduction velocity is actually decreasing, then a negative correlation between magnetisation transfer saturation values and autobiographical memory recall ability would be observed. In fact, we found no relationship between parahippocampal cingulum bundle magnetisation transfer saturation and autobiographical memory recall. This suggests that myelin was not associated with autobiographical memory recall ability, supporting our assumption that relationships with the MR g-ratio were indicative of faster rather than slower, conduction velocity.

      We have now added these new data, analyses and discussion points to the revised manuscript.

      It would also be helpful to include some paragraphs on both interpretation and methodological issues when it comes to MRI-based microstructural imaging, which at the moment is lacking. This would provide a better picture of the results for a more general readership.

      We agree, and additional consideration of interpretational and methodological limitations have now been included in the manuscript.

      As one of the first works using an MRI-based microstructural measure of myelin, the g-ratio, to study cognition in a large cohort of subjects, I think this work will be a needed and significant step towards merging the neuroscience and MRI physics community - the methodology presented here is robust and could be used in many other applications.

      Thank you.

      Reviewer #3 (Public Review):

      The manuscript adds useful information about how structural properties of the brain are related to individual differences in autobiographical memory. A novel metric is used to assess features of white matter in tracts that are important for information exchange between the hippocampus and other brain regions. In one of these, the parahippocampal bundle, a relationship between the MR g-ratio and autobiographical memory recall is identified. This represents new and interesting information. The authors interpret the results in line with the theory that speed of signal transmission is important for cognitive function.

      Thank you for this positive summary.

    1. Author Response

      Reviewer #1 (Public Review):

      Rasicci et al. have developed a FRET biosensor that is designed to light up when cardiac myosin folds. This structure is extremely important to understand, and its link to the super-relaxed (SRX) state has not been fully shown. Their study provides a comprehensive review of the literature and provides compelling data that the 15 heptad+leucine zipper+GFP construct does function well and that the DCM mutant E525K has a similar IVM velocity despite a reduced ATPase compared with HMM. They rely on the ionic strength-dependent changes in the rate of MantATP release to argue that the E525K mutation stabilizes the 'interacting heads motif' (IHM) state, which makes logical sense.

      Strengths:

      Well written and comprehensive.

      Utilizes the appropriate fluorescence-based sensor for measuring the folding of the myosin structure. Provides a detailed range of techniques to support the premise of the study

      Weaknesses:

      Over-interpretation of the outcomes from this study means that the IHM and SRX are the same. Similar studies, e.g. Anderson 2018 and Chu 2021 support the opposite view that IHM and SRX are not necessarily the same, Anderson (and Rohde 2018) point out that S1 has some element of a reduced ATPase, this clearly cannot be due to folding of the molecule. Also, mavacamten was used in these studies to show that even S1 is inhibited suggesting that SRX and IHM are not connected. This is not to say that with enough supporting evidence that these observations cannot be over-ridden, it is just not clear that there is enough in this study to support this conclusion.

      We have revised our discussion to emphasize that our results support a model in which the SRX state is enhanced by formation of the IHM, but given the S1 and 2HP data the IHM may not be required for populating the SRX biochemical state (see page 8).

      I felt that the authors passed over the recent Chu 2021 paper too quickly, the Thomas group used a FRET sensor as well and provides a direct comparison as a technique, but with opposite conclusions. They also have supporting data in Rohde 2018 that their constructs were less ionic strength sensitive. It would be useful to understand what the authors think about this.

      We have discussed the Rohde and Chu papers in more detail in the discussion (see page 8). In the Rhode paper they used proteolytically prepared HMM and S1. Rohde found 20% SRX at all KCl concentrations in S1, while HMM shifted from 50% to 20% SRX in low and high salt conditions, respectively. Our results are different in terms of the absolute fraction of the SRX state but the trend is similar in terms of S1 being salt-insensitive and HMM being salt-sensitive. The difference could be proteolytic HMM, which is a longer construct, and proteolytic S1, which is prone to internal cleavage that can impact ATPase activity. Another difference could be the mixed isoform of mantATP used in previous studies and the single isoform of mantATP used on our study (see page 5)

      Reviewer #2 (Public Review):

      The paper by Rasicci et al. examines the impact of the DCM mutation E525K in beta-cardiac myosin on its function and regulation by autoinhibition. The role of the auto-inhibited state of beta-cardiac myosin in fine-tuning cardiac contractility is an active and exciting area of current research related to muscle biology and cardiomyopathies. Several studies in the past have linked the destabilization of the autoinhibited, super-relaxed (SRX) state of myosin to the pathogenesis of hypertrophic cardiomyopathy. This timely study provides one of the first few examples where the hypocontractile phenotype of a DCM mutation has been linked to the stabilization of the SRX state.

      One of the strengths here is the utilization of a wide variety of both pre-existing and novel biochemical and biophysical assays for the study. The authors have characterized a new two-headed long-tailed myosin construct containing 15-heptad repeats of the proximal S2 (15HPZ), which they show allows myosin to form the SRX state in vitro using single ATP turnover assays. The authors go on to compare the E525K and WT proteins using the 15HPZ myosin construct in terms of their steady-state actin-activated ATPase activity, in-vitro actin-sliding velocity and single ATP turnover measurements. These assays reveal that the predominant effect of this mutation is the stabilization of the SRX state which is maintained even at 150 mM salt concentration where the WT SRX is largely disrupted. This is an important observation because DCM mutations so far have been believed to only affect the force-generating capacity of myosin.

      One of the biggest strengths of this study is the attempt to develop a FRET-based approach to directly ask if the biochemical SRX state here correlates well with the structural IHM state, which is an important unresolved question in the field. The authors have designed a FRET pair (C-terminal GFP and Cy3ATP bound to the active site) that is sensitive to the relative position of the heads and the tail, allowing them to distinguish between the low-FRET closed IHM conformation and the no-FRET open conformation. Remarkably, the authors show that the salt dependence of the FRET efficiency values closely follows their results from the salt dependence of the percent SRX for both WT and E525K proteins. The authors then attempt to substantiate their FRET results by a direct visual analysis of the conformational states populated by both WT and E525K proteins at low salt using negative staining EM analysis. The authors have optimized conditions to allow the deposition of the IHM state on grids without adding the small molecule mavacamten, which was found to be necessary in an earlier study to visualize the closed state using EM. The authors conclude that the SRX state correlates well with the IHM state and that the E525K mutation indeed stabilizes the folded-back conformation of myosin.

      This study significantly strengthens the previously illustrated correlation between the SRX and IHM states and provides methodological advances (especially visualization of the IHM state by negative EM in the absence of cross-linking agents) that will be very useful to the field going forward. The observation that a DCM mutation can lead to stabilization of the folded back state is a novel insight that should spark interest in the field to test how broadly this applies to other DCM mutations. The conclusions of the paper are mostly supported by the data; however, some clarifications and qualifications are needed.

      Weaknesses:

      The extremely low enzymatic activity of the M2β 15HPZ myosins as compared to the WT S1 control (which is a historical control not assayed in parallel with the 15HPZ proteins), is concerning for the low protein quality of the 15HPZ myosins. The authors attribute the low kcat to the high proportion of SRX population in their ensembles. However, the DRX rates reported for the WT and E525K 15HPZ proteins in the single ATP turnover assay are ~3-4 fold lower than those of their S1 counterparts. These rates reflect basal turnover of ATP in the open state and thus should not be affected by the presence of the S2 tail, which leads to concerns about the 15HPZ protein activity. In addition, the very high percentage of stuck filaments in the in vitro motility assay for the 15HPZ constructs (despite the use of dark actin) is also concerning for significant amounts of enzymatically inactive protein.

      We thank the reviewer for pointing out the differences in the S1 and HMM DRX rates. We performed additional single turnover measurements with S1, adding two sets of measurements from one additional preparation (N=3), and we demonstrate that there is a significant increase in the DRX rates of WT S1 compared to WT HMM (see pages 4-5, Table 3, Figure 3- figure supplement 3). A faster rate in S1 was also reported in Rohde et al. 2018. Indeed, the DRX rates of E525K S1 are significantly higher than WT in S1, which we also now report in the results (see page 5, Figure 3 – figure supplement 3). We addressed the concerns about 15HPZ activity by performing NH4+ ATPase assays to demonstrate that the number of active heads was similar in S1 and 15HPZ HMM (see page 4). It is possible that the higher percentage of stuck filaments in the HMM motility is due to myosin heads in the IHM state on the motility surface, which generate a drag force by non-specifically interacting with actin, but further study is necessary to examine this question.

      The authors assert that the E525K mutation represents a new mechanism by which DCM-causing mutations lead to decreased contractility - by stabilizing the sequestered state rather than affecting motor function. However, there is no evaluation of the motor function (actin-activated ATPase activity or in vitro motility) of the E525K S1, which would reveal the effects of the mutation without confounding effects due to the sequestering of heads. Interestingly, in the single ATP turnover assay, the DRX rate of the E525K S1 is >2-fold higher than the WT control, suggesting that the mutation may have effects beyond stabilization of the SRX state. The conclusion that the E525K mutation's effect on myosin function is mediated via stabilization of the SRX state would be strengthened if the effects of the mutation on the motor domain alone were also known.

      We thank the reviewer for this suggestion. We performed actin-activated ATPase assays with WT and E525K S1 and found that E525K increases kcat and lowers KATPase, demonstrating enhanced intrinsic motor activity in the mutant S1 construct (see page 4, Figure 2B). This adds an interesting dimension to the manuscript because we report a mutant that enhances the intrinsic motor activity but stabilizes the SRX/IHM (see Discussion page 10). We did not perform in vitro motility, because this assay depends on the surface attachment strategy, and we would like to compare all constructs with the same attachment strategy using a C-terminal GFP tag (mutant and WT S1 and 15HPZ HMM). Therefore, we are making the S1 construct with a C-terminal GFP tag for this purpose, to be examined in a future study.

      While the authors show strong qualitative correlations between the SRX and IHM states using single ATP turnover, FRET, and EM experiments, attempts to quantitatively compare the fraction of heads in the IHM state using the various experimental approaches is problematic. For example, the R0 value of the FRET pair used here doesn't allow precise measurement of the distances being probed here to be made, but the distances are reported and compared to predicted distances. The authors report that the R0 for their FRET pair is 63 Å. Surprisingly the authors go on to use the steady-state FRET efficiency values to determine the average D-A distance (Fig 5B) which is 100 Å when all heads are in the IHM configuration and becomes larger than that when heads open. R0 of 63 Å allows a precise distance measurement to be made in the 31.5-94.5 Å range which corresponds to 0.5-1.5 R0. It is therefore technically incorrect to use the steady-state FRET efficiency values to determine the D-A distance here. Besides, there are several unknown factors here like orientation factor (κ2) which further complicate these calculations. Similarly, the quantification of IHM state molecules from the negative stain EM experiments is significantly hampered by the disruptive effect of the grid surface on the structure of the IHM state. The authors find that limiting the contact time with the grid to ~ 5s is necessary to preserve the IHM state.

      Despite that, only ~15% WT molecules were seen in the IHM state at low salt (Fig. 6B). In contrast, ~56% E525K molecules were in the IHM state. Both these proteins have similar SRX proportions (Fig. 3C) and similar FRET efficiency values (Fig. 5A) at this salt concentration. This mismatch highlights the problem arising due to not having a measure of the populations from the FRET data. It is not clear if the hugely different proportions of the IHM state in EM experiments are indicative of the relative stability of this state in the two proteins or a random difference in the electrostatic interactions of WT vs mutant with the grid. These experiments do not provide a correct idea of the %IHM in the two proteins. In the absence of any IHM population measurement, it is important to proceed with caution when quantitatively correlating the SRX and IHM.

      We thank the reviewer for pointing out that measuring precise distances by FRET can be difficult. We agree that the low FRET efficiency makes precise distance determination even more challenging. However, FRET is quite good at measuring a change in distance given a specific donor-acceptor pair. We feel our FRET biosensor clearly demonstrates FRET efficiencies that are salt-insensitive in E525K but a clear decrease in FRET at higher salt concentrations in WT. In order to compare the trend in the predicted FRET, based on the single turnover measurements, and the actual FRET we thought it was important to plot the two together on the same graph. We understand that this could have been misleading that we were reporting actual distances. We have now plotted the FRET efficiency instead of distance as a function of KCl concentration (Figure 5B), to prevent any confusion with reporting distances. In addition, we have emphasized that the data are plotted to allow for a comparison of the trend in the single turnover and FRET data (see page 6, 10, Figure 5B).

      We agree that it is important to proceed with caution when comparing the EM to the FRET and single turnover data. The EM does not give a quantitative estimate of the fraction of IHM molecules, due to the disruptive effect of the grid surface on protein conformation. However, it does provide direct (though qualitative) evidence that the conformation underlying SRX and enhanced FRET is the IHM, and it is consistent with our interpretation that the E525K mutation enhances FRET and SRX by stabilizing the IHM. To consolidate this result, we have performed EM experiments now with a total of 3 preparations of WT and mutant (see page 6-7 and Figure 6D). We find that while there is variability from experiment to experiment, likely because the grid surface is slightly different each time the experiment is performed, in all cases there was a ~4-fold higher fraction of folded molecules in the mutant. Since each WT/mutant experimental pair was studied in parallel, using identically prepared grids, the results provide further evidence that the mutant stabilizes the IHM. However, we agree that a quantitative, direct visual correlation of the SRX and IHM is not possible based on the current EM data.

      Finally, the utility of the methods described in the paper to the field would be greatly enhanced if they were described in more detail. As currently written, it would be difficult for others to replicate these experiments.

      Thank you for the comment. We have made significant changes in the methods to clarify the details of the experiments (see pages 11-14). In addition, we have added details to the results and figure legends.

    1. Author Response

      Reviewer #1 (Public Review):

      “This study investigates the dynamics of brain network connectivity during sustained experimental pain in healthy human participants. To this end, capsaicin was applied to the tongues of two cohorts of participants (discovery cohort, N=48; replication cohort, N=74). This procedure resulted in pain for several minutes. During sustained pain, pain avoidance/intensity ratings and fMRI scans were obtained. The analyses (i) compare the pain state with a resting state, (ii) assess the dynamics of brain networks during sustained pain, and (iii) aim to predict pain based on the dynamics of brain networks. To this end, the analyses focus on community structures of time-evolving networks. The results show that sustained pain is associated with the emergence of a brain network including somatomotor, frontoparietal, basal ganglia and thalamic brain areas. The somatomotor area of the tongue is particularly involved in that network while this area is decoupled from other parts of the somatomotor cortex. Moreover, the network configuration changes over time with the frontoparietal network decoupling from the somatomotor network. Frontoparietal-cerebellar connections were predictive of decreases of pain. Together, the findings provide novel and convincing insights into the dynamics of brain network during sustained pain.

      Strengths

      • The brain mechanisms of sustained pain is a timely and relevant topic with potential clinical implications.

      • Assessing the dynamics of sustained pain and relating it to the dynamics of brain networks is a timely and promising approach to further the understanding of the brain mechanisms of pain.

      • The study includes discovery and replication cohorts and pursues a cutting-edge analysis strategy.

      • The manuscript is very well-written and the results are visualized in an exemplary manner including a graphical outline and summary of the findings.”

      We thank the reviewer for the thoughtful summarization and evaluation of our study.

      “Weaknesses

      • It remains unclear whether the changes of brain networks over time simply reflect the duration of sustained pain or whether they essentially reflect different levels of pain intensity/avoidance.”

      We appreciate the editor and reviewer’s comment on this issue. With the current experimental paradigm, it is difficult to dissociate the pain duration from the level of pain because the delivery of oral capsaicin commonly induces initial bursting and then a gradual decrease of pain over time. That is, the pain duration is correlated with the pain intensity in our task.

      However, when we examined the time-course of the ratings at each individual level (as shown in Figure S2), the time duration explained 53.7% of the rating variance, R2 = 0.537 ± 0.315 (mean ± standard deviation). In addition, if we constrain the beta coefficient of the time duration to be negative (i.e., ratings should decrease over time), the explained variance decreases to 48.2%, R2 = 0.482 ± 0.457, leaving us enough variance (i.e., greater than 50%) for examining the distinct effects of time duration and ratings on the patterns of functional brain reorganization.

      Indeed, the two main analyses included in the manuscript—consensus community detection and predictive modeling—were designed to examine those two aspects of the task, i.e., time duration and pain avoidance ratings, respectively. First, through the consensus community detection analysis, we examined the community structure that changes over time, i.e., across the early, middle, and late periods (as shown in Figure 3). We then developed predictive models of pain avoidance ratings in the second main analysis (as shown in Figure 5).

      Though it is still a caveat that we cannot fully dissociate the effects of time duration versus pain ratings, we could interpret the first set of results to be more about time duration, while the second set of results is more about pain ratings.

      We now added a description of the implication of predictive modeling for isolating the effects of pain ratings. In addition, a discussion on the caveat of the current experimental design and relevant future direction.

      Revisions to the main manuscript:

      p. 25: Moreover, developing models to directly predict the pain ratings is helpful to complement the group-level analysis, because the changes in consensus community structure over the early, middle, and late periods only indirectly reflect the different levels of pain.

      p. 27: This study also had some limitations. First, with the current experimental paradigm, it is difficult to dissociate the pain duration from the level of pain because the delivery of oral capsaicin commonly induces initial bursting and then a gradual decrease of pain over time. Though we aimed to model the effects of pain duration and pain avoidance ratings with our two primary analyses, i.e., consensus community detection and predictive modeling, we cannot fully dissociate the impact of time duration versus pain ratings.

      “• Although the manuscript is very well-written it might benefit from an even clearer and simpler explanation of what the consensus community structure and the underlying module allegiance measure assesses.”

      We thank you for the suggestion. Now we added additional (but simple) descriptions of module allegiance and consensus community detection methods.

      Revisions to the main manuscript:

      pp. 8-9: Here, the consensus community means the group-level representative structures of the distinct community partitions of individuals. To determine the consensus community across different individuals and times, we first obtained the module allegiance (Bassett et al., 2011) from the community assignment of each individual. Module allegiance assesses how much a pair of nodes is likely to be affiliated with the same community label, and is defined as a matrix T whose element Tij is 1 when nodes i and j are assigned to the same community and 0 when assigned to different communities. This conversion of the categorical community assignments to the continuous module allegiance values allows group-level summarization of different community structures of individuals.

      p. 14: Here, high module allegiance indicates the voxels of two regions are likely to be in the same community affiliation, and vice versa.

      “• The added value of the assessment of the dynamics of brain networks remains unclear. Specifically, it is unclear whether the current analysis of brain networks dynamics allows for a clearer distinction between and prediction of pain and no-pain states than other measures of static or dynamic brain activity or static measures of brain connectivity.”

      The main goal (and thus, the added value) of the current study was to provide a “mechanistic” understanding of the brain processes of sustained pain, rather than the “prediction.” Even though we included the results from the predictive modeling, as in Figures 4-6, our focus was more on the interpretation of the model to quantitatively examine the functional changes in the brain, not on the maximization of the prediction performance.

      Indeed, maximizing the prediction performance was the main goal of our previous study (Lee et al., 2021), in which we developed a predictive model of sustained pain based on the patterns of dynamic functional connectivity. The model showed better prediction performances compared to the current study, but it was challenging to interpret the model because of the high dimensionality of the model and its features. In addition, functional connectivity itself provides only limited insight into how functional brain networks are structured and reconfigured over time.

      In this sense, the multi-layer community detection method has several advantages to achieving our goal. First, the community detection analysis allows us to summarize the complex, high-dimensional whole-brain connectivity patterns into neurobiologically interpretable subsystems. Second, the multi-layer community detection method allows us to study the temporal changes in community structure by connecting the same nodes across different time points.

      Now we added a description of the rationale behind the choice of the multi-layer community detection analysis over the conventional functional connectivity methods, and the added value of our study.

      Revisions to the main manuscript:

      p. 3: In this study, we examined the reconfiguration of whole-brain functional networks underlying the natural fluctuation in sustained pain to provide a mechanistic understanding of the brain responses to sustained pain.

      p. 7: In this study, we used this approach to examine the temporal changes of brain network structures during sustained pain, which cannot be done with conventional functional connectivity-based analyses (Lee et al., 2021).

      p. 27: However, the previous model provides a limited level of mechanistic understanding because of the high dimensionality of the model and its features. In addition, functional connectivity itself provides only limited insight into how functional brain networks are structured and reconfigured over time.

      Reviewer #2 (public Review):

      “The Authors J-J Lee et al., investigated cortical and subcortical brain networks and their organization in communities over time during evoked tonic pain. The paper is well-written, and the findings are interesting and relevant for the field. Interestingly, other than confirming well known phenomena (e.g., segregation within the primary somatomotor cortex) the Authors identified an emerging "pain supersystem" during the initial increase of pain, in which subcortical and frontoparietal regions, usually more segregated, showed more interactions with the primary somatomotor cortex. Decrease of pain was instead associated to a reconfiguration of the networks that sees subcortical and frontoparietal regions connected with areas of the cerebellum. The main novelty of the proposed analysis, lies in the resulting high performances of the classifier, that shows how this interesting link between frontoparietal network and subcortical regions with the cerebellum, is predictive of pain decrease. In summary, the main strengths of the present manuscript are: • Inclusion of subcortical regions: most of the recent papers using the Shaefer parcellation in ~200 brain areas1, do not consider subcortical areas, ignoring possible relevant responses and behaviors of those regions. Not only the Authors smartly addressed this issue, but most of their results showed how subcortical regions played a key role in the networks reconfiguration over time during evoked sustained pain.

      • Robust classification results: high accuracy obtained on training dataset (internal validation), using a leave-one-out approach, and on the available independent test dataset (external validation) of relatively large sample size (N=74).

      • Clarity in the description of aim and sub-aims and exhaustive presentation of the obtained results helped by appropriate illustrations and figures (I suggest less wording in some of them).

      • Availability of continuous behavioral outcome (track ball).”

      We appreciate the reviewer’s summary and positive evaluations.

      “Even though the results are mostly cohesive with previous literature, some of the results need to be discussed in relationship to recently published papers on the same topic as well as justifying some of the non-standard methodological procedures adding appropriate citations (or more detailed comments). The Authors do not touch upon the concept of temporal summation of pain, historically associated with tonic pain, especially when the study is finalized to better understanding brain mechanisms in chronic pain populations (chronic pain patients often exhibit increased temporal summation of pain2). I would suggest starting from the paper recently published by Cheng et al. that also shares most of the methodological pipeline3 to highlight similarities and novelties and deepen the comparison with the associated literature.”

      We thank the reviewer and editor for the comment on this important topic. Temporal summation of pain indicates progressively increased sensation of pain during prolonged noxious stimulation (Price, Hu, Dubner, & Gracely, 1977), and has been suggested as a hallmark of chronic pain disorders including fibromyalgia (Cheng et al., 2022; Price et al., 2002). In a recent study by Cheng et al. (2022), the authors induced tonic pain using constantly high cuff pressure and examined whether the participants experienced increased pain in the late period compared to the early period of pain. On the contrary, in our experimental paradigm, the capsaicin liquid initially delivered into the oral cavity is being cleaned out by saliva, and thus overall pain intensity was decreasing over time, not increasing (Figure 1B). Therefore, the temporal summation of pain may occur in a limited period (e.g., the early period of the run), but it is difficult to examine its effect systematically in our study.

      However, it is notable that Cheng et al.’s results overlap with our findings. For example, Cheng et al. reported the intra-network segregation within the somatomotor network and the inter-network integration between the somatomotor and other networks during the temporal summation of pressure pain in patients with fibromyalgia, which were similar to the findings we reported in Figure S9 and Figure 4. Although it is unclear whether these results reflect the temporal summation of pain, these network-level features shared across the two studies are likely to be an essential component of the sustained pain processes in the brain.

      Now we added a comment on the temporal summation of pain in the main manuscript.

      Revisions to the main manuscript (p. 26):

      Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      We thank the reviewer and editor for the information about this recent publication. Cheng et al. (2022) was not published at the time we wrote the manuscript, and we were surprised that Cheng et al. shares many aspects with our study, e.g., both used multilayer community detection and also reported similar findings, as described above.

      However, there were some differences between the two studies as well.

      First, the focus of our study was on the brain dynamics during the natural time-course of sustained pain from its initiation to remission in healthy participants, whereas the focus of Cheng et al. was on the temporal summation phenomenon of pain (TSP) and the enhanced TSP in patients with fibromyalgia patients. Because of this difference in the research focuses, our study and Cheng et al. are providing many nonoverlapping results and insights. For example, our study paid particular attention to the coping mechanisms of the brain (e.g., the network-level changes in the subcortical and frontoparietal network regions) and the brain systems that are correlated with the natural decrease of pain (e.g., the cerebellum in Figure 5). In contrast, Cheng et al. (2022) identified the brain connectivity and network features important for the increased TSP in fibromyalgia patients.

      Second, our great interest was in identifying and visualizing the fine-grained spatiotemporal patterns of functional brain network changes over the period of sustained pain. To utilize fine-grained brain activity information, we conducted our main analyses at a voxel-level resolution and on the native brain space, such as in Figures 2-3 and Figures S5, S7, and S8. With this fine-grained spatiotemporal mapping, we were able to identify small, but important voxel-level dynamics.

      We now cited Cheng et al. (2022) in multiple places and revised the manuscript accordingly.

      Revisions to the main manuscript (p. 26):

      Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      “Here the main significant weaknesses of the study:

      • The data analysis is entirely conducted on young healthy subjects. This is not a limitation per se, but the conclusion about offering new insights into understanding mechanisms at the basis of chronic pain is too far from the results. Centralization of pain is very different from summation and habituation, especially if all the subjects in the study consistently rated increased and decreased pain in the same way (it never happens in chronic pain patients). A similar pipeline has been actually applied to chronic pain patients (fibromyalgia and chronic back pain)3,4. Discussing the results of the present paper in relationship to those, could offer a more robust way to connect the Authors' results to networks behavior in pathological brains.”

      We are grateful for the opportunity to discuss the clinical implication of our study. First of all, we agree with the reviewer and editor that we cannot make a definitive claim about chronic pain with the current study, and thus, we revised the last sentence of the abstract to tone down our claim.

      Revisions to the main manuscript (p. 2, in the abstract):

      This study provides new insights into how multiple brain systems dynamically interact to construct and modulate pain experience, advancing our mechanistic understanding of sustained pain.

      However, as we noted above in E-4, some of our findings were consistent with the findings from a previous clinical study (Cheng et al., 2022), suggesting the potential to generalize our study to clinical pain conditions. In addition, we previously reported that a predictive model of sustained pain derived from healthy participants performed better at predicting the pain severity of chronic pain patients than the model derived directly from chronic pain patients (Lee et al., 2021), highlighting the advantage of the “component process approach.”

      The component process approach aims to develop brain-based biomarkers for basic component processes first, which can then serve as intermediate features for the modeling of multiple clinical conditions (Woo, Chang, Lindquist, & Wager, 2017). This has been one of the core ideas of the Research Domain Criteria (RDoC) (Insel et al., 2010) and the Hierarchical Taxonomy of Psychopathology (HiTOP) (Kotov et al., 2017). If the clinical pain of a patient group is modeled as a whole, it becomes unclear what is being modeled because of the multidimensional and heterogeneous nature of clinical pain (Melzack, 1999) as well as other co-occurring health conditions (e.g., mental health issues, medication use, etc.). The component process approach, in contrast, can specify which components are being modeled and are relatively free from heterogeneity and comorbidity issues by experimentally manipulating the specific component of interest in healthy participants.

      The current study was conducted on healthy young adults based on the component process approach. We used oral capsaicin to experimentally induce sustained pain, which unfolds over protracted time periods and has been suggested to reflect some of the essential features of clinical pain (Rainville, Feine, Bushnell, & Duncan, 1992; Stohler & Kowalski, 1999). Therefore, the detailed characterization of the brain processes of sustained pain will be able to serve as an intermediate feature of multiple clinical conditions in future studies.

      Now we added the discussion on the clinical generalizability issue in the discussion section.

      Revisions to the main manuscript:

      pp. 25-26: An interesting future direction would be to examine whether the current results can be generalized to clinical pain. Experimental tonic pain has been known to share similar characteristics with clinical pain (Rainville et al., 1992; Stohler & Kowalski, 1999). In addition, in a recent study, we showed that an fMRI connectivity-based signature for capsaicin-induced orofacial tonic pain can be generalized to chronic back pain (Lee et al., 2021). Therefore, a detailed characterization of the brain responses to sustained pain has the potential to provide useful information about clinical pain.

      p. 26: Interestingly, a recent fMRI study on the temporal summation of pain in fibromyalgia patients reported results similar to ours (Cheng et al., 2022), including the intra-network dissociation within the somatomotor network and the inter-network integration between the somatomotor and other networks during pain. Although we cannot directly examine whether the temporal summation of pain gave rise to these network-level changes due to the limitation of our experimental paradigm, these consistent findings between the two studies may suggest that our findings could be generalized to clinical conditions.

      “Vice versa, the behavioral measure used to assess evoked pain perception (avoidance ratings), has been developed for chronic pain patients and never validated on healthy controls5. It might not be an appropriate measure considering the total absence of pain variability in the reported responses over forty-eight subjects6,7.”

      We acknowledge that pain avoidance measures are not fully validated in the healthy population. Nevertheless, we used this measure in this study for the following two main reasons that outweigh the limitations.

      First, a pain avoidance rating provides an integrative measure that can reflect the multi-dimensional aspects of sustained pain. One of the essential functions of pain is to avoid harmful situations and promote survival, and the avoidance motivation induced by pain is composed of not only sensory-discriminative, but also cognitive components including learning, valuation, and contexts (Melzack, 1999). According to the fear-avoidance model (Vlaeyen & Linton, 2012), if the pain-induced avoidance motivation is not resolved for a long time and is maladaptively associated with innocuous environments, chronic pain is likely to develop, suggesting the importance and clinical relevance of pain avoidance measures. In addition, our experimental design is particularly suitable for the use of avoidance rating because the oral capsaicin stimulation is accompanied by the urge to avoid the painful sensation, but it cannot immediately be resolved similar to chronic pain. Moreover, capsaicin is sometimes experienced as intense but less aversive (or even appetitive) in some cases, e.g., spicy food craver (Stevenson & Yeomans, 1993). In this case, avoidance ratings can provide a more reasonable measure of pain compared to the intensity rating.

      Second, the avoidance measure provides a common scale on which we can compare different types of aversive experiences, allowing us to conduct specificity tests for a predictive model of pain. For example, a recent study successfully compared the brain representations of two types of pain and two types of aversive, but non-painful experiences (e.g., aversive auditory and visual experiences) using the same avoidance measure (Ceko, Kragel, Woo, Lopez-Sola, & Wager, 2022). These comparisons were possible because the avoidance measure provided one common scale for all the aversive experiences regardless of their types of stimuli.

      To provide a better justification for the use of the avoidance measure, we now included the specificity test results of our pain predictive models. More specifically, we tested our module allegiance-based SVM and PCR models of pain on the aversive taste and aversive odor conditions (Figure S13).

      Despite these advantages, the use of avoidance rating without thorough validation is a limitation of the current study, and thus future studies need to examine the psychometric properties of the avoidance rating, e.g., examining the relationship among pain intensity, unpleasantness, and avoidance measures. However, the current study showed that the predictive models derived with pain avoidance rating (Study 1) could be used to predict the pain intensity rating (Study 2). In addition, the overall time-course of pain avoidance ratings in Study 1 was similar to the time-course of pain intensity ratings in Study 2, providing some supporting evidence for the convergent validity of the pain avoidance measure.

      As to the following comment, “It might not be an appropriate measure considering the total absence of pain variability in the reported responses over forty-eight subjects,” there are pieces of evidence supporting that the low between-individual variability of ratings is due to the characteristics of our experimental design, not to the fact that we used the avoidance measure. As we discussed in more detail in our response to E-1, our experimental procedure based on capsaicin liquid commonly induces the initial burst of painful sensation and the subsequent gradual relief for most of the participants (Figure 1B, left). A similar time-course pattern of ratings was observed in Study 2 (Figure 1B, right), which used the pain “intensity” rating, not the pain avoidance rating. In addition, previous studies with a similar experimental design (i.e., intra-oral capsaicin application) (Berry & Simons, 2020; Lu, Baad-Hansen, List, Zhang, & Svensson, 2013; Ngom, Dubray, Woda, & Dallel, 2001) also showed a similar time-course of pain ratings with low between-individual variability regardless of the rating types (e.g., VAS or irritation intensity), confirming that this observation is not unique to the pain avoidance rating.

      Now we added descriptions on the small between-individual variability of pain ratings and the use of avoidance ratings.

      Revisions to the main manuscript:

      pp. 5-7: Note that the overall trend of pain ratings over time was similar across participants because of the characteristics of our experimental design, which has also been observed in the previous studies that used oral capsaicin (Berry & Simons, 2020; Lu et al., 2013; Ngom et al., 2001). However, also note that each individual’s time-course of pain ratings were not entirely the same (Figures S2 and S3).

      p. 26: However, there are also differences between the characteristics of capsaicin-induced tonic pain versus clinical pain. For example, clinical pain continuously fluctuates over time in an idiosyncratic pattern (Apkarian, Krauss, Fredrickson, & Szeverenyi, 2001), whereas capsaicin-induced tonic pain showed a similar time-course pattern across the participants—i.e., increasing rapidly and then decreasing gradually (Figure 1B). This typical time-course of pain ratings has been reported in previous studies that used oral capsaicin (Berry & Simons, 2020; Lu et al., 2013; Ngom et al., 2001).

      pp. 26-27: Note that Study 1 used a pain avoidance measure that is not yet fully validated in healthy participants. However, we chose to use the pain avoidance measure, which can provide integrative information on the multi-dimensional aspects of pain (Melzack, 1999; Waddell, Newton, Henderson, Somerville, & Main, 1993). It also has a clinical implication considering that the maladaptive associations of pain avoidance to innocuous environments have been suggested as a putative mechanism of transition to chronic pain (Vlaeyen & Linton, 2012). Lastly, the avoidance measure can provide a common scale across different modalities of aversive experience, allowing us to compare their distinct brain representations (Ceko et al., 2022) or test the specificity of their predictive models (Lee et al., 2021) (Figure S13). Although the psychometric properties of the pain avoidance measure should be a topic of future investigation, we expect that the pain avoidance measure would have a high level of convergent validity with pain intensity given the observed similarity between pain avoidance (Study 1) and pain intensity (Study 2) in their temporal profiles. The generalizability of our PCR model across Studies 1 and 2 also supports this speculation. However, there would also be situations in which pain avoidance is dissociated from pain intensity. For example, capsaicin can be experienced to be intense but less aversive or even appetitive in some contexts, such as cravings for spicy food (Stevenson & Yeomans, 1993). In addition, the gradual rise of avoidance ratings during the late period of the control condition in Study 1 would not be observed if the intensity measure was used. Future studies need to examine the relationship between pain avoidance and the other pain assessments and the advantage of using the pain avoidance measure.

      “• The dynamic measure employed by the Authors is better described from the term "windowed functional connectivity". It is often considered a measure of dynamic functional connectivity and it gives information about fluctuations of the connectivity patterns over time. Nevertheless, the entire focus of the paper, including the title, is on dynamic networks, which inaccurately leads one to think of time-varying measures with higher temporal resolution (either updating for every acquired time point, as the Authors did in their previous publication on the same dataset4, or sliding windows involving weighting or tapering8,9). This allows one to follow network reorganization over time without averaging 2-min intervals in which several different brain mechanisms might play an important role3,10,11. In summary, the assumption of constant response throughout 2-min periods of tonic pain and the use of Pearson correlations do not mirror the idea of dynamic analysis expressed by the Authors in title and introduction. I would suggest removing "dynamic" from the title, reduce the emphasis on this concept, address possible confounds introduced by the choice of long windows and rephrase the aim of the study in terms of brain network reconfiguration over the main phases of tonic pain experience.”

      Now we removed the word ‘dynamic’ from many places in the manuscript, including the title. In addition, we added a brief discussion on the reason we chose to use the long and non-overlapping windows for connectivity calculation.

      Revisions to the main manuscript (p. 8):

      Although the long duration of the time window without overlaps may obscure the fine-grained temporal dynamics in functional connectivity patterns, we chose to use this long time window based on previous literature (Bassett et al., 2011; Robinson, Atlas, & Wager, 2015), which also used long time windows to obtain more reliable estimates of network structures and their transitions.

      “• Procedure chosen for evoking sustained pain. To the best of my knowledge, capsaicin sauce on the tongue is not a validated tonic pain procedure. In favor of this argument is the absence of inter-subject variability in the behavioral results showed in the paper, very unusual for response to painful stimulations. The procedure is well described by the Authors, and some precautions like letting the liquid drying before the start of the scan, have helped reducing confounds. Despite this, the measures in figure 1B suggest that the intensity of the painful stimulation is not constant as expected for sustained pain (probably the effect washes out with the saliva). In this case, the first six-minute interval requires particular attention because it encapsulates the real tonic pain phase, and the following ones require more appropriate labels. Ideally the Author should cite previous studies showing that tongue evoked pain elicits a very specific behavioral response (summation, habituation/decrease of pain, absence of pain perception). If those works are missing, this response need to be treated as a funding rather than an obvious point.”

      We addressed this comment. Moreover, we could find previous studies that experimentally induced tonic pain through the application of capsaicin on the tongue (Berry & Simons, 2020; Boudreau, Wang, Svensson, Sessle, & Arendt-Nielsen, 2009; Green, 1991; Ngom et al., 2001), suggesting that our experimental procedure is in line with previous literature.

      Reviewer #3 (Public Review ):

      “In their manuscript, Lee and colleagues explore the dynamics of the functional community structure of the brain (as measured with fMRI) during sustained experimental pain and provide several potentially highly valuable insights into, and evaluate the predictive capacity of, the underlying dynamic processes. The applied methodology is novel but, at the same time, straightforward and has solid foundations. The findings are very interesting and, potentially, of high scientific impact as they may significantly push the boundaries of our understanding of the dynamic neural processes during sustained pain, with a (somewhat limited) potential for clinical translation.

      However (Major Issue 1), after reading the current manuscript version, not all of my doubts have been dissolved regrading the specificity of the results to pain. Moreover (Major Issue 2), some of the results (specifically, those related to the group level analysis of community differences) do not seem to be underpinned with a proper statistical inference in the current version of the manuscript and, therefore, their presentation and discussion may not be proportional to the degree of evidence. Next to these Major Issues (detailed below), some other, minor clarifications might also be needed before publications. These are detailed below or in the private part of the review ("Recommendations for the authors").

      Despite these issues, this is, in general, a high quality work with a high level of novelty and - after addressing the issues - it has a very high potential for becoming an important contribution (and a very interesting read) to the pain-research community and beyond.”

      We appreciate the reviewer’s thoughtful comments. We have revised the manuscript to address the Reviewer’s major concerns, as described below.

      “Major Issue 1:

      The main issue with the manuscript is that it remains somewhat unclear, how specific the results are to pain.

      Differences between the control resting state and the capsaicin trials might be - at least partially - driven by other factors, like:

      • motion artifacts

      • saliency, attention, axiety, etc.

      Differences between stages over the time-course might, additionally, be driven by scanner drifts (to which the applied approach might be less sensitive, but the possibility is still there ) or other gradual processes, e.g. shifts in arousal, attention shifts, alertness, etc.

      All the above factors might emerge as confounding bias in both of the predictive models.

      This problem should be thoroughly discussed, and at least the following extra analyses are recommended, in order to attenuate concerns related to the overall specificity and neurobiological validity of the results:

      • reporting of, and testing for motion estimates (mean, max, median framewise displacement or anything similar)

      • examining whether these factors might, at least partially, drive the predictive models.

      • e.g. applying the PCR model on the resting state data and verifying of the predicted timecourse is flat (no inverse U-shape, that is characteristic to all capsaicin trials).

      Not using the additional sessions (bitter taste, aversive odor, phasic heat) feels like a missed opportunity, as they could also be very helpful in addressing this issue.”

      We thank the reviewer for this comment on the important issue regarding the specificity of our results and the potential influences of noise. The effects of head motion and physiological confounds are particularly relevant to pain studies because pain involves substantial physiological changes and often causes head motion. To address the related concerns of specificity, we conducted additional analyses assessing the independence of our predictive models (i.e., SVM and PCR models) from head movement and physiology variables and the specificity of our models to pain versus non-painful aversive conditions (i.e., bitter taste and aversive odor) in Study 1.

      First, we examined the overall changes of framewise displacement (FD) (Power, Barnes, Snyder, Schlaggar, & Petersen, 2012), heart rate (HR), and respiratory rate (RR) in the capsaicin condition (Figure S11). For the univariate comparison between the capsaicin vs. control conditions (Figure S11A), the results showed that, as expected, the capsaicin condition caused significant changes in head motion and autonomic responses. The mean FD and HR were significantly higher, and the RR was lower in the capsaicin condition compared to the control condition (FD: t47 = 5.30, P = 2.98 × 10-6; HR: t43 = 4.98, P = 1.10 × 10-5; RR: t43 = -1.91, P = 0.063, paired t-test). In addition, the increased motion and autonomic responses were more prominent in the early period of pain (Figure S11B). The 10-binned (2 mins per time-bin) FD and HR showed a decreasing trend while the RR showed an increasing trend over time in the capsaicin condition. The comparisons between the early (1-3 bins, 0-6 min) vs. late (8-10 bins, 14-20 min) periods of the capsaicin condition showed significant differences both for FD and HR (FD: t47 = 6.45, P = 8.12 × 10-8; HR: t43 = 6.52, P = 6.41 × 10-8; RR: t43 = -1.61, P = 0.11, paired t-test). These results suggest that while participants were experiencing capsaicin tonic pain, particularly during the early period, head motion and heart rate were increased, while breathing was slowed down. Note that we needed to exclude 4 participants’ data in this analysis due to technical issues with the physiological data acquisition.

      Next, we examined whether the changes in head motion and physiological responses influenced our predictive model performance (Figure S12). We first regressed out the mean FD, HR, and RR (concatenated across conditions and participants as we trained the SVM model) from the predicted values of the SVM model with leave-one-subject-out cross-validation (2 conditions × 44 participants = 88) and then calculated the classification accuracy again (Figure S12A). The results showed that the SVM model showed a reduced, but still significant classification accuracy for the capsaicin versus control conditions in a forced-choice test (n = 44, accuracy = 89%, P = 1.41 × 10-7, binomial test, two-tailed). We also did the same analysis for the PCR model (10 time-bins × 44 participants = 440) and the PCR model also showed a significant prediction performance (n = 44, mean prediction-outcome correlation r = 0.20, P = 0.003, bootstrap test, two-tailed, mean squared error = 0.159 ± 0.022 [mean ± s.e.m.]) (Figure S12B). These results suggest that our SVM and PCR models capture unique variance in tonic pain above and beyond the head movement and physiological changes.

      Lastly, we examined the specificity of our predictive models to pain, by testing the models on the non-painful but aversive conditions including the bitter taste (induced by quinine) and aversive odor (induced by fermented skate) conditions (Figure S13). All the model responses were obtained using leave-one-participant-out cross-validation. The results showed that the overall model responses of the SVM model for the bitter taste and aversive odor conditions were higher than those for the control condition but lower than the capsaicin condition (Figure S13A). Classification accuracies for comparing capsaicin vs. bitter taste and capsaicin vs. aversive odor were all significant (for capsaicin vs. bitter taste, accuracy = 79%, P = 6.17 × 10-5, binomial test, two-tailed, Figure S13C; for capsaicin vs. aversive odor, accuracy = 83%, P = 3.31 × 10-6, binomial test, two-tailed, Figure S13E), supporting the specificity of our SVM model of pain. Similarly, the model responses of the PCR model for the bitter taste and aversive odor conditions were lower than the capsaicin condition, and their temporal trajectories were less steep and fluctuating compared to the capsaicin condition (Figure S13B). The time-course of the model responses for the control condition was flatter than all other conditions and did not show the inverted U-shape. Furthermore, the model responses of the bitter taste and aversive odor conditions did not show the significant correlations with the actual avoidance ratings (bitter taste: mean prediction-outcome correlation r = 0.05, P = 0.41, bootstrap test, two-tailed, mean squared error = 0.036 ± 0.006 [mean ± s.e.m.], Figure S13D; aversive odor: mean prediction-outcome correlation r = 0.12, P = 0.06, bootstrap test, two-tailed, mean squared error = 0.044 ± 0.004 [mean ± s.e.m.], Figure S13F), suggesting the specificity of PCR model to pain.

      Overall, we have provided evidence that our models can predict pain ratings above and beyond the head motion and physiological changes and that the models are more responsive to pain compared to non-painful aversive conditions.

      Now we added descriptions on the specificity tests to the main manuscript and also to the Supplementary Information.

      Revisions to the main manuscript (p. 20):

      Specificity of the module allegiance-based predictive models To examine whether the predictive models were specific to pain and the prediction performances were not influenced by confounding variables such as head motion and physiological changes, we conducted additional analyses as shown in Figures S11-13. The SVM and PCR models showed significant prediction performances even after controlling for head motion (i.e., framewise displacement) and physiological responses (i.e., heart rate and respiratory rate) (Figures S11 and S12) and did not respond to the non-painful but aversive conditions including the bitter taste and aversive odor conditions (Figure S13), supporting the specificity of our predictive to pain. For details, please see Supplementary Results.

      Revisions to the Supplementary Information (pp. 2-4):

      Specificity analysis (Figures S11-13) To examine whether the predictive models (i.e., SVM and PCR models) were specific to pain and not influenced by confounding noises, we conducted additional specificity analysis assessing the independence of the models from head movement and physiology variables and specificity of our models to pain versus non-painful aversive conditions (i.e., bitter taste and aversive odor) in Study 1. First, we examined the overall changes of framewise displacement (FD) (Power et al., 2012), heart rate (HR), and respiratory rate (RR) in sustained pain (Figure S11). For the univariate comparison between capsaicin vs. control conditions (Figure S11A), the results showed that, as expected, capsaicin condition caused significant changes in motion and autonomic responses. The mean FD and HR were significantly higher, and the RR was lower in the capsaicin condition compared to the control condition (FD: t47 = 5.30, P = 2.98 × 10-6; HR: t43 = 4.98, P = 1.10 × 10-5; RR: t43 = -1.91, P = 0.063, paired t-test). For the temporal changes of movement and physiology variables (Figure S11B), the results showed that the increased motion and autonomic responses are more prominent in the early period of pain. The 10-binned (2 mins per time-chunk) FD and HR showed decreasing trend while the RR showed increasing trend over time in capsaicin condition. Additional univariate comparisons between early (1-3 bins, 0-6 min) vs. late (8-10 bins, 14-20 min) period of capsaicin condition showed that differences were significant for FD and HR (FD: t47 = 6.45, P = 8.12 × 10-8; HR: t43 = 6.52, P = 6.41 × 10-8; RR: t43 = -1.61, P = 0.11, paired t-test). This suggests that while participants were experiencing tonic pain, particularly in the early period, motion and heart rate was increased but breathing was slowed. Note that we needed to exclude 4 participants’ data due to technical issues with physiological data acquisition. Next, we examined whether the head movement and physiological responses are the main driver of our predictive models (Figure S12). For all the original signature responses from SVM model (2 conditions × 44 participants = 88), we regressed out the mean FD, HR, and RR (concatenated across conditions and participants as the SVM model was trained) and calculated the classification accuracy (Figure S12A). Although the signature responses were controlled for movement and physiology variables, the SVM model still showed a high classification accuracy for the capsaicin versus control conditions in a forced-choice test (n = 44, accuracy = 89%, P = 1.41 × 10-7, binomial test, two-tailed). Similarly, for all the original signature responses from PCR model (10 time-bins × 44 participants = 440), we regressed out the 10-binned FD, HR, and RR (concatenated across time-bins and participants as the PCR model was trained) and calculated the within-individual prediction-outcome correlation (Figure S12B). Again, the PCR model showed a significantly high predictive performance (n = 44, mean prediction-outcome correlation r = 0.20, P = 0.003, bootstrap test, two-tailed, mean squared error = 0.159 ± 0.022 [mean ± s.e.m.]) while controlling for movement and physiology variables. These results suggest that our SVM and PCR models captures unique variance in tonic pain above and beyond the head movement and physiological changes. Lastly, we examined the specificity of our predictive models to pain, by testing the models onto the non-painful but tonic aversive conditions including bitter taste (induced by quinine) and aversive odor (induced by fermented skate) (Figure S13). All the signature responses were obtained using leave-one-participant-out cross-validation. The results showed that the overall signature responses of SVM model for bitter taste and aversive odor conditions were higher than those for control conditions, but lower than capsaicin condition (Figure S13A). Classification accuracy between capsaicin vs. bitter taste and vs. aversive odor were all significantly high (capsaicin vs. bitter taste: accuracy = 79%, P = 6.17 × 10-5, binomial test, two-tailed, Figure S13C; capsaicin vs. aversive odor: accuracy = 83%, P = 3.31 × 10-6, binomial test, two-tailed, Figure S13E), suggesting the specificity of SVM model to pain. Similarly, the temporal trajectories of the signature responses of PCR model for bitter taste and aversive odor conditions were not overlapping with that of the capsaicin condition (Figure S13B). Furthermore, the signature responses of bitter taste and aversive odor conditions do not have significant relationship with the actual avoidance ratings (bitter taste: mean prediction-outcome correlation r = 0.05, P = 0.41, bootstrap test, two-tailed, mean squared error = 0.036 ± 0.006 [mean ± s.e.m.], Figure S13D; aversive odor: mean prediction-outcome correlation r = 0.12, P = 0.06, bootstrap test, two-tailed, mean squared error = 0.044 ± 0.004 [mean ± s.e.m.], Figure S13F), suggesting the specificity of PCR model to pain. Overall, we have provided evidence that the module allegiance-based models can predict pain ratings above and beyond the movement and physiological changes, and are more responsive to pain compared to non-painful aversive conditions, which suggest the specificity of our results to pain.

      “Major Issue 2:

      Another important issue with the manuscript is the (apparent) lack of statistical inference when analyzing the differences in the group-level consensus community structures (both when comparing capsaicin to control and when analysing changes over the time-course of the capsaicin-challenge).

      Although I agree that the observed changes seem biologically plausible and fit very well to previous results, without proper statistical inference we can't determine, how likely such differences are to emerge just by chance.

      This makes all results on Figs. 2 and 3, and points 1, 4 and 5 in the discussion partially or fully speculative or weakly underpinned, comprising a large proportion of the current version of the manuscript.

      Let me note, that this issue only affects part of the results and the remaining - more solid - results may already provide a substantial scientific contribution (which might already be sufficient to be eligible for publication in eLife, in my opinion).

      Therefore I see two main ways of handling Major Issue 2:

      • enhancing (or clarifying potential misunderstandings regarding) the methodology (see my concrete, and hopefully feasible, suggestions in the "private part" of the review),

      • de-weighting the presentation and the discussion of the related results.

      I believe there are many ways to test the significance of these differences. I highlight two possible, permutation testing-based ideas.

      Idea 1: permuting the labels ctr-capsaicin, or early-mid-late, repeating the analysis, constructing the proper null distribution of e.g. the community size changes and obtain the p-values. Idea 2: "trace back" communities to the individual level and do (nonparametric) statistical inference there.”

      We appreciate this important comment. We did not conduct statistical inference when comparing the group-level consensus community affiliations of the different conditions (Figure 2) or different phases (Figure 3) because of the difficulty in matching the community affiliation values of the networks to be compared.

      For example, let us assume that the 800 out of 1,000 voxels of community #1 and 1,000 out of 4,000 voxels of community #2 in the control condition are commonly affiliated with the same community #3 in the capsaicin condition. To compare the community affiliation between two conditions, we should first match the community label of the capsaicin condition (i.e., #3) to that of the control condition (i.e., #1 or #2), and here a dilemma occurs; if we prioritize the proportion of the overlapping voxels for the matching, the common community should be labeled as #1, whereas if we prioritize the number of the overlapping voxels for the matching, the label of the common community should be #2. Although both choices look reasonable, none of them can be a perfect solution.

      As the example above, it is impossible to exactly match the community affiliation of the different networks. We must choose an imperfect criterion for the matching procedure, which essentially affects the comparison of network structure. This was the main reason that we limited our results of Figures 2-3 to a qualitative description based on visual inspection. Moreover, the group-level consensus community structures in Figures 2-3 are not a simple group statistic like sample mean; they were obtained from multiple steps of analyses including permutation-based thresholding and unsupervised clustering, which could further complicate the interpretation of statistical tests.

      Alternatively, there is a slightly different but more rigorous approach to the comparisons of the community structures, which is the Phi-test (Alexander-Bloch et al., 2012; Lerman-Sinkoff & Barch, 2016). Instead of direct use of the community labels, this method converts the community label of each voxel into a list of module allegiance values between the seed voxel and all the voxels of the brain (i.e., 1 if the seed and target voxels have the same community label and 0 otherwise). This allows quantitative comparisons of voxel-level community profiles between different conditions without an arbitrarily matching of the community labels. We adopted this Phi-test for our analyses to examine whether the regional community affiliation pattern is significantly different between (i) the capsaicin vs. control conditions and (ii) the early vs. late periods of pain (Figure S6), which correspond to the main findings of the Figures 2 and 3 in our manuscript, respectively.

      More specifically, to compare the group-level consensus community structures between the capsaicin vs. control conditions and the early vs. late periods, we first obtained a seed-based module allegiance map for each voxel (i.e., using each voxel as a seed). Then, we calculated a correlation coefficient of the module allegiance values between two different conditions for each voxel. This correlation coefficient can serve as an estimate of the voxel-level similarity of the consensus community profile. Because module allegiance is a binary variable, these correlation values are Phi coefficients. A small Phi coefficient means that the spatial pattern of brain regions that have the same community affiliation with the given voxel are different between the two conditions. For example, if a voxel is connected to the somatomotor-dominant community during the capsaicin condition and the default-mode-dominant community during the control condition, the brain regions that have the same community label with the voxel will be very different, and thus the Phi coefficient will become small. Moreover, the Phi coefficient can be small even if a voxel is affiliated as the same (matched) community label for both conditions, when the spatial patterns of the same community is different between conditions.

      To calculate the statistical significance of the Phi coefficient, we conducted permutation tests, in which we randomly shuffled the condition labels in each participant and obtained the group-level consensus community structure for each shuffled condition. Then, we calculated the voxel-level correlations of the module allegiance values between the two shuffled conditions. We repeated this procedure 1,000 times to generate the null distribution of the Phi coefficients, and calculated the proportion of null samples that have a smaller Phi coefficient (i.e., a more dis-similar regional community structure) than the non-shuffled original data.

      Results showed that there are multiple voxels with statistical significance (permutation tests with 1,000 iterations, one-tailed) in the area where the community affiliations of the two contrasting conditions were different (Figure S6). For example, the frontoparietal and subcortical regions for the capsaicin vs. control (c.f., Figure 2), and the frontoparietal, subcortical, brainstem, and cerebellar regions for the early vs. late period of pain (c.f., Figure 3) contain voxels that survived after thresholding with FDR-corrected q < 0.05, suggesting the robustness of our main results.

      Particularly, the somatomotor and insular cortices showed statistical significance in the permutation test, and this may reflect the large changes in other areas that are connecting to the somatomotor and insular cortices across different conditions. The statistical significance was also observed in the visual cortex, which was unexpected. We interpret that the spatial distribution of the visual network community is too stable across conditions, and thus the null distribution from permutation formed a very narrow distribution of Phi coefficients. Therefore, a small change in the community structure could achieve statistical significance.

      Now we added descriptions on the permutation tests.

      Revisions to the main manuscript:

      p. 9: Permutation tests confirmed that the community assignment in the frontoparietal and subcortical regions showed significant changes between the capsaicin versus control conditions (Figure S6A).

      p. 13: Permutation tests further confirmed that the community assignment in the frontoparietal, subcortical, and brainstem regions showed significant changes between the early versus late period of pain (Figure S6B).

      pp. 36-37: Permutation tests for regional differences in community structures. To test the statistical significance of the voxel-level difference of consensus community structures (Figures 2 and 3), we performed the following Phi-test (Alexander-Bloch et al., 2012; Lerman-Sinkoff & Barch, 2016). First, for each given voxel, we compared the community label of the voxel to the community label of all the voxels, generating a list of voxel-seed module allegiance values that allow quantitative comparison of voxel-level community profile (e.g., [1, 0, 1, 1, 0, 0, ...], whose element is equal to 1 if the seed and target voxels were assigned to the same community and 0 otherwise). Next, a correlation coefficient was calculated between the module allegiance values of the two different brain community structures (i.e., capsaicin versus control, and early versus late). This correlation coefficient is an estimate of the regional similarity of community profiles (here, the correlation coefficient is Phi coefficient because module allegiance is a binary variable). To estimate the statistical significance of the Phi coefficient, we performed permutation tests, in which we randomly shuffled the labels and then obtained the group-level consensus community structures from the shuffled data. Then, the Phi coefficient between the module allegiance values of the two shuffled consensus community structures was calculated. We repeated this procedure 1,000 times to generate the null distribution of the Phi coefficient for each voxel. Lastly, we examined the probability to observe a smaller Phi coefficient (i.e., a more dissimilar community profile) than the one from the non-shuffled original data, which corresponds to the P-value of the permutation test. All the P-values were one-tailed as the hypothesis of this permutation test is unidirectional.

    1. Author Response

      Reviewer #1 (Public Review):

      In this manuscript by Chen et al., the authors use live-cell single-molecule imaging to dissect the role of DNA binding domains (DBD) and activation domains (AD) in transcription factor mobility in the nucleus. They focus on the family of HypoxiaInducible factors isoforms, which dimerize and bind chromatin to induce a transcriptional response. The main finding is that activation domains can be involved in DNA binding as indicated by careful observations of the diffusion/reaction kinetics of transcription factors in the nucleus. For example, different bound fractions of HIF-1beta and HIF2alpha are observed in the presence of different binding partners and chimeras. The paradigm of interchangeable parts of transcription factors has been eroded over the years (the recent work of Naama Barkai comes to mind, cited herein), so the present observations are not unexpected per se. Yet, the measurements are rigorous and wellperformed and have the important benefit of being in living cells. Enthusiasm is also dampened by the exclusive use of one technique and one analysis to reach conclusions.

      In the revised manuscript we complement the single molecule imaging experiments with genomic approaches, including Cut&Run and RNA-seq, that largely confirm our main conclusions derived from the SPT results. 

      Reviewer #2 (Public Review):

      The authors raise the very important question how different transcription factors with similar in vitro DNA sequence specificity are able to achieve distinct binding profiles associated with distinct functions. They use hypoxia inducible factors (HIF) as model system and combine live cell single-particle tracking with comprehensive genetic and chemical perturbations to study the mechanisms underlying isoform-specific gene regulation. Their main experimental readout is the distribution of diffusion coefficients of a molecular species, extracted from a population of single-particle trajectories. From this distribution, the authors extract the fractions of immobile and mobile molecules as well as the peak diffusion coefficient of the mobile fraction. They find that in addition to the structured DNA binding domain and the dimerization interface of HIF-1a and HIF-2a, the C-terminus of those factors, which includes intrinsically disordered regions and an activation domain, contributes to modulating the bound fraction of HIF-1b and the HIF-a isoforms. In particular, the C-terminus of HIF-2a mediates a higher bound fraction than the one of HIF-1a. This finding is important as it demonstrates that separating HIF into distinct domains that each have clearly defined functions is an oversimplification. Rather, a more holistic view seems suitable, in which all parts of HIF contribute to nuclear diffusion and binding.

      The conclusions drawn on the bound fractions and the nuclear dynamics of HIF isoforms are mostly backed up by data and proper controls. However, some controls are missing and some aspects of data analysis need to be clarified and extended. Moreover, the authors fail to answer their initial question, as the experimental readout does not contain information on the DNA sequences involved in the binding events.

      Experimental controls:

      For some imaging experiments, the authors use cell lines where endogenous HIF-1b or HIF-2a was fused to a N-terminal HaloTag by CRISPR/Cas editing. These cell lines are comprehensively controlled for proper functionality of the edited transcription factors, including expression levels, cellular localization and DNA binding. However, differential expression compared to unedited levels is not quantified and only Halo-HIF-2a is tested for functional gene transcription.

      To confirm that the tagged proteins still maintain normal function in driving target gene expression, we performed RNA-seq on WT cells, HaloTag-HIF-2α KIN and Halo-HIF-1β KIN cells, and show that gene expression on these edited cells do not differ significantly from unedited WT cells (Figure 1—figure supplement 3B, C).

      Other experiments include overexpression of exogenously expressed factors. For those, the authors give statements such as "expressed from a relatively strong ... promoter" and "weakly expressed", but do not provide any control of the amount of overexpression. Quantifying the expression levels will be important, as some of the author's experiments demonstrate a strong dependency of results on expression level. 

      We have now included Western Blot results showing L30-driven expression of all HIF variants in comparison with KIN levels (Fig 4—Figure Supplement 1). However, we note that cells stably expressing the HIF variants are polyclonal and Western Blotting is a bulk assay only able to assess the population average. As such, Western blot analysis may not reflect the actual expression level in the individual cells used in the imaging experiments. To properly control HIF expression at the individual cell level, we instead monitored the protein concentration in each cell and only chose to image cells with similar fluorescence level, as measured by localization density (Fig 4—Figure Supplement 1 and see detailed discussion in Appendix 2).

      Moreover, the authors do not provide any control for proper functionality of domainswap mutants.

      We now include RNA-seq results demonstrating that WT cells over-expressing HIF-α

      WT and domain swap variants (Halo-HIF-1α, Halo-HIF-1α/2α, Halo-HIF-2α, Halo-HIF2α/1α) can activate their specific target genes, confirming that all these variants are also transcriptionally active. (See Figure 6A, B, Figure 6—figure supplement 2 - increased binding of wild type or domain-swapped HIF to several gene loci or neighboring regions coincide with increased transcription levels of these genes, and Figure 7 - HIF expressing cells with same HIF-IDR co-cluster in their mRNA transcription profile).

      The authors further state that they use a high illumination power of 1100 mW. Such high laser power might be detrimental to cells and the authors should control whether this laser power induces any artifacts.

      We agree that a high illumination power (indispensable to achieve high signal-to-noise ratio and detect single molecules) may be detrimental to cells in the long run. However, we only took 1 movie with < 2000 frames for each cell. With a 5-ms frame rate, the total imaging duration per cell was under 10 seconds. Cells are unlikely to respond to any stimulus/damage in such a short time. Moreover, we used stroboscopic illumination instead of continuous illumination, with only 1-ms laser exposure for each 5-ms frame. The total integrated laser exposure is thus only 2 seconds. In addition, all imaging was done with a red laser (633 nm), which has a relatively low phototoxicity. Finally, the 1100 mW is the output from the laser box, but the actual laser power density used for imaging were measured to approximately 2.3 kW/cm2 at 633 nm (Graham et al., 2021). Such an imaging scheme is very unlikely to generate phototoxicity artifacts within the short time window of our measurements. Lastly, we are comparing results across all conditions with the exact same imaging set-up, so any artifact should be accounted and controlled for. We do consider fast SPT a terminal, end-point experiment, where each cell is only imaged once and never re-used.

      Data analysis:

      Distributions of diffusion coefficients greatly vary between individual cells (e.g. Fig. 2A and B, Fig. S3A and C, Fig. S4E). Unfortunately, the authors do not explain whether this variation is a real cell-to-cell variation, or rather reflects variation of their analysis method, potentially due to a low number of single particle tracks per cell. 

      We agree with the reviewer that the cell-to-cell variation we observed could be due to a low number of trajectories collected for each cell. In fact, sampling small numbers of trajectories allows us to identify protein species with unique diffusion coefficients, which might be lost if we just looked at a large population. Also, the fact that the diffusion coefficient distribution varies between cells does not mean that a particular cell only contains the more prevalent species that was detected. Here we are not trying to determine whether proteins in each cell indeed behave differently or whether the observed variation in the diffusion coefficient distribution is simply an effect of the limited trajectories collected in each cell. We instead analyzed data collected from many cells combined to get a better estimation of the population behavior. We have modified our text to make this important point clear to the readers. 

      Moreover, the bound fraction of HIF-1b differs between two independent measurements including three biological replicates each (Fig. 5 C and F). This raises the concern that not enough data enter each biological replicate, or not enough replicates are considered.

      Unfortunately, the number of cells that could be measured in our current setup is limited. It takes approximately 1 hour to collect 20 cells per sample, including staining, washing, looking for cells with desired expression level, and acquiring movies. For experiments with multiple conditions (>12), 20 cells per sample is the upper limit that can fit into a single day. 

      To address the question of what is the minimum number of cells/replicates needed we included in Figure 2—figure supplement 3 - the result of a bootstrapping analysis. We used data collected from a total of 243 cells of the same cell line, from over 11 replicates as the “population” and performed a bootstrapping analysis to identify the source of variation. We have also included appendix 1 with a detailed discussion. Our results showed that cell-to-cell variation contributes most to the total variation of the data, followed by day-to-day (replicate-to-replicate) variation. However, sampling over 800 trajectories, and from over 60 cells, imaged in 3 replicates well approximates the “population value” (bound fraction calculated from 243 cells from over 11 replicates). As a result, in each figure we always used over 60 cells from 3 replicates to generate the reported parameters. Although this approach still gives variable numbers from figure to figure, the variations seen for the same cell line are much smaller compared to the differences observed between different cell lines/conditions. 

      The authors compare the bound fractions among various mutants and experimental conditions. However, the peak diffusion is not, or only descriptively, evaluated. Thus, it is not clear whether the main effect of a mutation or chemical treatment is to change the bound fraction, or rather the diffusion coefficient of the mobile fraction. 

      Since there might be multiple mobile populations (defined as the fraction with a diffusion coefficient > 0.5 μm2/sec), the mean diffusion coefficient can change while the mode (peak) diffusion coefficient stays the same and vice versa. Because of such complexity in the mobile population, we prefer to use descriptive words to report the trend for the change instead of reporting exact values. However, as requested, we have added peak diffusion coefficient information to relevant figures as bar plots. We have also included in Table 1 a summary of mean and mode diffusion coefficient estimated for moving molecules in all relevant figures for reader’s reference. Note that the diffusion coefficient estimation is on a log scale, and the larger the diffusion coefficient, the lower the resolution (e.g, there is 1-grid of difference both between 2.63 and 2.75, and between 9.55 and 10).

      Conclusions:

      The authors provide data that highlight a potential role of the intrinsically disordered domain of HIF in modulating the bound fraction of these transcription factors. They further claim that the intrinsically disordered domains have a main contribution to this bound fraction. However, the autors do not quantify how this contribution relates to those of the DNA binding domain or the dimerisation interface. Changes in bound fraction estimated from the data in e.g. Fig. 3C, Fig. 4C, Fig. 5C and F rather hint to a dominant effect of dimerisation, followed by DNA binding and a smaller contribution of the intrinsically disordered domain. The authors should quantify the relative changes of the bound fraction for all mutants and experimental conditions, to clarify the importance of the contribution of the intrinsically disordered domain.

      It would be ideal if we could quantify what percent of the bound fraction is contributed by dimerization interface, DBD and IDR, respectively. However, it is very likely that these different domains do not act independently of each other in terms of binding to chromatin fibers. In practice, it is very difficult to dissect and quantify these effects independently. For example, we did try to express HIF-1α and 2α with their IDR completely deleted; however, because the protein-degradation signals are within the IDRs, these deletions caused massive stabilization of these proteins, making it impossible to find cells that express these forms at similar levels as the full-length counterpart. As a result, although these IDR-deleted HIF-α show greatly reduced binding, we did not include the results in the paper because the loss of binding could also be due to the overall higher protein expression levels, leading to large unbound fractions. Regarding the DBD mutants, they only have 1 mutation, so it is hard to tell whether the remaining binding in Figure 5B is due to some residual binding affinity of HIF-α (HIF-α only partially lost its binding affinity), or is due to binding through its partner HIF-1β (HIF-α completely lost binding affinity, but can still bind through dimerization with HIF-1β). All we can safely conclude from Figure 5B is that HIF-α DBD is required for optimal binding, but we cannot determine how much exactly it contributes to binding. We thus argue that, given the interdependence of the different protein domains, the reviewer’s request is not experimentally feasible.

      The authors state that the intrinsically disordered domains of HIF determine their differential binding specificity to chromatin. However, the experiments provided do not allow for such a conclusion. In particular, measuring changes in the bound fractions is not sufficient. Such a conclusion requires a method that is able to inform about the DNA sequences involved in HIF binding, for example chromatin immunoprecipitation.

      As requested, we have included new Cut&Run and RNA-seq results in the revised manuscript showing HIF-α-IDR-specific binding and gene activation.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors report a public browser in which users can easily investigate associations between PGSs for a wide range of traits, and a large set of metabolites measured by the Nightingale platform in UKBB. This browser can potentially be used for identifying novel biomarkers for disease traits or, alternatively, for identifying novel causal pathways for traits of interest.

      Overall I have no major technical concerns about the study, but I would encourage the authors to revisit whether they can find a more compelling example that can better showcase the work that they have done. I understand that this is partly a resource paper but I think the resource itself can have more impact if the paper provides a clearer use-case for how it can drive novel biological insight.

      Many thanks for your comments. We have undertaken a new application of bi-directional Mendelian randomization to demonstrate how users may use this approach to disentangle whether associations in our atlas likely reflect either causes or consequences of PGS traits/diseases. This example is described on page 9:

      ‘For example, we applied Mendelian randomization (MR) to further evaluate associations highlighted in our atlas with triglyceride-rich very low density lipoprotein (VLDL) particles. For instance, both VLDL particle average diameter size and concentration were associated with the PGS for body mass index (BMI) (Beta=0.04, 95% CI=0.033 to 0.046, P<1x10-300 & Beta=0.012, 95% CI=0.006 to 0.019, P=2.7x104 respectively) and coronary heart disease (CHD) (Beta=0.026, 95% CI=0.019 to 0.032, P<1x10-300 & Beta=0.035, 95% CI=0.028 to 0.042, P<1x10-300 respectively). Conducting bi-directional MR suggested that the associations with average diameter of VLDL particles are likely attributed to a consequence of BMI and CHD liability as opposed to the size of VLDL particles having a causal influence on these outcomes (Supplementary Table 6). In contrast, MR analyses suggested that the concentration of VLDL particles increases risk of CHD (Beta=1.28 per 1-SD change in VLDL particle concentration, 95% CI=1.25 to 1.65, P=2.8x10-7) which may explain associations between the CHD PGS and this metabolic trait within our atlas.’

      and discussed in the discussion on page 21:

      ‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations are likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that are important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’

      PGS construction: It's unclear how well the PGS work. Should the reader prefer the stringent or lenient PGS? Perhaps there could be some validation with traits that have decent sample sizes in UKBB. Was there any filtering to remove traits with few GWS hits, low sample sizes, or low SNP heritability as these are unlikely to produce useful PGSs?

      An example of validation was previously included for the chronic kidney disease PGS and its association with circulating creatinine, although this has now been removed due to the feedback you provided in your comments below. However, we have now provided the weights for all of the PGS included in our web atlas should users want to use these scores for prediction purposes (page 7):

      ‘The specific weights for clumped variants used in all PGS can be found at https://tinyurl.com/PGSweights.’

      On page 8 we have mentioned that in this work we have used a more lenient threshold to facilitate endeavours in a ‘reverse gear Mendelian randomization’ framework. However, the option to use the more stringent threshold remains an option for users interested in this as an alternative:

      ‘In this paper, we have discussed findings using PGS that were derived using the more lenient criteria (i.e., P<0.05 & r2<0.1), although all findings based on both thresholds can be found in the web atlas.’

      ‘Specifically, we believe our findings can facilitate a ‘reverse gear Mendelian randomization’ approach to disentangle whether associations likely reflect metabolic traits acting as a cause or consequence of disease risk (Holmes and Davey Smith, 2019) as illustrated using triglyceride-rich very low density lipoprotein (VLDL) particles in the next section.’

      We have not filtering based on other criteria such as the number as SNPs given that certain scores, despite only been constructed using few SNPs, may still provide useful to users. For example, our score for ‘Drinks per day’ based on the more stringent threshold (i.e. P<5x10-8) consists of only 6 SNPs. However, one of these is rs1229984, a missense variant located at the alcohol dehydrogenase ADH1B gene region and known to be a strong predictor of alcohol use (e.g. https://pubmed.ncbi.nlm.nih.gov/31745073/).

      Reviewer #2 (Public Review):

      The authors set out to create an atlas of associations between phenome-wide polygenic scores and circulating lipids, fatty acids, and metabolites. To do so, they utilize GWAS from 129 traits available in the OpenGWAS database to derive polygenic (risk) scores (PGS) along with the recently released NMR metabolomics data containing 249 biomarkers (and ratios) in ~120,000 UK Biobank participants. The authors create a publicly available web portal containing PGS to NMR biomarker associations:

      http://mrcieu.mrsoftware.org/metabolites_PGS_atlas/.

      The strength of this study is in the comprehensive nature of the atlas, containing associations for 129 traits phenome-wide, the large sample size of the UK Biobank NMR data, and the use of PGS for prioritising molecular traits for follow-up experiments, which is an emerging area of interest (International Common Disease Alliance, 2020; Ritchie et al., 2021a). To our knowledge this study is the first to explore this for circulating metabolites.

      In its current form the atlas has several limitations, which should be straightforward to address. Notably, results in the current atlas may be confounded by (1) technical variation in the NMR data (Ritchie et al., 2021b), and (2) major biological determinants of biomarker concentrations, including body mass index, fasting time, and statin usage.

      Firstly, thank you for the suggestion to use your ‘ukbnmr’ R package to help remove technical variations from the UK Biobank NMR metabolites data. We have applied it to remove outliers and variation in the individual data due to (1) the duration between sample preparation and sample measurement, (2) position of samples on shipment plates, (3) different equipment (spectrometers) used. This meant that we needed to re-run our entire analysis pipeline for this project from scratch to the updated dataset. Results do not appear to have drastically changed, although nonetheless we have updated results from all downstream analyses in our online web atlas using this updated dataset provided by ‘ukbnmr’.

      Secondly, the reviewer is correct that biological factors, such as body mass index (BMI) and statin usage, are indeed strongly correlated with metabolites levels. However, we are not able to adjust for such biological factors directly in our analyses, given that they are potential colliders in the causal relationship between diseases/traits and metabolites. Statin usage may be caused by both the high genetic liability to coronary artery disease as well as abnormal lipoprotein lipid levels. Likewise, obesity (and changes in BMI) may result from a high genetic predisposition to cardiometabolic disorders and disrupted metabolism. Thus, adjusting for statin usage and BMI will induce collider bias (https://jamanetwork.com/journals/jama/fullarticle/2790247), which creates spurious associations between the disease/trait PGS and metabolites.

      To better illustrate this issue, we have added additional text on page 14 to justify this study design decision as well as added a new figure (Figure 3) to help demonstrate this clearly to the readers. Fasting time on the other hand we believe is unlikely to act as a collider and was adjusted as a covariate in all linear regression models in this work. This is mentioned on page 25.

      …Further, association results for two (of the 129) PGSs, systolic blood pressure (SBP) and diastolic blood pressure (DBP), are invalid (vastly inflated) as the GWASs used to construct these PGSs included UK Biobank samples.

      Many thanks for your suggestion. We have now removed the SBP and DBP PGS from our atlas due to overlapping samples in UKB. Furthermore, our colleagues at the University of Bristol have notified us that the Glioma GWAS data obtained from the OpenGWAS platform was uploaded with incorrect effect alleles. This PGS has also been subsequently removed from the atlas. Additionally, we removed the Alzheimer’s disease (without APOE) PGS because the pleiotropic effect of lipid associated genes is now systematically examined using lipid gene excluded PGS.

      To demonstrate how one might use these PGS to NMR biomarker associations to prioritise (or deprioritise) findings for follow-up, the authors select a biomarker of interest, glycoprotein acetyls (GlycA), to perform bi-directional Mendelian randomization to orient the direction of causal effects between GlycA and traits of associated PGS. However, the conclusions of this analysis are hampered by the heterogeneous nature of the GlycA biomarker, which captures the levels of five proteins in circulation (Otvos et al., 2015; Ritchie et al., 2019), making it a difficult target to appropriately instrument for Mendelian randomization analysis. This, however, does not detract from the broader point the authors make: that PGS can help prioritize molecular traits for experimental follow-up.

      We have now conducted further sensitivity analyses to evaluate the genetically predicted effects of each of the five proteins in the reference you have provided. This is discussed on page 11:

      ‘We also conducted further sensitivity analyses given that the NMR signal of GlycA is a composite signal contributed by the glycan N-acetylglucosamine residues on five acute-phase proteins, including alpha1-acid glycoprotein, haptoglobin, alpha1-antitrypsin, alpha1-antichymotrypsin, and transferrin (Otvos et al., 2015). Using cis-acting plasma protein (where possible) and expression quantitative trait loci (pQTLs and eQTLs) as instrumental variables for these proteins (Supplementary Table 12) did not provide convincing evidence that they play a role in disease risk for associations between PGS and GlycA (Supplementary Table 13). The only effect estimate robust to multiple testing was found for higher genetically predicted alpha1-antitrypsin levels on gamma glutamyl transferase (GGT) levels (Beta=0.05 SD change in GGT per 1 SD increase in protein levels, 95% CI=0.03 to 0.07, FDR=3.6x10-3), although this was not replicated when using estimates of genetic associations with GGT levels from a larger GWAS conducted in the UK Biobank data (Beta=1.6x10-3, 95% CI=-6.9 x10-3 to 0.01, P=0.71). For details of pleiotropy robust analysis and replication results see Supplementary Table 14.’

      There are also several important limitations to the study which cannot be addressed, which the authors discuss appropriately in the paper. First, the NMR data does not provide a comprehensive view of the metabolome - it is heavily focused on lipids and fatty acids. Many small metabolites in circulation cannot be measured by NMR spectroscopy, and further insights must wait for data from molecular profiling efforts planned or underway in UK Biobank (e.g. mass spectrometry). Second, the authors restricted analysis to participants of European ancestries. This a pragmatic analysis choice given (1) the PGSs were derived from GWAS performed in European ancestries, (2) PGS associations are particularly susceptible to confounding from genetic stratification and differences in environment, and (3) the very small sample sizes for which NMR data is currently available in UK Biobank participants. Finally, although a large sample size, UK Biobank is not a random sample of the population: healthy adults are over-represented, meaning PGS to metabolite associations may be different in disease cases or less healthy individuals.

      Overall this study has strong potential, with straightforward to address limitations, and the resulting atlas will provide a useful characterisation of the relationships between NMR biomarkers and polygenic predisposition to various traits and diseases, which can be used by domain experts to prioritise biomarkers or traits for experimental follow-up.

      Reviewer #3 (Public Review):

      Fang et al. created an atlas for associations between the genetic liability of common risk factors or complex disorders and the abundance of small molecules as well as the characteristics of major apolipoproteins in blood. The whole study is well executed, and the statistical framework is sound. A clear strength of the study is the large array of common risk factors and disease analyzed by means of polygenic risk scores (PGS). Further, the development of an open access platform with appealing graphical display of study results is another strength of the work. Such a reference catalog can help to identify novel biomarkers for diseases and possible causative mechanisms. The authors further show, how such a systematic investigation can also help to distinguish cause from causation. For example, an inflammatory molecule readily measured by the NMR platform and strongly associated in observational studies, is likely to be a consequence rather than a cause for common complex diseases.

      However, in its current form, the study suffers from some weakness that would need to be addressed to improve the applicability of the 'atlas'. This includes a distinction of locus-specific versus real polygenic effects, that is, to what extent are findings for a PGS driven by strong single genetic variants that have been shown to have dramatic impact on small molecule concentrations in blood.

      Thank you for your suggestions to help refine our work. In line with this comment, we have repeated all analyses 1) after applying the ‘ukbnmr’ R package as recommending by reviewer #2 to remove technical variations and outliers and 2) conducted sensitivity analyses to remove an established list of lipid gene loci from PGS construction. Full results can be interrogated in the web atlas to evaluate whether PGS association may be driven by locus-specific effects at these regions, which may be particularly informative given the representation of lipoprotein lipid metabolites on the NMR panel. Findings are reported on page 19:

      ‘The polygenic nature of complex traits means that the inclusion of highly weighted pleiotropic genetic variants in PGS may introduce bias into genetic associations within our atlas. To provide insight into this issue, we constructed PGS excluding variants within the regions of the genome which encode the genes for 14 major regulators of NMR lipoprotein lipids signals which captured 75% of the gene-metabolite associations in the Finnish Metabolic Syndrome In Men (METSIM) cohort (Gallois et al., 2019). For details of these genes see Supplementary Table 5).

      For PGS with these lipid loci excluded, anthropometric traits such as waist-to-hip ratio (N=209), waist circumference (N=206) and body mass index (N=205) still provided strong evidence of association with the majority of metabolic measurements on the NMR panel based on multiple testing corrections. Elsewhere however, the Alzheimer’s disease PGS, which was associated with 60 metabolic traits robust to P<0.05/19 in the initial analysis including these lipid loci (Supplementary Table 17), provided no convincing evidence of association with the 249 circulating metabolites after excluding the lipid loci based on the same multiple testing threshold (Supplementary Table 18). Further inspection suggested that the likely explanation for this attenuation of evidence were due to variants located within the APOE locus which are recognised to exert their influence on phenotypic traits via horizontally pleiotropic pathways (Ferguson et al., 2020).’

      …Further, it is unclear how much NMR spectroscopy adds over and above established clinical biomarkers, such as LDL-cholesterol or total triglycerides. This is in particular important, since the authors do not adequately distinguish between small molecules, such as amino acids, and characteristics of lipoprotein particles, e.g., the cholesterol content of VLDL, LDL or HDL particles, the latter presenting the vast majority of measures provided by the NMR platform. Finally, the study would benefit from more intriguing or novel examples, how such an atlas could help to identify novel biomarkers or potential causal metabolites, or lipoprotein measures other than the long-established markers named in the manuscript, such as creatinine or lipoproteins.

      To address these comments, we have added a new example focusing on the granular measures of VLDL particles provided by the NMR data (on top of the examples listed at the start of the response to reviewer document), which as the review points out is one of its strengths of the measures generated by this platform over long-established biomarkers (page 21):

      ‘We likewise conducted bi-directional MR to demonstrate that associations between the CHD PGS and VLDL particle size likely reflect an effect of CHD liability on this metabolic trait. In contrast, the association between the CHD PGS and VLDL concentrations are likely attributed to the causal influence of this metabolic trait on CHD risk, suggesting that it is the concentration of these triglyceride-rich particles that are important in terms of the aetiology of CHD risk as opposed to their actual size. We envisage that findings from our atlas, as well as other ongoing efforts which leverage the large-scale NMR data within UKB, should facilitate further granular insight into lipoprotein lipid biology.’

    1. Author Response

      We appreciate the thoughtful and thorough critique provided by the two reviewers, and generally agree with their assessment. The revised submission will address the issues they raise. In particular, we agree that the framework of the paper should be broadened to include bacteria and the deep literature associated with coincidental selection.

    1. Author Response

      Evaluation Summary:

      The work by Volante et al. studied a new plasmid partition system, in which the authors discovered that four or more contiguous ParS sequence repeats are required to assemble a stable partitioning ParAB complex and to activate the ParA ATPase. The work reveals a new plasmid partitioning mechanism in which the mechanic property of DNA and its interaction with the partition complex may drive the directional movement of the plasmid.

      Thank you for the kind evaluation. But we wonder about the description of the pSM19035 partition system we studied here as “a new plasmid partition system”. This system itself is quite old. The editor might have meant “new” as a subject of a research, but plasmid partition systems involving RHH-ParB proteins have been studied by number of groups for some time, including the Alonso Lab, which has worked on the pSM19035 partition system number of years prior to our current collaboration for this paper. Therefore, we wonder if the term “new” is the most appropriate.

      Reviewer #1 (Public Review):

      This is a very thorough biochemical work that investigated the ParABS system in pSM19035 by Volante et al. Volante et al showed convincingly that a specific architecture of the centromere (parS) of pSM19035 is required to assemble a stable/functional partition complex. Minimally, four consecutive parS are required for the formation of partition complex, and to efficiently activate the ATPase activity of ParA. The work is very interesting, and the discovery will allow the community to compare and contrast to the more widespread/more investigated canonical chromosomal ParABS system (where ParB is a sliding CTPase protein clamp, and a single parS site is often sufficient to assemble a working partition complex). All the main conclusions in the abstract are justified and supported by biochemical data with appropriate controls. A proposed multistep mechanism of partition complex assembly and disassembly (summarized in Fig 6) is reasonable. Perhaps the only shortcoming of this work is that the team does not yet get to the bottom of why four consecutive parS are needed.

      Thank you for the kind evaluation. The last point is an important one. We would like to continue to test our current model to either obtain stronger supporting evidence or come up with better alternative model.

      *Reviewer #2 (Public Review):

      ParBs come in two variations, RHH and HTH. In this study, the authors examine the in vitro behavior of the RHH system, which is less studied. Two activities were carefully monitored; ATPase activation and ParA removal from DNA. The system is quite complex, but the authors have done a good job of examining parameter space. One question concerns the physiological relevance. Can this be assessed by uncoupling ParA/ParB expression (making it inducible with IPTG from the chromosome, for example) and testing plasmids with the various constructs?

      This is an excellent point; we agree this a shortcoming of the current study. As described in response to “Essential Revisions”, we very much wanted to include an experiment testing in vivo plasmid stability for different size parSpSM sites in this paper, and we put a significant effort. However, we encountered certain technical issues with the approach we tried, and we failed to obtain conclusive data in timely fashion before we run out of time. Although, we had preliminary data, which appeared to be consistent with the notion that shorter parS sites are non-functional and full-size parS sites are functional, the experiment had certain flaw, which we could not rectify immediately to our satisfaction. Therefore, we decided to postpone this part of the project and plan for broader physiological evaluation of the parSpSM sequence arrangements in near future. In the revision, we mentioned at the beginning of discussion that in vivo functional test of parSpSM site requirements still remains to be examined.

      The authors appear to suggest that the requirement for at least 4 ParB binding sites is due to the inability of ParBs of this type to spread inferring that for the ParB-HTH multiple ParBs bound to ParS are required. Has this been tested in this system?

      ParB spreading has been shown to be essential for the HTH-ParB to perform its role in partition function. We clarified the importance of HTH-ParB spreading for partition function on lines 44-45.

      In any event, another major difference between the two systems is that a peptide corresponding to the N-ter of ParB is sufficient to bind DNA indicating this type of ParB does not have to be bound to DNA to stimulate ParA. It would have been useful if the authors had commented on this.

      There seems to be a mistype here. “N-ter of ParB is sufficient to bind DNA indicating ……” is incorrect. Perhaps this was meant to be “N-ter of ParB is unable to bind DNA, indicating ……” This is not a qualitative difference between the HTH- and RHH-ParBs: the N-terminal ParA interacting peptides of HTH-ParBs also can activate their cognate ParA ATPase without parS DNA binding, and parS-dependency of ATPase activation for HTH-ParBs appears to be significantly less stringent compared to the case for RHH-ParB we report here. ParBpSM1-27 , which cannot bind parSpSM, could only stimulate ParApSM ATPase to at most 1/10 of the full size ParBpSM in the presence of active parSpSM. We clarified this on lines 156-157, and also added discussion about this contrast between the HTH- and RHH-ParBs and possible implications on lines 458-467.

      Reviewer #3 (Public Review):

      Drs. Volante, Alonso, and Mizuuchi presented a milestone experimental finding on how the distinct architecture of centromere (ParS) on bacterial plasmid drives the ParABS-mediated genome partition process. Rather than driven by cytoskeletal filament pushing or pulling as its eukaryotic counterpart, the genome partition in prokaryotes is demonstrated to operate as a burnt-bridge Brownian Ratchet, first put forward by the Mizuuchi group. To drive directed and persistent movement without linear motor proteins, this Brownian Ratchet requires two factors: 1) enough bonds (10s' to 100s') bridging the PC-bound ParB to the nucleoid-bound ParA to largely quench the diffusive motion of the PC, and 2) the PC-bound ParB 'kicks" off the nucleoid-bound ParA that can replenish the nucleoid only after a sufficient time delay, which rectifies the initial symmetry-breaking into a directed and persistent movement. Although the time delay in ParA replenishment is established as a common feature across different bacteria, the binding properties of PC-bound ParB vary greatly, which begs the question of how Brownian Ratcheting adapts to different cellular milieu to fulfill the functional fidelity.

      The finding in this work presented a new but important twist in the Brownian Ratchet paradigm. The authors showed that in the pSM19035 plasmid partition system, only four contiguous ParB-binding repeats in ParS are required for the ParA-ParB interactions that drive the PC partition. In other words, only four chemical bonds are needed for the PC partition. Crucially, the authors further demonstrated that distinct orientation (configuration?) of the ParB-binding repeats is required for this fidelity by their state-of-art biochemistry and reconstitution experiments. The authors then elaborated on a possible mechanism of how the smaller number of PC-bound ParB can drive directed and persistent PC movement by interacting with nucleoid ParA. If I understand correctly, in their proposed scheme, due to their specific orientation (configuration?), when two of the ParS-bound ParB molecules bind to the two nucleoid-bound ParA molecules there arises a torsional/distortional stress. Consequently, the thermal fluctuations preload the forming bonds, triggering the dissociation of the two ParB molecules from the PC. And the remaining PC-bound ParBs may kick off the ParAs that have a time delay in DNA-rebinding, while ParB molecules will replenish the ParS to initiate the next round. In this proposal, the key conceptual leap is that not only the substrate but the cargo remodels to underlie the Brownian Ratcheting.

      We thank the reviewer for kind evaluation of our work. The model proposed is highly speculative at this point. Despite it may appear rather detailed in order to account for the unexpected findings, we consider it only a working hypothesis to be revised or replaced by a better model in future. We thank for many useful suggestions, which we will follow in our revision.

    1. Author Response

      REVIEWER #1 (PUBLIC REVIEW):

      The study by Monterisi et al. reports that loss-of-function mutations in metabolic pathways do not necessarily have a negative impact on cancer growth. The authors suggest that small solutes transferred via gap junction channels formed between wild-type cells and cells express mutants defective in metabolic pathways rescue the metabolic-deficient cancer cells. Through the examination of multiple human cell lines with several advanced means to determine gap junction coupling, Cx26 was identified as a major connexin molecule involved in medicating gap junction coupling between colorectal cancer (CRC) cells. The gene mutations of three metabolic gene mutations were investigated for major metabolic function of the cell, pH regulation, glycolysis and mitochondrial function.

      Strengths: The paper tests a new hypothesis that the mutations that inactivate key metabolic pathways do not incur functional deficits in cancer cells expressing the mutants due to their gap junction coupling to wild type cells.

      From microarray data they identified multiple connexins expressed in various CRC cells. Several advanced analyses were used to assess gap junction coupling in CRC cells including fluorescence recovery after photobleaching (FRAP). The extent of permeability at steady state was evaluated using CellTracker dyes and coupling coefficients were determined. They also used flow-cytometry to study dye transfer, which will provide a quantitative, dynamic means for study cell coupling. The data showed that knocking down Cx26 could greatly reduce diffusive exchange in most of the CRC cells tested.

      The study focused on three metabolic genes, Na+/H+ exchanger NHE1, a regulator of intracellular pH, a glycolytic gene, ALDOA and mitochondrial respiration gene, NDUFS1. These genes were knocked out in the selected CRC cells highly expressing these genes. The co-culture studies were well executed with fluorescence-markers distinguishing the WT and knockout cells and well-defined readouts such as intracellular pH, media pH, glucose/lactate levels and mitochondrial O2 consumption and glycolytic acid.

      The experiments in general were well designed and conducted, and the data supported the conclusions. The paper is also logically written and figures were well presented providing clear graphic illustrations.

      Thank you for recognising the strengths and novelty of our findings.

      Weaknesses: Although the hypothesis is innovative, no clear justification is provided that illustrates the scenario representing the clinical situation. The remaining questions include: What kind of somatic mutations in cancer cells has little impact on their growth and progression?

      We have now added in vivo data (Fig 8) and revised the Introduction and Discussion to address this point. Briefly, the broader clinical relevance our findings relates to the notion of essential genes and their negative selection. We show that connexin-dependent coupling can rescue a genetic deficiency, provided the mutation-carrying cell can access wild-type neighbours for the missing function. This rescue effect is limited to processes that handle solutes that can pass via connexin channels, i.e. metabolic processes. As such, sporadic loss-of-function mutations in “essential genes” may not produce a functional deficit in human cancers. We demonstrate rescue extensively in vitro, and now in a xenograft model.

      We argue that our work can explain why certain metabolic genes are essential in vitro, but not in vivo. In monolayers of mutated cells, diffusion across gap junctions cannot rescue the mutant phenotype, because there is no wild-type cell available to supply the missing function. In contrast, mutations in vivo will arise sporadically and wild-type cells are typically available to couple onto the mutation-bearing cell, providing it with functional rescue. Thus, only in the former case would the lethality of essential genes emerge.

      Indeed, many notable studies have found genes of various metabolic pathways to be essential for growth in vitro. Such genes would be expected to undergo negative selection in vivo, but this is exceedingly rare according to multiple observations. By demonstrating metabolic rescue in co-cultures (i.e. a setting closer to the tumour) and (now) in xenografts, our work provides an explanation for this apparent paradox. Indeed, cells such as NDUFS1-negative SW1222 grow very, very slowly in culture compared to wild-type cells and require regular media changes to keep pH alkaline. However, coupling onto wild-type cells can rescue knock-out cells in vitro and in vivo. We argue that this finding explains why loss-of-function mutations in NDUFS1 (and similar genes) do not undergo negative selection in human tumours (despite in vitro predictions).

      The three proteins selected for this study were chosen to represent very distinct types of solute-handling processes. We illustrate our point in a (new) summary figure in Fig 8.

      What types of WT cells, within the tumour cells or with neighbouring normal cells? Whether the current experimental design closely recapitulates the scenario in vivo?

      Indeed, we find that stromal fibroblasts may also support cancer cells via gap junctions, as this is essentially the same concept (i.e. coupling onto a cell with wild-type genes). However, we feel that expanding our present submission to fibroblasts would make the volume of data exceeding large. Also, the methods we use for fibroblasts are different, and require a full manuscript on its own. For example, there is the issue of how to control for the radically different growth rates of fibroblasts and cancer cells. We chose co-cultures of WT and genetically altered CRC cells so that the co-cultures are of the same background, with just one element changing (i.e. the metabolite-handling gene). This makes our data easy to interpret, and thus strengthens our case. Our in vitro experiments were performed on monolayers, where cells can make contacts in 2D. In vivo, these contacts will spread in all dimensions, thus connectivity is likely to be even more significant. If anything, monolayers probably under-estimate the importance of connectivity, but this preparation is more accessible for studying cell-to-cell communication.

      We recognise the importance of adding in vivo data to firm our conclusions. To that end, we have analysed xenografts established from co-cultures of wild-type DLD1 and NDUFS1-KO SW1112 cells on one flank of a mouse, and Cx26-KO DLD1 and NDUFS1-KO SW1112 cells on the other flank. This experiment tested whether Cx26-dependent connections between mitochondrially-defective NDUFS1 KO SW1222 cells and respiring DLD1 cells (on left flank only) are able to stimulate growth of the former (GFP-tagged). Indeed, NDUFS1-deficient cells grew faster when rescued by Cx26-expressing DLD1 cells. In contrast, their growth decelerated when DLD1 cells were Cx26-negative. We include these experiments and their controls in Fig 8.

      The readouts for co-culturing for glycolytic ALDOA and NDUFS1 knockout are only cell mass, without determining the more relevant markers, glucose/lactate and mitochondrial O2 consumption and glycolytic acid production.

      Our readouts are two-fold: total biomass and the size of the genetically altered compartment of co-cultures (GFP). We can therefore follow the relative growth of KO cells, which is essential for describing their growth (dis)advantage. We appreciate other markers are informative. Indeed, we characterised KO and WT cells in terms of O2 consumption and acid production in Fig 7. However, it would not be possible to measure glucose consumption selectively in GFP-positive KO cells of a co-culture, as the assays available for this measure ensemble rates for the entire population of cells (e.g. in a single well). Nonetheless, we believe that biomass as a readout is highly relevant to cancer, and we hope the reviewer concurs with us.

      The study needs to include cells without functional gap junctions like the characterized negative control RKO cells.

      This is an excellent suggestion, and we have added data for RKO cells to several figures. As expected, these do not form a syncytium and cannot rescue genetic defects in co-cultured cells. New data are shown in Fig 3G-H, Fig 6-supp2 and Fig 7H, adding to existing RKO controls in Fig 2A/B. Briefly, RKO cells do not exchange CellTracker dyes in monolayers (Fig 3F/G), cannot rescue cells that are ALDOA-deficient (Fig 6-supp2), and cannot rescue NDUFS1-deficient cells (Fig 7H). We also added Cx26-KO DLD1 cells to the CellTracker experiments in Fig 3.

      REVIEWER #2 (PUBLIC REVIEW):

      This paper is a logical extension of the 50 year-old concept of the "bystander effect" in tumours, wherein the effects of anti-tumour chemotherapeutics extend beyond the cells that take them up due to spread through gap junctions to adjacent cells. In this case, however, the authors have creatively realized that the reverse might also occur, and that tumour cells with otherwise fatal mutations in essential metabolic pathways can be rescued by their neighbours through passage of the missing metabolites through gap junctions. This can explain why mutations in other critical pathways, such as protein synthesis and transporters, are selected against in rapidly growing tumours, but others in equally critical pathways of glycolysis, electron transport, etc. are not, despite these genes having been demonstrated to be essential in in vitro KO studies (where all cells in the plate have the critical gene knocked out). A series of elegant experiments are used to test this proposal in several colorectal cancer (CRC) cell lines using three examples - pH regulation (defective Na+/H+ exchanger NHE1), glycolysis (defective Aldolase A (ALDOA)) and oxidative phosphorylation (defective Complex 1 - NDUFS1).

      Thank you for these positive comments. We have added key references to the bystander effect in the Introduction, and explain how our findings build on these milestones.

      The authors first determine the levels of different Cx proteins expressed in each cell line, and determine that for most Cx26 and 31 are dominant, although come lines have a subset of cells with high Cx43 expression. They then use Cell Tracker Green to pre-label cells and use FRAP as a means to measure how well the cell population is coupled. This is a useful measurement but is significantly over-interpreted by the authors as a "permeability" in uM/min. This is not really a permeability, which requires knowledge of the concentration gradient of the permeant species, relative cell volumes, etc. Rather it is a rate of fluorescent recovery that is presumably correlated with, but not quantitatively related to, levels of coupling.

      Thank you for this comment. We would like to explain why we believe our FRAP experiments are able to estimate permeability in units of um/s. The rate of recovery of a solute in a cell following its “destruction” (here, photobleaching) is given as follows:

      dCcell/dt = p⋅P(Ccell-Csurround) … [1]

      Where subscripts ‘cell’ and ‘surround’ refer to the cell and its neighbours. P is the permeability of the barrier between these two compartments, and p is the ratio of the surface area of the barrier (i.e. membrane) to volume of the bleached cell. Within a “bleached” cell, we measure fluorescence.

      Since fluorescence (F) is proportional to concentration (C), we can substitute:

      C = α⋅F

      where α is a constant of proportionality. Thus, the rate of recovery (L.H.S. of equation [1]) becomes:

      dC/dt = d(α⋅F)/dt = α⋅dF/dt … [2]

      And the R.H.S. of equation [1] is re-written as: P⋅(Ccell-Csurround) = P⋅(α⋅Fcell-α⋅Fsurround) = α⋅P⋅ (Fcell-Fsurround) … [3]

      Putting [2] and [3] together,

      dFcell/dt = p⋅P⋅(Fcell-Fsurround)

      Prior to photobleaching, there are no (net) gradients, thus initial Fcell and Fsurround are equal.

      Thus, we can re-write the equation in terms of normalized fluorescence (f=F/F0):

      dfcell/dt = p⋅P⋅(fcell-fsurround)

      P can therefore be expressed as:

      P = dfcell/dt / (p⋅ (fcell-fsurround))

      Here, dfcell/dt is measured from the fluorescence recovery time course and fcell-fsurround is measured experimentally (in fact, bleaching in the cell is set to 50%, thus this takes the value of 0.5 by default). We can approximate the monolayer as a network of cuboidal cells. The cell’s volume is thus ‘area’ times ‘height’, and the cell’s surface (at which it contacts its neighbors) is the ‘cell’s perimeter’ times ‘height’. Thus, for the bleached cell,

      p = perimeter × height / area × height = perimeter / area.

      The perimeter and area can be measured from the acquired fluorescence images. Thus, we can describe permeability using data obtained from image stacks. We appreciate that this method makes certain geometrical approximations, but we believe these are not unreasonable. We explain the assumptions and calculations in Appendix 1. More information about the method is published by us in https://pubmed.ncbi.nlm.nih.gov/28368405/. Of course, we accept that these calculations are less accurate than, say, electrical conductance measurements, and to that end, we added a note of caution to the main text.

      This fluorescent recovery is shown to be sensitive to siRNA KO of Cx expression, but strangely its reduction is only correlated with KD of Cx26 in the 5 cell lines examined. KD of Cx43 (in LOVO cells) and Cx31 in all 5 cell lines had no effect or in some cases seemed to increase the rate of recovery (DLD1 and SNU1235). This is a notable finding, yet the authors choose to completely ignore it and continue with Cx26 KDs in studies of specific metabolite transfers. Some discussion should be included as to why KD pf these Cxs has no effect or causes an apparent increase in coupling of the cells.

      The effectiveness of GJB2 knockdown in ablating ensemble connectivity is most likely a reflection that Cx26 is likely the dominant conductance inherited from the parent epithelium. Other isoforms are expressed, but in most CRCs cells, these do not produce major coupling, as GJB2 knockdown was sufficient to uncouple many CRCs. These observations justify our choice of connexin for studying metabolic rescue functionally. These findings are also consistent with the good correlation between ensemble connectivity and GJB2 levels.

      Our data show a trend that GJB3 (Cx31) KD in DLD1 and SNU1235 cells and of GJA1 (Cx43) KD in LOVO cells produce an increase in coupling. However, when analysed by hierarchical (nested) analysis, these effects are not statistically significant, and for that reason we did not elaborate on these trends in the original submission. The apparent increase in conductivity in cells treated with GJA1 or GJB3 siRNA could reflect a compensatory response to the ablation of a specific message, closer contacts between cells allowing Cx26 to strengthen its connections, or a shift away from heterotypic channels involving Cx26 and Cx31/Cx43, towards homotypic Cx26. We did not see any consistent change in the intimacy of cell-cell contacts. We now performed western blots for connexins to probe for compensatory changes (see Fig 2-supp1). In comparison to wild-type cells, expression of Cx31 was not changed by GJB2 (Cx26) or GJA1 (Cx43) knockdown in DLD1 cells. GJB2 KO DLD1 cells did not induce expression of the other major isoform, Cx43. Also, in DLD1 cells, KD of GJB3 or GJA1 did not substantially change Cx26 levels. Similarly, KD of GJB3 did not affect Cx43 levels. In GJA1-high C10 cells, KD of GJB3 did not alter Cx43 levels, although a small decrease was observed with GJB2 KD on Cx43. Also in C10 cells, KD of GJB2 and GJA1 did not induce an increase in Cx31 levels.

      We agree that complex interactions between connexin genes are possible, but we feel that a molecular study of Cx gene regulation would fall outside the scope of the present manuscript. Our findings point to a prominent role of Cx26 in metabolic rescue, and to strengthen this point, we show that Cx26-negative cells that express other connexins (e.g. C10 cells or NCIH747 cells) cannot rescue ALDOA-deficient counterparts or NDUFS1-KO SW1222 cells (new data in Fig 6 and 7). We share the Reviewer’s enthusiasm about the interplay between connexins and will endeavour to study this further in the near future.

      Rather than just focus on acute transfer of dye between cells, the authors develop a system using 50/50 mixes of cells labelled with two junctionally permeant dyes and measured the degree of mixing at equilibrium (48 hours). This is presented as a "coupling coefficient", but how it is calculated, and its significance is not well described, and does not correlate with the historical use of this term in the literature. Nonetheless, the studies do seem to demonstrate a good degree of equilibration, although it would have been informative to determine of the cells that do not exchange dyes express Connexins. To document that this equilibration requires gap junctions, the authors employ low density cultures, which significantly decrease dye exchange. However, in at least one cell line (SW1222) dye exchange is only reduced by <50%, indicating a very high background to this assay. This is not addressed.

      Thank you for these comments. We agree that our description of the method was inadequate, and we have added the necessary information in Appendix 1. We have also added information about actual confluency and restructured the figure. We also added new data for RKO cells and DLD1-Cx26 KO cells, i.e. two negative controls (Fig 3H). We pondered about the best name for describing the numerical output of the method, and concluded that “coupling coefficient” is reasonable (provided we improve our description of it) because it is dimensionless, and like many coefficients has a finite range (here, zero to one). With further explanation, we hope this terminology is acceptable. The issue with SW1222 cells is that both low- and high-seeding densities produce clusters of cells. Even though overall cell numbers were different in high and low seeded cultures, actual connectivity within “islands” of cells remained high, hence their similar coupling coefficients (see Fig 3E). Indeed, this CRC line is unusual in this behaviour, so we only present data from the higher density.

      The most compelling part of the study is the use of reporters to directly demonstrate a role of Cx26 coupling of cells to rescue cells with mutations of the three genes mentioned above when mixed with normal neighbours. This case was most convincing in the cases of ALDOA and NDUFS1, with the data for the pH regulation requiring more explanation for full understanding of the data shown (e.g. Figs 7 G and H).

      Thank you for this comment. Studies of pHi regulation provide a unique opportunity to obtain single-cell resolution (unlike e.g. glycolytic assays). We took advantage of this, and therefore the figure on pHi presents a greater depth of analysis. Nonetheless, we agree the pH data need further explanation. We have expanded the text, and also added a bar plot of data on day 7, which now provides a clearer illustration of the rescue effect. This form of presentation was also adopted for ALDOA and NDUFS1 experiments in the subsequent figures.

      Overall, the study does a credible job of demonstrating that Cx26 coupling of CRC cells serves to rescue cells with mutations in critically necessary metabolic pathways, presumably due to transfer of metabolites from surrounding wt cells. However, some of the results indicate this is not a simple process where all connexins behave similarly, and some effort should be made to investigate if Cx31 and 43, which do not seem to play the same roles in maintaining cell coupling as Cx26, also play any role in such metabolic rescue.

      Thank you for this comment. We have addressed this by selecting three additional cell lines for study: RKO – a cell line with no major Cx expression; C10 – a cell line that expresses Cx43, but very low levels of Cx26; NCIH747 – a cell line that expresses Cx31, but low levels of Cx26. These additional experiments cover lines that are GJB2 (Cx26)-low/negative to test whether metabolic rescue is best achieved with Cx26. Our new data show that these cells are unable to rescue metabolic defects (new data provided in Fig 6H/I, Fig 7H, and Fig 6-supp2). These findings strengthen our case for a major role of Cx26, at least in CRC networks. Indeed, recent analyses by Robert Gatenby and colleagues have shown that mutations in GJB2 (Cx26) are exceedingly rare in cancer (a property not shown for other connexins genes). This is interpreted to mean that Cx26 plays a particularly prominent role, ostensibly for metabolic rescue.

      REVIEWER #3 (PUBLIC REVIEW):

      Strengths of the study include that it appears to be a careful and well thought out set of experiments. The analysis and treatment of multiplexed data is also sophisticated. For the most part, the work is clearly and logically described, as well as well illustrated. In general, the authors achieved their experimental goals, and the methods while not entirely new, do provide new twists and augmentations that should be useful to the field. A general weakness is that this is not entirely a new story. Instead, it is a variant of one of the oldest concepts in the field of gap junction biology i.e. "Metabolic cooperation". The term "Metabolic cooperation" (i.e., as mediated by gap junctions) was not mentioned by the authors, but it is a long-established and foundational concept in the field. Indeed, in a classic paper by Gilula and colleagues published in 1972, the experimental approach used was similar to that of the study in hand. These earlier authors showed how transformed cell lines with deficiencies in hypoxanthine metabolism can be "rescued" by "metabolic cooperation" in co-culture with metabolically competent cells via passing a gap junctional permeant molecule. This and other relevant papers were not cited. More importantly, the extant literature places the onus on the authors to explain and convince reviewers why this study is more than an incremental step.

      We apologise for not quoting these important and classical references. We have now added these works to our reference list (quoted in Introduction). At the time of these seminal discoveries, Loewenstein and colleagues made a case that connexins are absent in cancer, and this belief persisted for many decades. More recently, the role of gap junctions in cancers has garnered attention. With new gene manipulations (e.g. CRISPR/Cas9) and imaging techniques and improved xenografting, it is now possible to precisely study the impact of GJ on cancer metabolism. Moreover, we have a wide panel of cancer cell lines to study, and identify the prominent role of Cx26. We highlight that our study is the first to offer a mechanistic explanation for the absence of negative selection in cancer, a phenomenon which was not known in the 1970s. To strengthen our novelty, we now add in vivo data to Fig 8 that confirm in vitro findings.

    1. Author Response

      Reviewer #1 (Public Review):

      1. “The major weakness of the study is that with the interpretation of the results. The changes in tractography, behavior and TBM are what would be expected following lesions of the neostriatum”

      We appreciate this comment and would like to offer clarification. We respectfully disagree that the pattern of results presented in this manuscript are akin to what would be expected following striatal lesions. In NHPs, striatal lesions typically cause more extreme phenotypes than what we observed in our 85Q-treated animals. In macaques, bilateral putamen lesions can result in phenotypes that include seizures, inappetence, hyper-aggression, and other severe features.  This strongly impacts clinical scores and can make it unfeasible to care for the animals for multiple years. For these reasons, recent NHP HD lesion models have used only unilateral putamen lesions coupled with bilateral caudate lesions to model HD (as in the recent paper by Lavisse et al, 2019). Of additional relevance is that even the cognitive effects of these striatal lesions are more severe than what we observed in our 85Q-treated animals: for example, Lavisse reported reduced performance on similar “prefrontal” cognitive tasks by ~50%, whereas our AAV-HTT model exhibited only ~10% reductions in working memory. This mild, but significant, change in cognitive performance and motor function seen in our 85Q animals is much more akin to that which is observed in the early stages of HD.

      2. “The results have been interpreted as showing a progressive model, although evidence that there is progression is limited”...“begs the question as to whether or not the 85Q-lesioned monkeys would recover to a level similar to the 10Q animals if left for another 12 months”

      At the request of Reviewer 1, we added an additional 30-month timepoint and re-ran all of the analyses to include these new data.  All of the behavioral and neuroimaging data were re-analyzed with this final timepoint included (see Lines 125-141, 146-163, 173-194, 228-255, 270-294, 314-345). Additionally, due to the unidirectional nature of our hypothesis and on the advice of our bio-statistician, we applied one-tailed tests to the planned comparisons in this revision. To address the Reviewer’s point directly: 85Q-treated animals showed minimal evidence of functional recovery between the 20- and 30-months timepoints on the behavior tasks. In particular, working memory deficits measured with SDR and fine motor skills measured with Lifesaver Retrieval did not improve between 20- and 30-months (Figure 1C and 1F). Additionally, neurological rating scores in group 85Q remained consistently elevated (in the 5-7 range) between the 20- and 30-month timepoint. Taken together, we feel confident that these results do not show evidence of any significant functional recovery, out to 2.5 years (30-months). In terms of the longitudinal trajectories of the behavioral measures, we appreciate the Reviewer’s feedback regarding the use of the term ‘progressive’ and have tempered our language appropriately. We removed all instances of the word progressive/progressed except in the context of the motor rating scores, which show a significant Group x Timepoint interaction and demonstrate a clear progression.

      3. “The whole manuscript is written as though this is a genetically-relevant progressive model of HD. But the animals are normal, and so there is no genetic context relevant to HD”

      We thank Reviewer 1 for this comment. We recognize that viral-based animal models of HD, including the model characterized here, are not as genetically similar to the human condition compared to some of the other modeling approaches currently under investigation (ex. knock-in and gene editing). Limitations of the AAV-based HTT85Q model include: 1) vector packaging restrictions that prohibit expression of full-length HTT, 2) the use of a CAG promoter vs. an endogenous promoter that leads to overexpression of the transgene, 3) the use of cDNA versus genomic DNA excludes introns and therefore lacks the ability to produce alternatively spliced variants (ex, Exon 1), 4) the use of a mixed CAG-CAA repeat may preclude the possibility of somatic instability and 5) expression of HTT that is restricted to specific brain regions and cell types. All of these important limitations have been added to the discussion section in this re-submission (Lines 503-517).

      Despite these limitations, we feel that this AAV2:AAV2.retro-HTT85Q based model has some features that make it genetically-relevant to human HD including: 1) the expression of an N-terminal fragment of human HTT (N171), 2) the N-terminal fragment bears a pathological PolyQ expansion (85Q), 3) the expressed mHTT fragment forms neuronal aggregates that can be detected in the nucleus, 4) mHTT fragments are expressed in many of the same brain regions where aggregates are detected in human HD cases, with both regional and sub-regional specificity (ex. higher expression in anterior vs posterior cortical regions and expression primarily limited to deep cortical layers V/VI) and 5) expression of mHTT fragments in these regions leads to many of the same pathological and behavioral changes observed in HD patients.  Importantly, expression of the N-terminal portion of HTT allows for the evaluation of HTT lowering therapeutics that target first 3 exons (ASOs, miRNAs, zinc finger repressors, CRISPR-based therapies, etc), which cannot be evaluated in lesion-based models.

      4. “The authors state in the Abstract that the injection resulted in "robust expression of mutant huntingtin in the caudate and putamen". These data are not in the manuscript.”

      Evidence of mHTT expression in the caudate and putamen, as well as several other brain regions, via immunohistochemical and immunofluorescent staining is now included in the manuscript. Please see additions to the methods, results and discussion sections regarding these findings, as well as a new Figure 5, (see Lines 347-376, 756-788). Additionally, further details regarding an associated PET imaging study in this same cohort of animals using a mHTT aggregate-binding radioligand has been added to the discussion, (see Lines 437-443). Please also see response #13 (below).

      5. “The authors chose to use a fragment of the HD gene, with a very long repeat that is seen only in juvenile patients”

      Comments regarding the need to use a fragment of the HTT gene, versus the full-length gene, due to packaging constraints of the viral vector, were added to the discussion in the context of limitations (Lines 503-517), and also discussed above in response #3.  The choice to use a CAG repeat length of 85 (83 pure CAGs followed by a CAA/CAG cassette -see response #17 below for further details), was based off previous studies wherein similar CAG repeat lengths were used to create animal models of HD over the past several decades. Interestingly, while CAG repeat lengths in patients with adult-onset HD typically range from ~40-60, longer repeat lengths (>60) are typically required in animal models of HD to elicit pathological and behavioral manifestations of disease: transgenic, knock-in and viral vector-based rodent models (ranging from 72-150 CAGs), OVT73 transgenic sheep model (73 CAGs), transgenic and knock-in minipig models (ranging from 85-150 CAGs), transgenic and viral vector-based macaque models (ranging from 82-103 CAGs). See Ramaswamy et al, 2007 and Howland et al, 2021 for thorough reviews of these models.

      6. “For their cognitive testing, the authors used a task (delayed non-match to sample) that measures object recognition and familiarity. Before surgery, only 11/17 of the animals were successfully trained to complete this task. It is not clear how useful the data are when only 64% of the animals can be included.”

      We appreciate the Reviewer’s concerns and have decided to conservatively remove this data from the revised manuscript.

      7. “It is not clear how this monkey model will be useful for developing either disease biomarkers or therapeutic strategies for HD (as stated in the abstract)”. “The authors state that they hope the model will become a widely used resource. This seems an unlikely scenario, given the limitations of the current study and the challenges associated with using monkeys. They say that a major advantage of their technique is being able to generate large numbers of monkeys. But this is not a relevant argument if the usefulness of the model to investigate HD is not proven.”

      We thank the reviewer for requesting clarification on these important points. We believe that this model will be useful for developing therapeutic strategies because the HTT85Q-treated macaques express mutant HTT, along with HTT aggregates, in several key brain regions that are affected in human HD, along with undergoing regional gray matter atrophy and white matter microstructural alterations that correlate well with behavioral dysfunction. Studies currently under review elsewhere also show reduced dopamine neurotransmission and regional hypometabolism via PET imaging in this model. Together, or individually, these imaging and behavioral changes can serve as outcome measures when screening potential therapies. Possible therapeutic interventions that are amenable to screening in this model are included in the discussion.

      Regarding biomarker development, we have already engaged in PET imaging biomarker development in this model in collaboration with the CHDI foundation and the Molecular Imaging Center at the University of Antwerp, evaluating a candidate radioligand that binds to aggregated mHTT. See #13 below for a more detailed description of this PET study, including recent data showing its ability to bind to aggregated species of mHTT in several brain regions in this same cohort of HTT85Q macaques that correspond to 2B4 and em48 IHC staining (a manuscript describing these results has been prepared for submission and the PDF is included for the reviewers to peruse).

      The authors do envision this AAV-based macaque model becoming a resource for the HD research community. While this model does have certain limitations (now detailed in the Discussion), we respectfully assert that all of the HD animal models, both small and large, each have their own important limitations to consider when deciding on which to use to screen therapeutics. Selecting a specific animal model based on the individual scientific questions being asked will be required, and employing a combination of models may be an even more prudent strategy.

      While NHP research presents unique challenges (cost, housing requirements and recent challenges in availability, among them), we believe that viral vector-based NHP models could be more accessible to the HD research community compared to some of the other established large animal models; in that they may able to be readily created at contract research organizations (CROs), in addition to various academic research institutions. There are now many CROs that exist in the US, and elsewhere around the world, that have developed specific expertise in MRI-guided, intracranial delivery of AAVs into the NHP brain (including the caudate and putamen), in the context of assessing therapeutic interventions for a variety of neurological disorders (HD, PD, and MSA, among others). Most of these same CROs also have expertise in NHP imaging (MRI/DTI) and behavioral assessments across multiple domains. It seems feasible that AAV-mediated HD macaques could be produced in sufficient numbers to appropriately power therapeutic studies, using the outcome measures established in the current study.

      Reviewer #2 (Public Review):

      1. “The major weaknesses are the manner in which the data is presented”

      We replotted all of the figures with improved color palettes and larger font sizes to make them easier to read. We also added additional details throughout the results section to aid in clarity and improve readability.

      2. “The authors would benefit from talking more about their model in the introduction and including references to some key points. For example, there has been critical new data in the field showing the importance of poly (CAG) in disease, not necessarily poly(Q), and the community will want to know (and not be required to look up), the nature of the transgene. Is it a pure CAG repeat? A mixed repeat? If it is pure, do they see or could they measure somatic expansion in the various brain regions impacted? How does that data match the phenotypes seen? Since this is a transgene, there is no possibility for the exon1/intron1 splicing variant to appear - how does this impact their interpretation”

      Further details regarding the transgene have been added to the Viral Vector Section of the Methods (Lines 531-550). The repeat is not pure and contains a single CAA interruption. The glutamine encoded repeat for HTT85Q contained 83 pure CAG repeats, followed by a single CAA/CAG cassette, while the glutamine encoded sequence for HTT10Q contained 8 pure CAG repeats followed by a single CAA/CAG cassette. Both constructs contained a proline stretch distal to the glutamine repeat in the following allelic conformation where QT represents the total glutamine length:

      HTT85Q: QT\=85, (CAG)83(CAACAG)1(CCGCCA)1(CCG)7(CCT)2

      HTT10Q: QT\=10, (CAG)8(CAACAG)1(CCGCCA)1(CCG)7(CCT)2

      There are plans to probe for somatic expansion in various brain regions, including the caudate and putamen, as well as several distal cortical regions. That analysis is ongoing and not in the scope of the present manuscript; however, these analyses are now mentioned in the discussion section (lines 540-560), as well as a discussion on the ability to either remove or duplicate the CAA/CAG cassette to potentially increase or decrease the rate of disease progression, respectively, based on the work of Ciosi et al. 2019. Additionally, Reviewer 2 is correct in that the lack of intronic sequences in the transgene precludes the formation of splicing variants, such as the exon1/intron1 variant, which we know is pathological based on the work of Bates et al. This drawback has been added to the discussion, along with other limitations of this viral vector-based model (Lines 503-517).

      3. “What about RAN translations? Is RAN translation noted at all in this over-expression model? How does that contribute (or not) to the progressive phenotype they see in their NHPs?”

      We are also curious regarding the assessment of toxic protein products from RAN translation of the expanded repeat sequence in this model. These studies are planned, and the results of these assays will be included in a future manuscript describing other ongoing post-mortem evaluations in this model.

    1. Author Respose

      Reviewer #1 (Public Review):

      This manuscript reports a new genetically encoded neuronal silencer BoNT-C. They show that it fully blocks neurotransmission in two classes of Drosophila motor neurons (Is and 1b; tonic and phasic, respectively). They also update a GCaMP postsynaptic reporter SynapGCaMP to express GCaMP8f instead of 6f. They selectively silence 1b or 1s neurons to disambiguate the neurotransmission properties of each neuron. Finally, they show that silencing either 1b or 1s neurons does not induce heterosynaptic structural or functional plasticity (only neuron ablation triggers plasticity). The data are convincing. The new silencing tool will be widely used.

      We thank this reviewer for his positive assessment of our study and for highlighting the utility of the new silencing tool presented in this study.

      Reviewer #2 (Public Review):

      The conclusions of this paper are properly supported by the provided data.

      Overall this work opens a new window to examine novel aspects of heterosynaptic structural and functional plasticity.

      We also thank this reviewer for his positive assessment of our study and for putting the importance of our findings in context.

      Reviewer #3 (Public Review):

      The strength of the manuscript by Han et al. is the comprehensive characterization of BoNT-C, showing that it truly abolishes all evoked and mini responses without structural alteration of the NMJ. Based on this, the authors then show that ablation of all neurotransmission in either Ib or Is does not cause any compensatory changes (neither functional nor structural) in the 'other' (i.e. looking at Is when silencing Ib or looking at Ib when silencing Is).

      The weakness of the manuscript lies in the modest gain over the previous work. Specifically, Aponte-Santiago had already shown that many parameters are not changed (in Ib when Is is perturbed, or in Is when Ib is perturbed), including that 'the Is terminal failed to show functional or structural changes following loss of the coinnervating Ib input' (quote from 2020 paper). Hence, the only major difference is that Han et al now show that Ib also does not really change when Is is silenced. Aponte-Santiago also clearly showed a ~50% EJP reduction when Is or Ib are perturbed alone, and adding these two equals wild type. The highly emphasized finding of Han et al. that (quote) ' composite values of Is and Ib neurotransmission can be fully recapitulated by isolated physiology from each input' quite obviously follows from the one key finding that one does not affect the other, as mentioned above in the strengths. The wording is a bit odd, but really adding Is (with Ib perturbed) and Ib (with Is perturbed) inputs is really not adding much over either the main finding nor the previous work.

      We thank this reviewer for his/her/their assessment of our study and for highlighting the strengths in characterizing the impact of BoNT-C expression at the NMJ. We also understand and appreciate the criticisms raised. It is important to note from the outset that the motivation and central goal of this study was not primarily to mechanistically dissect heterosynaptic plasticity between tonic and phasic motor inputs at the Drosophila NMJ. Rather, it was to develop an approach that would, for the first time, enable accurate isolation of complete neurotransmission from entire MN-Is or MN-Ib NMJs (both miniature and evoked transmission). By the reviewer’s own admission, we were entirely successful at achieving this central goal in our comprehensive characterization of BoNT-C.

      Next, the reviewer raises the valid question about whether this achievement is a significant advance over previous work, and discusses recent experimental findings regarding heterosynaptic plasticity at the fly NMJ. We want to emphasize here that having a tool that is capable, for the first time, of accurately discriminating complete transmission from Is vs Ib alone is a major advance, one that it is not clear the reviewer sufficiently appreciates. As summarized in Fig. 1, no previous attempts have been successful in accurately isolating synaptic transmission between Is vs Ib synapses. In particular, no previous approach was capable of isolating miniature activity from Is vs Ib, and as we show in our manuscript, miniature events exhibit major differences between the two inputs. Thus, without isolating miniature transmission, one cannot know baseline synaptic function in Is vs Ib nor whether heterosynaptic functional plasticity has been induced. Further, we detail major confounds with some of the previous approaches the reviewer alludes to in prior studies, including selective optogenetic stimulation.

      Finally, the reviewer discusses at length recent findings regarding heterosynaptic plasticity and questions whether the new insights revealed by BoNT-C provides a sufficient advance. In particular, the reviewer refers to previous work published in 2020 and 2021, where important initial insights into Is vs Ib structure and transmission after differential manipulations to either input was reported. The reviewer appears to believe that it was settled in these studies that no heterosynaptic functional plasticity was induced.

      However, a critical point that the reviewer appears not to appreciate is that while the two previous studies on heterosynaptic plasticity at the Drosophila NMJ were able to assess structural plasticity (AponteSantiago et al., 2020; Wang et al., 2021), no accurate or quantitative conclusions can be made about heterosynaptic functional plasticity from these studies. This is due to the authors not knowing what baseline synaptic function is at Is vs Ib (miniature frequency, miniature amplitude, and evoked transmission), so that in their manipulations they cannot accurately determine whether any functional changes are observed after their manipulations. Further complicating the interpretation of the previous studies is that at the muscle 1 NMJ (2020 study), like the muscle 4 NMJ (2021 study), ~30% of these NMJs fail to be innervated by a Is input in wild-type larvae. This major confound makes it difficult to know how or whether adaptive plasticity is induced in wild-type NMJs with or without Is innervation (since, interestingly, evoked transmission does not appear to change in wild-type m1 or m5 NMJs with or without a Is input), and then to determine whether any heterosynaptic plasticity is induced. Indeed, we have also struggled with how to accurately determine whether synaptic function changes compared to baseline throughout our studies at earlier stages, despite the fact that the muscle 6/7 NMJ we use in our study does not suffer from the variable Is innervation confounds observed at muscle 4 (Wang et al., 2021) and muscle 1 (Aponte-Santiago et al., 2020).

      Respectfully, we contend that the only way one can accurately and quantitatively determine baseline synaptic transmission (miniature amplitude, frequency, evoked, quantal content), and whether any changes are observed following manipulations to Is or Ib, is to fully and accurately recapitulate wild type (blended Is+Ib) neurotransmission from isolated Is vs Ib transmission. This is why we believe the data shown in Fig. 7 (and also Fig. S7 in the revised manuscript) is so important. It is true that numerous previous studies established relative and qualitative differences between Is vs Ib (miniature events are larger at Is relative to Ib, Is drives larger depolarization in response to single synaptic stimulation over Ib, etc). However, in no case did previous studies accurately assess baseline Is vs Ib synaptic function from entire inputs, and therefore could not conclude with certainty whether heterosynaptic functional plasticity was induced.

      On a different but somewhat similar topic, UAS-BoNT-C is not a new tool. I am a bit put off by the wording ' We have developed a botulinum neurotoxin, BoNT-C...'. More on this and the way the previous BoNT-C paper (Backhaus et al., 2016) is cited in the detail comments below in the recommendations for the authors.

      We understand these points raised by the reviewer. Our BoNT-C transgenic line is indeed a new tool, the only one in which synaptic transmission has ever been electrophysiologically characterized and shown to completely silence synaptic transmission in Drosophila. That being said, in retrospect, we can appreciate that the term “developed” might imply a level of innovation that reasonable people can disagree about. We have therefore elected to change the apparently offensive wording to “We have employed a botulinum neurotoxin, BoNT-C…” in the abstract of the revised manuscript.

      Additionally, the manuscript does not really dive into an analysis of phasic versus tonic functions (that's just a correlation with the Is and Ib dominant modes of function).

      We absolutely agree that selective silencing by BoNT-C now enables a rigorous study of tonic vs phasic neurotransmission at MN-Is vs MN-Ib NMJs, but that in the current manuscript we have not focused on this interesting question. We have adopted the convention the field has used to classify MN-Is and MN-Ib subtypes based on their apparent firing modes as “phasic” vs “tonic”, but like previous studies, we have not analyzed these functional distinctions on a deeper level. Although the focus of the current manuscript was to establish the properties of BoNT-C and highlight its utility as a tool for the field, we are now in the process of preparing an entirely new manuscript focused on just this reviewer’s question about the differences in tonic vs phasic synaptic physiology. This eight-figure manuscript will be entitled “Electrophysiological properties and nanoscale distinctions that define tonic vs phasic glutamatergic synapses” and is focused on the central question raised by the reviewer - how and why synaptic transmission differs between tonic vs phasic inputs. While this interesting question is outside the scope of the current manuscript, we will submit this new manuscript within the next few months, which is based on new experimental insights now enabled by selective BoNT-C silencing established in the current manuscript.

      Finally, since the authors show that loss of Is or Ib function does not cause any change in the other, we are left to wonder what actually DOES cause heterosynaptic plasticity. TNT or rpr DO cause some heterosynaptic plasticity and they also DO cause some structural changes - but whether the structural changes themselves are important here remains unclear. Substantial progress would have been to take the starting point that BoNT-C does not cause heterosynaptic plasticity, and then identify the signal that does (is it morphology? or signaling between Is and Ib? Or with the muscle?).

      We certainly agree with the reviewer that understanding how heterosynaptic plasticity is induced is an important question and worthy of additional investigation. As stated above, the focus of our current study was to establish the tool, BoNT-C, that will now enable a variety of fascinating and important future studies, both at understanding how and why synaptic strength differs between tonic vs phasic synapses and also how heterosynaptic plasticity signaling occurs at the NMJ. It required substantial time and experimental effort to establish that BoNT-C works to cleanly silence transmission without inducing structural and functional plasticity in the current manuscript (Figures 1-7 and several supplemental figures). Respectfully, we believe it is unreasonable to expect all of this data to be relegated to a “starting point” to then go on and probe heterosynaptic plasticity in more detail, all compressed into a single paper.

      It appears this reviewer is particularly interested in heterosynaptic plasticity, which we agree is a fascinating topic. First, we should clarify that in our experiments, TNT expression does NOT induce any heterosynaptic structural or functional plasticity (see Figures 6 and Table S2), at least in our studies at m6/7, m12/13, and m4 NMJs. Rather, TNT expression alters synaptic structure in the neuron in which it is expressed (“intrinsic structural plasticity”, Fig. 6), but does not induce any changes to the convergent input. Hence, the only evidence for actual heterosynaptic plasticity is the rather minor adaptations in synaptic structure and function observed following ablation of Is motor inputs (Fig. 6 and 8).

      In addition to the important insights revealed by BoNT-C in accurately distinguishing tonic vs phasic transmission outlined above, it appears that the reviewer does not fully appreciate the mechanistic constraints that the new BoNT-C tool reveals about heterosynaptic signaling. We would therefore like to highlight the key insights our study has revealed specifically about heterosynaptic plasticity. First, we show that at the muscle 6/7 NMJ, loss of MN-Ib completely eliminates Is innervation – this was not the finding reported in the 2020 study (Ib ablation was not reported in the 2021 study). Rather, AponteSantiago et al. 2020 reported that elimination of Ib did not trigger compensatory changes in active zone or bouton numbers of the Is input, no were compensatory increases in the Is EPSP reported. This may be due to the confounding variable Is innervation at the muscle 1 and muscle 4 NMJs used in the previous studies. Second, to what extent miniature transmission changes after manipulating activity from Is vs Ib could not be accurately assessed in previous studies because spontaneous activity persists following TNT expression as does innervation following rpr.hid expression. Third, and perhaps most important, our study is the only one that can demonstrate no heterosynaptic functional plasticity is induced by the physical presence but functional silencing of neurotransmitter release between tonic vs phasic inputs at NMJs with consistent innervation by both Is and Ib inputs.

      It is clear to us now that we did not do a sufficient job of emphasizing these advances our study has now revealed about the baseline and heterosynaptic relationships between Is vs Ib. We have added additional details throughout the revised manuscript to ensure these insights are highlighted in an effort for the reader to better appreciate the importance of this study.

      Overall, while an initial reading of the manuscript sounded rather exciting, a deeper analysis of the work in context of the literature of the last few years diminishes my enthusiasm for the novelty and progress provided.

      We have responded to the major criticisms raised by this reviewer above and hope that he/she/they can more fully appreciate the importance of the new tool we developed, the impact it will have on the field in opening new studies on tonic vs. phasic transmission, and establishing the rules of heterosynaptic plasticity between convergent tonic and phasic inputs on common targets.

    1. Author Response

      Reviewer #1 (Public Review):

      It should also be noted that their immunohistochemical studies of human fetal tissue for TBX5 and PTK7 are not convincing. There appears to be widespread staining of multiple cell types, suggesting either very broad expression of both genes or poor specificity of the primary antibodies.

      We appreciate the reviewer’s comment that the immunohistochemistry staining does not provide definitive evidence for the functional importance of TBX5 and PTK7 in PUV, however these images do confirm that the proteins are ‘in the right place at the right time’ during normal human urinary tract development. We have updated the discussion on page 19, line 441-445 to emphasise this. To further support a putative role for these proteins in urinary tract development we have added additional images from a second human embryo at the same gestation which confirms these distinct patterns of staining (Figure 8 – figure supplement 1 on page 14, lines 313-317). Even if these proteins can also be detected in other tissues or cell types, this does not detract from this idea, as in other locations the proteins may have redundant or different roles. 

      PUVs have not been described as a clinical manifestation of disease associated with mutations of either gene in humans.

      The reviewer is correct that rare variants affecting TBX5 and PTK7 have not previously been associated with PUV. They have however been associated with other developmental anomalies (as stated in the discussion on page 18, line 408-411 and page 19, line 434-437) confirming a clear role in embryonic development for both these genes.

      The fact that rare variant association testing did not identify an increased burden of rare, likely deleterious variants in these two genes (although with limited power in this cohort) suggests that PUV is not driven by ultra-rare, highly penetrant alleles in these genes. However, the identification of common and low-frequency variants using GWAS suggests a complex mode of inheritance for PUV likely in combination with maternal_/in utero_ factors. As with other complex traits, these signals provide potential insights into the underlying biology of this disease as opposed to the diagnostic implications of conventional monogenic gene discovery associated with purely Mendelian conditions. A paragraph on the Mendelian/complex trait implications of the findings of the study has been incorporated into the discussion (page 21-22, line 594-502).       

      Discuss how variants in either gene or in the patterns of structural variants that they found associated with PUV intersect with sex to result in this exclusively male condition.

      The fact that PUV is a uniquely male disease is most likely the result of differences in urethra and bladder development and length differences in urethra between males and females. Sex hormones may also potentially result in tissue-specific differences in gene expression (Ober, Loisel, and Gilad 2008). We have added a paragraph into the discussion to clarify this (page 20, line 454-463) as well as clarified the results of the chromosome X and sex-specific analyses (page 7, lines 149-155; see also Reviewer 2, point 5 below) as suggested. 

      Reviewer #2 (Public Review):

      Major:

      1. The replication study is problematic given that different genotyping methods are used for cases (targeted KASP) versus controls (WGS). This may introduce differential bias. Moreover, the ancestry of the control cohort (UK-based) does not seem to be well matched to the cases (predominantly German and Polish), and the lack of genome-wide data for the cases precludes proper adjustment for population stratification. The case-control design is also imbalanced in the replication study. The authors should reconsider their replication strategy to include a more balanced cohort with ancestry-matched controls and uniform genotyping. As an alternative, genome-wide genotyping of the replication case cohort would significantly enhance the study and should be considered.

      Many thanks to the reviewer for their valuable comments regarding the replication study case-control cohort. While different sequencing technologies were used to compare allele counts at the lead variants in the replication study (KASP genotyping for cases vs WGS for controls), both techniques exhibit > 99.5% accuracy and are subjected to variant level quality control metrics. Only individuals with reliably called genotypes were included in the replication analysis. This has been clarified in the methods section (page 30, line 693).

      We were able to obtain genome-wide genotyping data for 204 of the 395 European cases in the replication cohort. While (despite sustained effort on our part) we were unable to analyze this data jointly with the control cohort in the 100KGP due to enforced limitations on data sharing, we were able to demonstrate similar ancestry of the replication study cases and controls:  we performed PCA on a set of ~80,000 overlapping autosomal, high-quality, LD-pruned variants with MAF > 10% and projected the cases and controls separately onto (the same) data from the 1000 Genomes Project (Phase 3) labelled by ‘population’ (Figure 5). This clearly demonstrates that both cohorts have homogeneous European ancestry, as stated now in the results (page 8, lines 166-168).

      We note with thanks the reviewer’s comments regarding the case-control imbalance in the replication study which can sometimes result in a type 1 error. To address this, the case control ratio was reduced from 1:27 to 1:10.5 by including only the 4,151 male controls from the cancer cohort of the 100KGP. The results remained significant for both lead variants and have been updated in the manuscript (page 8, line 162-176; Table 2).

      When the number of controls was reduced to 500 males (a case:control ratio of 1:1.3), rs10774740 (TBX5 locus) remained significant demonstrating that case-control imbalance was not driving the observed signal (P\=9.9x10-3; OR 0.77; 95% CI 0.63-0.94). rs144171242 (PTK7 locus) however did not reach significance due to insufficient power (P\=0.06; OR 2.24; 95% CI 0.93-5.36). For a rare variant such as rs144171242 (MAF ~ 1%), a replication study with 500 controls is only powered to detect association with large effect size (OR > 3.5). A case:control ratio of ~1:10 is therefore optimal to maximize power to detect association, while minimizing unnecessary noise from excess controls. This has been added to the results section of the manuscript (page 8-9, lines 178-184).

      2. I am reassured that the TBX5 signal remains genome-wide significant in European-only analysis. However, the signal at PTK7 appears much less robust - it has borderline statistical significance (especially given that the authors test for all rare and common variants across the genome) and is represented by a single variant with a relatively rare risk allele that is differentially distributed by ancestry. Therefore, I would like to see more information for this specific signal:

      Information on the depth of coverage and the quality of the top variant

      This has been incorporated into the manuscript for both lead variants (Page 7, lines 142-145). For rs144171242 at the PTK7 locus, the meanDP was 29.34 and the meanGQ was 75.59.

      Information if the top PTK7 variant remain genome-wide significant after application of genomic control. Of note, the calculation of genomic inflation is dependent on sample size - lambda of 1.05 may represent an underestimate given low power of the cohort, and this point deserves at least a comment. Some methods correcting lambda for sample size have been proposed, and the authors should consider applying these methods.

      We appreciate the reviewer’s comments that the value of lambda may be affected by sample size and have added a comment to this in the manuscript (Page 7, line 136-137). Despite extensive searching, we were unable to find a recent published example of how to correct lambda for sample size and would be grateful if the reviewer could suggest a reference for this.

      To answer the reviewer’s specific question, application of genomic control to the lead variant at PTK7 results in P\=4.37x10-8 which remains below the threshold for conventional genome-wide significance. However, while the genomic inflation factor provides a reasonable indication of possible confounding by population structure, there are recognized limitations to applying it as a corrective factor as it assumes that all variants are confounded i.e., the same correction is applied irrespective of differences in population allele frequency which can be insufficient for some variants and lead to a loss of power in others. Furthermore, in addition to sample size, lambda can vary with heritability and disease prevalence (Yang et al. 2011) and its use for correction can therefore be too conservative and reduce power to detect significant associations. In this manuscript we therefore chose to use the mixed model approach (as part of SAIGE – detailed in the methods on page 28, lines 647-648), which has largely superseded older methods such as genomic control, to robustly correct for both population structure and cryptic relatedness and minimize false positive associations (Shin and Lee 2015).

      This locus requires more robust replication as discussed above. If more robust replication study is not possible, additional functional studies could provide more evidence in support of this locus.

      Please refer to point 1 regarding the revised and more robust evidence of replication. 

      3. There is no validation of sensitivity and specificity of SV detection by variant size or type (e.g. inversions, deletions, duplications). Also, since burden differences are not replicated independently, the authors should stress the exploratory nature of these analyses.

      We appreciate the reviewer’s comment that there is no independent validation of SV detection (e.g., by microarray or long-read sequencing) and this was reported as a limitation of our study in the discussion (page 22-23, line 520-524). However, one of the main strengths of this study is the use of clinical-grade WGS data where all samples have been sequenced on the same platform and undergone variant calling using the same bioinformatics pipeline. This essentially eliminates confounding due to differences in data generation and processing and the sensitivity and specificity of SV detection will therefore be the same for both cases and controls.

      We agree with the reviewer that the SV analyses have not yet been replicated independently and, as they suggest, have stressed the exploratory nature of the findings in the discussion (page 21, line 491-493).

      In the discussion (especially second paragraph, but also throughout), the authors overemphasize multi-ancestry nature of their study. The reality is that the included non-Europeans are very small in numbers (18 SAS cases, 11 AFR cases, and 14 admixed cases). I would suggest for the authors to specifically state these case counts and make it clear that expanded efforts to recruit non-Europeans are still needed given these very low numbers.

      We appreciate the reviewer’s comment about the overemphasis on the multi-ancestry nature of the study and the small absolute numbers of individuals included, however as a proportion of the cohort, a third of the cases are non-European: 14% are of South Asian ancestry, 8% are of African ancestry and 11% are admixed. This breakdown comprises a greater proportion of non-white European ancestry individuals than the UK as a whole (DOI: 10.5257/census/aggregate-2001-2), where the discovery cohort was based. This provides evidence that our study eliminates at least some of the Euro-centric bias present in existing genetic and genomic literature, at least as far as the UK population is concerned. Clearly, global studies fairly representing all populations would be needed to address this issue perfectly. The case counts were reported in Table 1 but we have now referenced the low absolute numbers and included the reviewer’s suggestion about expanding efforts to recruit non-European populations in the main text (page 22, line 518-520). We have also edited paragraph two of the discussion in response to the reviewer’s comments (page 17, line 387-398).   

      Supplemental figure 2 -provide case-control counts in each ancestral group (Y axis).

      These have been added to the figure legend of Figure 6 – supplemental figure 4 (previously Figure 5 - supplemental figure 2).

      Supplemental figure 3 is misleading since allelic frequencies in the cases are pooled and are not available individually for all depicted populations.

      Figure 5 - supplemental figure 3 has been removed and replaced by Figure 6 – supplemental figure 3 to show only the individual case, control and gnomAD AF by ancestry for AFR, SAS and EUR population groups instead of using the pooled allele frequencies.

      5. I did not see details of chr. X analysis. This is important given that the case group involves only Males and control group involves both Males and Females. Also, please explain how sex was used as a fixed effect (as stated in the methods) given that the case cohort is 100% male.

      We thank the reviewer for their insightful comments. Sex was used as a covariate (or fixed effect) to control for the anatomical differences in development of the urethra (and in utero hormonal changes) between the sexes in the control cohort (clarified in the methods, page 28, lines 651-653). Given the PheWAS findings (page 13, line 292-297) reveal an association between the lead variant near TBX5 and female genital prolapse and urinary incontinence, this suggests that while women do not develop PUV (due to differences in urethral development) they may manifest other lower urinary tract phenotypes. In theory, removing the female individuals from the control cohort should therefore strengthen the association as the signal would not be diluted by ‘affected’ women (i.e., those with potentially unknown lower urinary tract phenotypes). We tested this by performing a sex-specific male-only GWAS and found that the strength of association at both lead variants increased. The results of this have been added to the manuscript (page 7, line 149-155).

      The results of the chromosome X rare variant analysis are shown on the Manhattan plot (Figure 9), with no significant genes identified. We have added chromosome X to the mixed-ancestry and European GWAS as suggested (with no significant results) and the Manhattan and Q-Q plots have been updated in Figure 2 and Figure 6. The number of analyzed variants in each analysis has also been updated accordingly.

    1. Author Response

      Reviewer #2 (Public Review):

      Feeding behaviour in C. elegans has been extensively studied over decades. Several methods  of measuring feeding exist, but none can directly measure both pumping and locomotion  behaviour in freely-moving worm populations. The authors have developed a new  imaging-based method for automated detection of pharyngeal pumping events in freely moving

      C. elegans populations, and can thus simultaneously measure pumping and locomotion  behaviour in tens of worms, at a single-worm, single-pump resolution that is not possible with  previous methods. This user-friendly method can be applied to several research directions, such  as large-scale foraging, behavioural coordination, and high throughput screening.

      The authors designed their new method to be broadly applicable and user-friendly, for easy  adaptation in other research labs. However, adding direct evidence to show that "the method is  relatively insensitive to the optical instrument used" will better support this claim of wider  application.

      We appreciate the reviewer’s suggestion to show evidence that our method will also work on  data acquired on different microscopes. We now present data obtained on a second  epi-fluorescent microscope, which was downscaled and analyzed in Fig. 1H-J.

      The authors carefully benchmarked their new method against expert annotations and existing  results from previous methods, to both validate their method and reveal additional advantages.  They also assessed potential pitfalls of the method such as by examining the effect of  fluorescence imaging on the behavioural outcome, albeit only at the timescale of minutes. The  effect of longer-term fluorescence imaging should be further explored, which is relevant for  large-scale foraging experiments that the authors discussed. It could be helpful to determine the  maximum total exposure for the method to still be valid, both in terms of pump detection (which  could be sensitive to photobleaching) and behavioural modulation (which could be sensitive to  higher phototoxicity).

      We thank the reviewer for this comment. In response to their comment and related comments  by the other reviewers, we have provided bleaching curves and evidence of long-term imaging  to show the potential of the methods for longer scale assays. We found that with our illumination  intensity (see methods), bleaching was significant at a time scale of ~1h. We then added  triggered illumination and could extend the recording time to ~5 h (Methods). Additionally, we  perform a supplementary control for viability of worms exposed to continuous light (not  triggered) for 5 hrs. We do not observe any apparent phototoxic effect.

      Overall, the manuscript is well-written and the results are clearly presented both in terms of  statistics and interpretation. Methodological details are well-documented and openly accessible.

      We thank the reviewer for their positive view of our work and their appreciation for our efforts to  document both data and software.

      Reviewer #3 (Public Review):

      In this manuscript, the authors present a method for simultaneous assessment of pharyngeal  pumping (feeding) and locomotion in many C. elegans simultaneously. In this technique,  imaging of the fluorescent labeled pharynx provides a measure of velocity and pumping rate,  through analysis of the spatial variations in fluorescence.

      The technique is clearly described, well-validated, and yields some novel results. It has the  advantage that it can be performed using microscopes found in many C. elegans laboratories.

      We appreciate that the reviewer recognizes the wide applicability of the method across many C.  elegans  laboratories.

      Some limitations of the method include its reliance on fluorescence imaging, which is a  hindrance to genetic analysis, computational intensiveness, and phototoxic effects of  fluorescence excitation that are not fully explored in the manuscript.

      The authors show the utility of their method by assessing pharyngeal pumping and motor  behavior (1) during development, (2) in the presence or absence of food, and (3) in the  presence of two mutations affecting feeding.

      Although I understand these are proof-of-principle demonstrations, I still came away feeling  underwhelmed by these examples. I did not see any results here that could not have been  obtained fairly easily with conventional techniques.

      We appreciate the constructive criticism of the reviewer and highlight in the revised version the  fact that using conventional techniques such studies would require tens of hours of experiment  time. We would like to emphasize the comparisons in Table 1 where we show other methods  and their current limitations. Obtaining a dataset such as in Figure 3 which comprises a total of  34 worm-hours of pumping observation from unrestrained animals is to our knowledge currently  impractical with competing methods. We would like to remind the reviewer that, using our  method we were able to reveal bimodal distributions within a population as illustrated, for  instance, in Fig. 3F, 4B, and 4F. These observations are not possible when the single worm  resolution is not accessible or when large statistics are not feasible as it happens with previous  methods.

      Given these limitations, I feel the method's eventual impact in the field will be relatively small.

      In this study, we present a method allowing performing behavioral studies on worm populations  at high throughput and reduced costs. Such a technique opens the door to many laboratories  that can not do EPG recordings or microfluidics due to the technical difficulties, or that want to  study animals in their normal plate context. We also would like to emphasize that there are already more than 1500 strains containing myo-2  promoter transgene available on CGC, which  would be amenable to our imaging approach. These transgenic strains form broad classes of  interest, such as thermotolerance, ER stress resistance, aging and neural-circuit specific genes.

      Pharyngeal pumping has also been used as a read-out for pharmacological screens, for  example, bacteria pre-loaded with pharmacological agents are tested for their effect on  pharyngeal pumping rate. Pharaglow offers a high-throughput and sensitive method to measure  the pumping rate. This will benefit the field who use C. elegans  pumping for pharmacological  screens, and pave the way for the researchers who plan to use but are hindered by existing  techniques.

    1. Author Response

      Reviewer #1 (Public Review):

      The authors evaluate the involvement of the hippocampus in a fast-paced time-to-contact estimation task. They find that the hippocampus is sensitive to feedback received about accuracy on each trial and has activity that tracks behavioral improvement from trial to trial. Its activity is also related to a tendency for time estimation behavior to regress to the mean. This is a novel paradigm to explore hippocampal activity and the results are thus novel and important, but the framing as well as discussion about the meaning of the findings obscures the details of the results or stretches beyond them in many places, as detailed below.

      We thank the reviewer for their constructive feedback and were happy to read that s/he considered our approach and results as novel and important. The comments led us to conduct new fMRI analyses, to clarify various unclear phrasings regarding our methods, and to carefully assess our framing of the interpretation and scope of our results. Please find our responses to the individual points below.

      1) Some of the results appear in the posterior hippocampus and others in the anteriorhippocampus. The authors do not motivate predictions for anterior vs. posterior hippocampus, and they do not discuss differences found between these areas in the Discussion. The hippocampus is treated as a unitary structure carrying out learning and updating in this task, but the distinct areas involved motivate a more nuanced picture that acknowledges that the same populations of cells may not be carrying out the various discussed functions.

      We thank the reviewer for pointing this out. We split the hippocampus into anterior and posterior sections because prior work suggested a different whole-brain connectivity and function of the two. This was mentioned in the methods section (page 15) in the initial submission but unfortunately not in the main text. Moreover, when discussing the results, we did indeed refer mostly to the hippocampus as a unitary structure for simplicity and readability, and because statements about subcomponents are true for the whole. However, we agree with the reviewer that the differences between anterior and posterior sections are very interesting, and that describing these effects in more detail might help to guide future work more precisely.

      In response to the reviewer's comment, we therefore clarified at various locations throughout the manuscript whether the respective results were observed in the posterior or anterior section of the hippocampus, and we extended our discussion to reflect the idea that different functions may be carried out by distinct populations of hippocampal cells. In addition, we also now motivate the split into the different sections better in the main text. We made the following changes.

      Page 3: “Second, we demonstrate that anterior hippocampal fMRI activity and functional connectivity tracks the behavioral feedback participants received in each trial, revealing a link between hippocampal processing and timing-task performance.

      Page 3: “Fourth, we show that these updating signals in the posterior hippocampus were independent of the specific interval that was tested and activity in the anterior hippocampus reflected the magnitude of the behavioral regression effect in each trial.”

      Page 5: “We performed both whole-brain voxel-wise analyses as well as regions-of-interest (ROI) analysis for anterior and posterior hippocampus separately, for which prior work suggested functional differences with respect to their contributions to memory-guided behavior (Poppenk et al., 2013, Strange et al. 2014).”

      Page 9: “Because anterior and posterior sections of the hippocampus differ in whole-brain connectivity as well as in their contributions to memory-guided behavior (Strange et al. 2014), we analyzed the two sections separately. “

      Page 9: “We found that anterior hippocampal activity as well as functional connectivity reflected the feedback participants received during this task, and its activity followed the performance improvements in a temporal-context-dependent manner. Its activity reflected trial-wise behavioral biases towards the mean of the sampled intervals, and activity in the posterior hippocampus signaled sensorimotor updating independent of the specific intervals tested.”

      Page 10: “Intriguingly, the mechanisms at play may build on similar temporal coding principles as those discussed for motor timing (Yin & Troger, 2011; Eichenbaum, 2014; Howard, 2017; Palombo & Verfaellie, 2017; Nobre & van Ede, 2018; Paton & Buonomano, 2018; Bellmund et al., 2020, 2021; Shikano et al., 2021; Shimbo et al., 2021), with differential contributions of the anterior and posterior hippocampus. Note that our observation of distinct activity modulations in the anterior and posterior hippocampus suggests that the functions and coding principles discussed here may be mediated by at least partially distinct populations of hippocampal cells.”

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence [...]

      2) Hippocampal activity is stronger for smaller errors, which makes the interpretationmore complex than the authors acknowledge. If the hippocampus is updating sensorimotor representations, why would its activity be lower when more updating is needed?

      Indeed, we found that absolute (univariate) activity of the hippocampus scaled with feedback valence, the inverse of error (Fig. 2A). We see multiple possibilities for why this might be the case, and we discussed some of them in a dedicated discussion section (“The role of feedback in timed motor actions”). For example, prior work showed that hippocampal activity reflects behavioral feedback also in other tasks, which has been linked to learning (e.g. Schönberg et al., 2007; Cohen & Ranganath, 2007; Shohamy & Wagner, 2008; Foerde & Shohamy, 2011; Wimmer et al., 2012). In our understanding, sensorimotor updating is a form of ‘learning’ in an immediate and behaviorally adaptive manner, and we therefore consider our results well consistent with this earlier work. We agree with the reviewer that in principle activity should be stronger if there was stronger sensorimotor updating, but we acknowledge that this intuition builds on an assumption about the relationship between hippocampal neural activity and the BOLD signal, which is not entirely clear. For example, prior work revealed spatially informative negative BOLD responses in the hippocampus as a function of visual stimulation (e.g. Szinte & Knapen 2020), and the effects of inhibitory activity - a leading motif in the hippocampal circuitry - on fMRI data are not fully understood. This raises the possibility that the feedback modulation we observed might also involve negative BOLD responses, which would then translate to the observed negative correlation between feedback valence and the hippocampal fMRI signal, even if the magnitude of the underlying updating mechanism was positively correlated with error. This complicates the interpretation of the direction of the effect, which is why we chose to avoid making strong conclusions about it in our manuscript. Instead, we tried discussing our results in a way that was agnostic to the direction of the feedback modulation. Importantly, hippocampal connectivity with other regions did scale positively with error (Fig. 2B), which we again discussed in the dedicated discussion section.

      In response to the reviewer’s comment, we revisited this section of our manuscript and felt the latter result deserved a better discussion. We therefore took this opportunity to extend our discussion of the connectivity results (including their relationship to the univariate-activity results as well as the direction of these effects), all while still avoiding strong conclusions about directionality. Following changes were made to the manuscript.

      Page 11: Interestingly, we observed that functional connectivity of the anterior hippocampus scaled negatively (Fig. 2C) with feedback valence, unlike its absolute activity, which scaled positively with feedback valence (Fig. 2A,B), suggesting that the two measures may be sensitive to related but distinct processes.

      Page 11: Such network-wide receptive-field re-scaling likely builds on a re-weighting of functional connections between neurons and regions, which may explain why anterior hippocampal connectivity correlated negatively with feedback valence in our data. Larger errors may have led to stronger re-scaling, which may be grounded in a corresponding change in functional connectivity.

      3) Some tests were one-tailed without justification, which reduces confidence in the robustness of the results.

      We thank the reviewer for pointing us to the fact that our choice of statistical tests was not always clear in the manuscript. In the analysis the reviewer is referring to, we predicted that stronger sensorimotor updating should lead to stronger activity as well as larger behavioral improvements across the respective trials. This is because a stronger update should translate to a more accurate “internal model” of the task and therefore to a better performance. We tested this one-sided hypothesis using the appropriate test statistic (contrasting trials in which behavioral performance did improve versus trials in which it did not improve), but we did not motivate our reasoning well enough in the manuscript. The revised manuscript therefore includes the two new statements shown below to motivate our choice of test statistic more clearly.

      Page 7: [...] we contrasted trials in which participants had improved versus the ones in which they had not improved or got worse (see methods for details). Because stronger sensorimotor updating should lead to larger performance improvements, we predicted to find stronger activity for improvements vs. no improvements in these tests (one-tailed hypothesis).

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively. Because we predicted to find stronger activity for improvements vs. no improvements in behavioral performance, we here performed one-tailed statistical tests, consistent with the direction of this hypothesis. Improvement in performance was defined as receiving feedback of higher valence than in the corresponding previous trial.

      4) The introduction motivates the novelty of this study based on the idea that thehippocampus has traditionally been thought to be involved in memory at the scale of days and weeks. However, as is partially acknowledged later in the Discussion, there is an enormous literature on hippocampal involvement in memory at a much shorter timescale (on the order of seconds). The novelty of this study is not in the timescale as much as in the sensorimotor nature of the task.

      We thank the reviewer for this helpful suggestion. We agree that a key part of the novelty of this study is the use of the task that is typically used to study sensorimotor integration and timing rather than hippocampal processing, along with the new insights this task enabled about the role of the hippocampus in sensorimotor updating. As mentioned in the discussion, we also agree with the reviewer that there is prior literature linking hippocampal activity to mnemonic processing on short time scales. We therefore rephrased the corresponding section in the introduction to put more weight on the sensorimotor nature of our task instead of the time scales.

      Note that the new statement still includes the time scale of the effects, but that it is less at the center of the argument anymore. We chose to keep it in because we do think that the majority of studies on hippocampal-dependent memory functions focus on longer time scales than our study does, and we expect that many readers will be surprised about the immediacy of how hippocampal activity relates to ongoing behavioral performance (on ultrashort time scales).

      We changed the introduction to the following.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions. Importantly, the hippocampus is not traditionally thought to support sensorimotor functions, and its contributions to memory formation are typically discussed for longer time scales (hours, days, weeks). Here, however, we characterize in detail the relationship between hippocampal activity and real-time behavioral performance in a fast-paced timing task, which is traditionally believed to be hippocampal-independent. We propose that the capacity of the hippocampus to encode statistical regularities of our environment (Doeller et al. 2005, Shapiro et al. 2017, Behrens et al., 2018; Momennejad, 2020; Whittington et al., 2020) situates it at the core of a brain-wide network balancing specificity vs. regularization in real time as the relevant behavior is performed.

      5) The authors used three different regressors for the three feedback levels, asopposed to a parametric regressor indexing the level of feedback. The predictions are parametric, so a parametric regressor would be a better match, and would allow for the use of all the medium-accuracy data.

      The reviewer raises a good point that overlaps with question 3 by reviewer 2. In the current analysis, we model the three feedback levels with three independent regressors (high, medium, low accuracy). We then contrast high vs. low accuracy feedback, obtaining the results shown in Fig. 2AB. The beta estimates obtained for medium-accuracy feedback are being ignored in this contrast. Following the reviewer’s feedback, we therefore re-run the model, this time modeling all three feedback levels in one parametric regressor. All other regressors in the model stayed the same. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor.

      The results we observed were highly consistent across the two analyses, and all conclusions presented in the initial manuscript remain unchanged. While the exact t-scores differ slightly, we replicated the effects for all clusters on the voxel-wise map (on whole-brain FWE-corrected levels) as well as for the regions-of-interest analysis for anterior and posterior hippocampus. These results are presented in a new Supplementary Figure 3C.

      Note that the new Supplementary Figure 3B shows another related new analyses we conducted in response to question 4 of reviewer 2. Here, we re-ran the initial analysis with three feedback regressors, but without modeling the inter-trial interval (ITI) and the inter-session interval (ISI, i.e. the breaks participants took) to avoid model over-specification. Again, we replicated the results for all clusters and the ROI analysis, showing that the initial results we presented are robust.

      The following additions were made to the manuscript.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9x10-4, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressors (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using the beta estimates obtained for the parametric feedback regressor (Fig. 2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      6) The authors claim that the results support the idea that the hippocampus is findingan "optimal trade-off between specificity and regularization". This seems overly speculative given the results presented.

      We understand the reviewer's skepticism about this statement and agree that the manuscript does not show that the hippocampus is finding the trade-off between specificity and regularization. However, this is also not exactly what the manuscript claims. Instead, it suggests that the hippocampus “may contribute” to solving this trade-off (page 3) as part of a “brain-wide network“ (pages 2,3,9,12). We also state that “Our [...] results suggest that this trade-off [...] is governed by many regions, updating different types of task information in parallel” (Page 11). To us, these phrasings are not equivalent, because we do not think that the role of the hippocampus in sensorimotor updating (or in any process really) can be understood independently from the rest of the brain. We do however think that our results are in line with the idea that the hippocampus contributes to solving this trade-off, and that this is exciting and surprising given the sensorimotor nature of our task, the ultrashort time scale of the underlying process, and the relationship to behavioral performance. We tried expressing that some of the points discussed remain speculation, but it seems that we were not always successful in doing so in the initial submission. We apologize for the misunderstanding, adapted corresponding statements in the manuscript, and we express even more carefully that these ideas are speculation.

      Following changes were made to the introduction and discussion.

      Page 2: Here, we approach this question with a new perspective by converging two parallel lines of research centered on sensorimotor timing and hippocampal-dependent cognitive mapping. Specifically, we test how the human hippocampus, an area often implicated in episodic-memory formation (Schiller et al., 2015; Eichenbaum, 2017), may support the flexible updating of sensorimotor representations in real time and in concert with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      Page 12: This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 13: This is in line with the notion that the hippocampus [...] supports finding an optimal trade off between specificity and regularization along with other regions. [...] Our results show that the hippocampus supports rapid and feedback-dependent updating of sensorimotor representations, suggesting that it is a central component of a brain-wide network balancing task specificity vs. regularization for flexible behavior in humans.

      Note that in response to comment 1 by reviewer 2, the revised manuscript now reports the results of additional behavioral analyses that support the notion that participants find an optimal trade-off between specificity and regularization over time (independent of whether the hippocampus was involved or not).

      7) The authors find that hippocampal activity is related to behavioral improvement fromthe prior trial. This seems to be a simple learning effect (participants can learn plenty about this task from a prior trial that does not have the exact same timing as the current trial) but is interpreted as sensitivity to temporal context. The temporal context framing seems too far removed from the analyses performed.

      We agree with the reviewer that our observation that hippocampal activity reflects TTC-independent behavioral improvements across trials could have multiple explanations. Critically, i) one of them is that the hippocampus encodes temporal context, ii) it is only one of multiple observations that we build our interpretation on, and iii) our interpretation builds on multiple earlier reports

      Interval estimates regress toward the mean of the sampled intervals, an effect that is often referred to as the “regression effect”. This effect, which we observed in our data too (Fig. 1B), has been proposed to reflect the encoding of temporal context (e.g. Jazayeri & Shadlen 2010). Moreover, there is a large body of literature on how the hippocampus may support the encoding of spatial and temporal context (e.g. see Bellmund, Polti & Doeller 2020 for review).

      Because both hippocampal activity and the regression effect were linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. If so, one would expect that hippocampal activity should reflect behavioral improvements independently from TTC, it should reflect the magnitude of the regression effect, and it should generally reflect feedback, because it is the feedback that informs the updating of the internal prior.

      All three observations may have independent explanations indeed, but they are all also in line with the idea that the hippocampus does encode temporal context and that this explains the relationship between hippocampal activity and the regression effect. It therefore reflects a sparse and reasonable explanation in our opinion, even though it necessarily remains an interpretation. Of course, we want to be clear on what our results are and what our interpretations are.

      In response to the reviewer’s comment, we therefore toned down two of the statements that mention temporal context in the manuscript, and we removed an overly speculative statement from the result section. In addition, the discussion now describes more clearly how our results are in line with this interpretation.

      Abstract: This is in line with the idea that the hippocampus supports the rapid encoding of temporal context even on short time scales in a behavior-dependent manner.

      Page 13: This is in line with the notion that the hippocampus encodes temporal context in a behavior-dependent manner, and that it supports finding an optimal trade off between specificity and regularization along with other regions.

      Page 12: Because hippocampal activity (Julian & Doeller, 2020) and the regression effect (Jazayeri & Shadlen, 2010) were previously linked to the encoding of (temporal) context, we reasoned that hippocampal activity should also be related to the regression effect directly. This may explain why hippocampal activity reflected the magnitude of the regression effect as well as behavioral improvements independently from TTC, and why it reflected feedback, which informed the updating of the internal prior.

      The following statement was removed, overlapping with comment 2 by Reviewer 3:

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      8) I am not sure the term "extraction of statistical regularities" is appropriate. The termis typically used for more complex forms of statistical relationships.

      We agree with the reviewer that this expression may be interpreted differently by different readers and are grateful to be pointed to this fact. We therefore removed it and instead added the following (hopefully less ambiguous) statement to the manuscript.

      Page 9: This study investigated how the human brain flexibly updates sensorimotor representations in a feedback-dependent manner in the service of timing behavior.

      Reviewer #2 (Public Review):

      The authors conducted a study involving functional magnetic resonance imaging and a time-to-contact estimation paradigm to investigate the contribution of the human hippocampus (HPC) to sensorimotor timing, with a particular focus on the involvement of this structure in specific vs. generalized learning. Suggestive of the former, it was found that HPC activity reflected time interval-specific improvements in performance while in support of the latter, HPC activity was also found to signal improvements in performance, which were not specific to the individual time intervals tested. Based on these findings, the authors suggest that the human HPC plays a key role in the statistical learning of temporal information as required in sensorimotor behaviour.

      By considering two established functions of the HPC (i.e., temporal memory and generalization) in the context of a domain that is not typically associated with this structure (i.e., sensorimotor timing), this study is potentially important, offering novel insight into the involvement of the HPC in everyday behaviour. There is much to like about this submission: the manuscript is clearly written and well-crafted, the paradigm and analyses are well thought out and creative, the methodology is generally sound, and the reported findings push us to consider HPC function from a fresh perspective. A relative weakness of the paper is that it is not entirely clear to what extent the data, at least as currently reported, reflects the involvement of the HPC in specific and generalized learning. Since the authors' conclusions centre around this observation, clarifying this issue is, in my opinion, of primary importance.

      We thank the reviewer for these positive and extremely helpful comments, which we will address in detail below. In response to these comments, the revised manuscript clarifies why the observed performance improvements are not at odds with the idea that an optimal trade-off between specificity and regularization is found, and how the time course of learning relates to those reported in previous literature. In addition, we conducted two new fMRI analyses, ensuring that our conclusions remain unchanged even if feedback is modeled with one parametric regressor, and if the number or nuisance regressors is reduced to control for overparameterization of the model. Please find our responses underneath each individual point below.

      1) Throughout the manuscript, the authors discuss the trade-off between specific and generalized learning, and point towards Figure S1D as evidence for this (i.e., participants with higher TTC accuracy exhibited a weaker regression effect). What appears to be slightly at odds with this, however, is the observation that the deviation from true TTC decreased with time (Fig S1F) as the regression line slope approached 0.5 (Fig S1E) - one would have perhaps expected the opposite i.e., for deviation from true TTC to increase as generalization increases. To gain further insight into this, it would be helpful to see the deviation from true TTC plotted for each of the four TTC intervals separately and as a signed percentage of the target TTC interval (i.e., (+) or (-) deviation) rather than the absolute value.

      We thank the reviewer for raising this important question and for the opportunity to elaborate on the relationship between the TTC error and the magnitude of the regression effect in behavior. Indeed, we see that the regression slopes approach 0.5 and that the TTC error decreases over the course of the experiment. We do not think that these two observations are at odds with each other for the following reasons:

      First, while the reviewer is correct in pointing out that the deviation from the TTC should increase as “generalization increases”, that is not what we found. It was not the magnitude of the regularization per se that increased over time, but the overall task performance became more optimal in the face of both objectives: specificity and generalization. This optimum is at a regression-line slope of 0.5. Generalization (or regularization how we refer to it in the present manuscript), therefore did not increase per se on group level.

      Second, the regression slopes approached 0.5 on the group-level, but the individual participants approached this level from different directions: Some of them started with a slope value close to 1 (high accuracy), whereas others started with a slope value close to 0 (near full regression to the mean). Irrespective of which slope value they started with, over time, they got closer to 0.5 (Rebuttal Figure 1A). This can also be seen in the fact that the group-level standard deviation in regression slopes becomes smaller over the course of the experiment (Rebuttal Figure 1B, SFig 1G). It is therefore not generally the case that the regression effect becomes stronger over time, but that it becomes more optimal for longer-term behavioral performance, which is then also reflected in an overall decrease in TTC error. Please see our response to the reviewer’s second comment for more discussion on this.

      Third, the development of task performance is a function of two behavioral factors: a) the accuracy and b) the precision in TTC estimation. Accuracy describes how similar the participant’s TTC estimates were to the true TTC, whereas precision describes how similar the participant’s TTC estimates were relative to each other (across trials). Our results are a reflection of the fact that participants became both more accurate over time on average, but also more precise. To demonstrate this point visually, we now plotted the Precision and the Accuracy for the 8 task segments below (Rebuttal Figure 1C, SFig 1H), showing that both measures increased as the time progressed and more trials were performed. This was the case for all target durations.

      In response to the reviewer’s comment, we clarified in the main text that these findings are not at odds with each other. Furthermore, we made clear that regularization per se did not increase over time on group level. We added additional supporting figures to the supplementary material to make this point. Note that in our view, these new analyses and changes more directly address the overall question the reviewer raised than the figure that was suggested, which is why we prioritized those in the manuscript.

      However, we appreciated the suggestion a lot and added the corresponding figure for the sake of completeness.

      Following additions were made.

      Page 5: In support of this, participants' regression slopes converged over time towards the optimal value of 0.5, i.e. the slope value between veridical performance and the grand mean (Fig. S1F; linear mixed-effects model with task segment as a predictor and participants as the error term, F(1) = 8.172, p = 0.005, ε2=0.08, CI: [0.01, 0.18]), and participants' slope values became more similar (Fig. S1G; linear regression with task segment as predictor, F(1) = 6.283, p = 0.046, ε2 = 0.43, CI: [0, 1]). Consequently, this also led to an improvement in task performance over time on group level (i.e. task accuracy and precision increased (Fig. S1I), and the relationship between accuracy and precision became stronger (Fig. S1H), linear mixed-effect model results for accuracy: F(1) = 15.127, p = 1.3x10-4, ε2=0.06, CI: [0.02, 0.11], precision: F(1) = 20.189, p = 6.1x10-5, ε2 = 0.32, CI: [0.13, 1]), accuracy-precision relationship: F(1) = 8.288, p =0.036, ε2 = 0.56, CI: [0, 1], see methods for model details).

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      Page 15: We also corroborated this effect by measuring the dispersion of slope values between participants across task segments using a linear regression model with task segment as a predictor and the standard deviation of slope values across participants as the dependent variable (Fig. S1G). As a measure of behavioral performance, we computed two variables for each target-TTC level: sensorimotor timing accuracy, defined as the absolute difference in estimated and true TTC, and sensorimotor timing precision, defined as coefficient of variation (standard deviation of estimated TTCs divided by the average estimated TTC). To study the interaction between these two variables for each target TTC over time, we first normalized accuracy by the average estimated TTC in order to make both variables comparable. We then used a linear mixed-effects model with precision as the dependent variable, task segment and normalized accuracy as predictors and target TTC as the error term. In addition, we tested whether accuracy and precision increased over the course of the experiment using separate linear mixed-effects models with task segment as predictor and participants as the error term.

      2) Generalization relies on prior experience and can be relatively slow to develop as is the case with statistical learning. In Jazayeri and Shadlen (2010), for instance, learning a prior distribution of 11-time intervals demarcated by two briefly flashed cues (compared to 4 intervals associated with 24 possible movement trajectories in the current study) required ~500 trials. I find it somewhat surprising, therefore, that the regression line slope was already relatively close to 0.5 in the very first segment of the task. To what extent did the participants have exposure to the task and the target intervals prior to entering the scanner?

      We thank the reviewer for raising the important question about the time course of learning in our task and how our results relate to prior work on this issue. Addressing the specific reviewer question first, participants practiced the task for 2-3 minutes prior to scanning. During the practice, they were not specifically instructed to perform the task as well as they could nor to encode the intervals, but rather to familiarize themselves with the general experimental setup and to ask potential questions outside the MRI machine. While they might have indeed started encoding the prior distribution of intervals during the practice already, we have no way of knowing, and we expect the contribution of this practice on the time course of learning during scanning to be negligible (for the reasons outlined above).

      However, in addition to the specific question the reviewer asked, we feel that the comment raises two more general points: 1) How long does it take to learn the prior distribution of a set of intervals as a function of the number of intervals tested, and 2) Why are the learning slopes we report quite shallow already in the beginning of the scan?

      Regarding (1), we are not aware of published reports that answer this question directly, and we expect that this will depend on the task that is used. Regarding the comparison to Jazayeri & Shadlen (2010), we believe the learning time course is difficult to compare between our study and theirs. As the reviewer mentioned, our study featured only 4 intervals compared to 11 in their work, based on which we would expect much faster learning in our task than in theirs. We did indeed sample 24 movement directions, but these were irrelevant in terms of learning the interval distribution. Moreover, unlike Jazayeri & Shadlen (2010), our task featured moving stimuli, which may have added additional sensory, motor and proprioceptive information in our study which the participants of the prior study could not rely on.

      Regarding (2), and overlapping with the reviewer’s previous comment, the average learning slope in our study is indeed close to 0.5 already in the first task segment, but we would like to highlight that this is a group-level measure. The learning slopes of some subjects were closer to 1 (i.e. the diagonal in Fig 1B), and the one of others was closer to 0 (i.e. the mean) in the beginning of the experiment. The median slope was close to 0.65. Importantly, the slopes of most participants still approached 0.5 in the course of the experiment, and so did even the group-level slope the reviewer is referring to. This also means that participants’ slopes became more similar in the course of the experiment, and they approached 0.5, which we think reflects the optimal trade-off between regressing towards the mean and regressing towards the diagonal (in the data shown in Fig. 1B). This convergence onto the optimal trade-off value can be seen in many measures, including the mean slope (Rebuttal Figure 1A, SFig 1F), the standard deviation in slopes (Rebuttal Figure 1B, SFig 1G) as well as the Precision vs. Accuracy tradeoff (Rebuttal Figure 1C, SFig 1H). We therefore think that our results are well in line with prior literature, even though a direct comparison remains difficult due to differences in the task.

      In response to the reviewer’s comment, and related to their first comment, we made the following addition to the discussion section.

      Page 12: This suggests that different regions encode distinct task regularities in parallel to form optimal sensorimotor representations to balance specificity and regularization. This is well in line with our behavioral results, showing that TTC-task performance became more optimal in the face of both of these two objectives. Over time, behavioral responses clustered more closely between the diagonal and the average line in the behavioral response profile (Fig. 1B, S1G), and the TTC error decreased over time. While different participants approached these optimal performance levels from different directions, either starting with good performance or strong regularization, the group approached overall optimal performance levels over the course of the experiment.

      3) I am curious to know whether differences between high-accuracy andmedium-accuracy feedback as well as between medium-accuracy and low-accuracy feedback predicted hippocampal activity in the first GLM analysis (middle page 5). Currently, the authors only present the findings for the contrast between high-accuracy and low-accuracy feedback. Examining all feedback levels may provide additional insight into the nature of hippocampal involvement and is perhaps more consistent with the subsequent GLM analysis (bottom page 6) in which, according to my understanding, all improvements across subsequent trials were considered (i.e., from low-accuracy to medium-accuracy; medium-accuracy to high-accuracy; as well as low-accuracy to high-accuracy).

      We thank the reviewer for this thoughtful question, which relates to questions 5 by reviewer 1. The reviewer is correct that the contrast shown in Fig 2 does not consider the medium-accuracy feedback levels, and that the model in itself is slightly different from the one used in the subsequent analysis presented in Fig. 3. To reply to this comment as well as to a related one by reviewer 1 together, we therefore repeated the full analysis while modeling the three feedback levels in one parametric regressor, which includes the medium-accuracy feedback trials, and is consistent with the analysis shown in Fig. 3. The results of this new analysis are presented in the new Supplementary Fig. 3B.

      In short, the model included one parametric regressor with three levels reflecting the three types of feedback, and all nuisance regressors remained unchanged. Instead of contrasting high vs. low accuracy feedback, we then performed voxel-wise t-tests on the beta estimates obtained for the parametric feedback regressor. We found that our results presented initially were very robust: Both the observed clusters in the voxel-wise analysis (on whole-brain FWE-corrected levels) as well as the ROI results replicated across the two analyses, and our conclusions therefore remain unchanged.

      We made multiple textual additions to the manuscript to include this new analysis, and we present the results of the analysis including a direct comparison to our initial results in the new Supplementary Fig. 3. Following textual additions were.

      Page 5: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9x10-4, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressors (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Page 17: Moreover, instead of modeling the three feedback levels with three independent regressors, we repeated the analysis modeling the three feedback levels as one parametric regressor with three levels. All other regressors remained unchanged, and the model included the regressors for ITIs and ISIs. We then conducted t-tests implemented in SPM12 using thebeta estimates obtained for the parametric feedback regressor (Fig. S2C). Compared to the initial analyses presented above, this has the advantage that medium-accuracy feedback trials are considered for the statistics as well.

      4) The authors modeled the inter-trial intervals and periods of rest in their univariateGLMs. This approach of modelling all 'down time' can lead to model over-specification and inaccurate parameter estimation (e.g. Pernet, 2014). A comment on this approach as well as consideration of not modelling the inter-trial intervals would be useful.

      This is an important issue that we did not address in our initial manuscript. We are aware and agree with the reviewer’s general concern about model over-specification, which can be a big problem in regression as it leads to biased estimates. We did examine whether our model was overspecified before running it, but we did not report a formal test of it in the manuscript. We are grateful to be given the opportunity to do so now.

      In response to the reviewer’s comment, we repeated the full analysis shown in Fig. 2 while excluding the nuisance regressors for inter-trial intervals (ISI) and breaks (or inter-session intervals, ISI). All other regressors and analysis steps stayed unchanged relative to the one reported in Fig. 2. The new results are presented in a new Supplementary Figure 3B.

      Like for our previous analysis, we again see that the results we initially presented were extremely robust even on whole-brain FWE corrected levels, as well as on ROI level. Our conclusions therefore remain unchanged, and the results we presented initially are not affected by potential model overspecification. In addition to the new Supplementary Figure 3B, we made multiple textual changes to the manuscript to describe this new analysis and its implications. Note that we used the same nuisance regressors in all other GLM analyses too, meaning that it is also very unlikely that model overspecification affects any of the other results presented. We thank the reviewer for suggesting this analysis, and we feel including it in the manuscript has further strengthened the points we initially made.

      Following additions were made to the manuscript.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI) [...]

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. All other regressors including the main feedback regressors of interest remained unchanged, and we repeated both the voxel-wise and ROI-wise statistical tests as described above (Fig. S2B).

      Page 17: Note that these results were robust even when fewer nuisance regressors were included to control for model over-specification (Fig. S3B; two-tailed one-sample t tests: anterior HPC, t(33) = -3.65, p = 8.9x10-4, pfwe = 0.002, d=-0.63, CI: [-1.01, -0.26]; posterior HPC, t(33) = -1.43, p = 0.161, pfwe = 0.322, d=-0.25, CI: [-0.59, 0.10]), and when all three feedback levels were modeled with one parametric regressors (Fig. S3C; two-tailed one-sample t tests: anterior HPC, t(33) = -3.59, p = 0.002, pfwe = 0.005, d=-0.56, CI: [-0.93, -0.20]; posterior HPC, t(33) = -0.99, p = 0.329, pfwe = 0.659, d=-0.17, CI: [-0.51, 0.17]). Further, there was no systematic relationship between subsequent trials on a behavioral level [...]

      Reviewer #3 (Public Review):

      This paper reports the results of an interesting fMRI study examining the neural correlates of time estimation with an elegant design and a sensorimotor timing task. Results show that hippocampal activity and connectivity are modulated by performance on the task as well as the valence of the feedback provided. This study addresses a very important question in the field which relates to the function of the hippocampus in sensorimotor timing. However, a lack of clarity in the description of the MRI results (and associated methods) currently prevents the evaluation of the results and the interpretations made by the authors. Specifically, the model testing for timing-specific/timing-independent effects is questionable and needs to be clarified. In the current form, several conclusions appear to not be fully supported by the data.

      We thank the reviewer for pointing us to many methodological points that needed clarification. We apologize for the confusion about our methods, which we clarify in the revised manuscript. Please find our responses to the individual points below.

      Major points

      Some methodological points lack clarity which makes it difficult to evaluate the results and the interpretation of the data.

      We really appreciate the many constructive comments below. We feel that clarifying these points improved our manuscript immensely.

      1) It is unclear how the 3 levels of accuracy and feedback (high, medium, and lowperformance) were computed. Please provide the performance range used for this classification. Was this adjusted to the participants' performance?

      The formula that describes how the response window was computed for the different speed levels was reported in the methods section of the original manuscript on page 13. It reads as follows:

      “The following formula was used to scale the response window width: d ± ((k ∗ d)/2) where d is the target TTC and k is a constant proportional to 0.3 and 0.6 for high and medium accuracy, respectively.“

      In response to the reviewer’s comment, we now additionally report the exact ranges of the different response windows in a new Supplementary Table 1 and refer to it in the Methods section as follows.

      Page 10: To calibrate performance feedback across different TTC durations, the precise response window widths of each feedback level scaled with the speed of the fixation target (Table S1).

      2) The description of the MRI results lacks details. It is not always clear in the resultssection which models were used and whether parametric modulators were included or not in the model. This makes the results section difficult to follow. For example,

      a) Figure 2: According to the description in the text, it appears that panels A and B report the results of a model with 3 regressors, ie one for each accuracy/feedback level (high, medium, low) without parametric modulators included. However, the figure legend for panel B mentions a parametric modulator suggesting that feedback was modelled for each trial as a parametric modulator. The distinction between these 2 models must be clarified in the result section.

      We thank the reviewer very much for spotting this discrepancy. Indeed, Figure 2 shows the results obtained for a GLM in which we modeled the three feedback levels with separate regressors, not with one parametric regressor. Instead, the latter was the case for Figure 3. We apologize for the confusion and corrected the description in the figure caption, which now reads as follows. The description in the main text and the methods remain unchanged.

      Caption Fig. 2: We plot the beta estimates obtained for the contrast between high vs. low feedback.

      Moreover, note that in response to comment 5 by reviewer 1 and comment 3 by reviewer 2, the revised manuscript now additionally reports the results obtained for the parametric regressor in the new Supplementary Figure 3C. All conclusions remain unchanged.

      Additionally, it is unclear how Figure 2A supports the following statement: "Moreover, the voxel-wise analysis revealed similar feedback-related activity in the thalamus and the striatum (Fig. 2A), and in the hippocampus when the feedback of the current trial was modeled (Fig. S3)." This is confusing as Figure 2A reports an opposite pattern of results between the striatum/thalamus and the hippocampus. It appears that the statement highlighted above is supported by results from a model including current trial feedback as a parametric modulator (reported in Figure S3).

      We agree with the reviewer that our result description was confusing and changed it. It now reads as follows.

      Page 5: Moreover, the voxel-wise analysis revealed feedback-related activity also in the thalamus and the striatum (Fig. 2A) [...]

      Also, note that it is unclear from Figure 2A what is the direction of the contrast highlighting the hippocampal cluster (high vs. low according to the text but the figure shows negative values in the hippocampus and positive values in the thalamus). These discrepancies need to be addressed and the models used to support the statements made in the results sections need to be explicitly described.

      The description of the contrast is correct. Negative values indicate smaller errors and therefore better feedback, which is mentioned in the caption of Fig. 2 as follows:

      “Negative values indicate that smaller errors, and higher-accuracy feedback, led to stronger activity.”

      Note that the timing error determined the feedback, and that we predicted stronger updating and therefore stronger activity for larger errors (similar to a prediction error). We found the opposite. We mention the reasoning behind this analysis at various locations in the manuscript e.g. when talking about the connectivity analysis:

      “We reasoned that larger timing errors and therefore low-accuracy feedback would result in stronger updating compared to smaller timing errors and high-accuracy feedback”

      In response to the reviewer’s remark, we clarified this further by adding the following statement to the result section.

      Page 5: “Using a mass-univariate general linear model (GLM), we modeled the three feedback levels with one regressor each plus additional nuisance regressors (see methods for details). The three feedback levels (high, medium and low accuracy) corresponded to small, medium and large timing errors, respectively. We then contrasted the beta weights estimated for high-accuracy vs. low-accuracy feedback and examined the effects on group-level averaged across runs.”

      b) Connectivity analyses: It is also unclear here which model was used in the PPIanalyses presented in Figure 2. As it appears that the seed region was extracted from a high vs. low contrast (without modulators), the PPI should be built using the same model. I assume this was the case as the authors mentioned "These co-fluctuations were stronger when participants performed poorly in the previous trial and therefore when they received low-accuracy feedback." if this refers to low vs. high contrast. Please clarify.

      Yes, the PPI model was built using the same model. We clarified this in the methods section by adding the following statement to the PPI description.

      Page 17: “The PPI model was built using the same model that revealed the main effects used to define the HPC sphere “

      Yes, the reviewer is correct in thinking that the contrast shows the difference between low vs. high-accuracy feedback. We clarified this in the main text as well as in the caption of Fig. 2.

      Caption Fig 2: [...] We plot results of a psychophysiological interactions (PPI) analysis conducted using the hippocampal peak effects in (A) as a seed for low vs. high-accuracy feedback. [...]

      Page 17: The estimated beta weight corresponding to the interaction term was then tested against zero on the group-level using a t-test implemented in SPM12 (Fig. 2C). The contrast reflects the difference between low vs. high-accuracy feedback. This revealed brain areas whose activity was co-varying with the hippocampus seed ROI as a function of past-trial performance (n-1).

      c) It is unclear why the model testing TTC-specific / TTC-independent effects (resultspresented in Figure 3) used 2 parametric modulators (as opposed to building two separate models with a different modulator each). I wonder how the authors dealt with the orthogonalization between parametric modulators with such a model. In SPM, the orthogonalization of parametric modulators is based on the order of the modulators in the design matrix. In this case, parametric modulator #2 would be orthogonalized to the preceding modulator so that a contrast focusing on the parametric modulator #2 would highlight any modulation that is above and beyond that explained by modulator #1. In this case, modulation of brain activity that is TTC-specific would have to be above and beyond a modulation that is TTC-independent to be highlighted. I am unsure that this is what the authors wanted to test here (or whether this is how the MRI design was built). Importantly, this might bias the interpretation of their results as - by design - it is less likely to observe TTC-specific modulations in the hippocampus as there is significant TTC-independent modulation. In other words, switching the order of the modulators in the model (or building two separate models) might yield different results. This is an important point to address as this might challenge the TTC-specific/TTC-independent results described in the manuscript.

      We thank the reviewer for raising this important issue. When running the respective analysis, we made sure that the regressors were not collinear and we therefore did not expect substantial overlap in shared variance between them. However, we agree with the reviewer that orthogonalizing one regressor with respect to the other could still affect the results. To make sure that our expectations were indeed met, we therefore repeated the main analysis twice: 1) switching the order of the modulators and 2) turning orthogonalization off (which is possible in SPM12 unlike in previous versions). In all cases, our key results and conclusions remained unchanged, including the central results of the hippocampus analyses.

      Anterior (ant.) / Posterior (post.) Hippocampus ROI analysis with A) original order of modulators, B) switching the order of the modulators and C) turning orthogonalization of modulators off. ABC) Orange color corresponds to the TTC-independent condition whereas light-blue color corresponds to the TTC-specific condition. Statistics reflect p<0.05 at Bonferroni corrected levels () obtained using a group-level one-tailed one-sample t-test against zero; A) pfwe = 0.017, B) pfwe = 0.039, C) pfwe = 0.039.*

      Because orthogonalization did not affect the conclusions, the new manuscript simply reports the analysis for which it was turned off. Note that these new figures are extremely similar to the original figures we presented, which can be seen in the exemplary figure below showing our key results at a liberal threshold for transparency. In addition, we clarified that orthogonalization was turned off in the methods section as follows.

      Page 18: These two regressors reflect the tests for target-TTC-independent and target-TTC-specific updating, respectively, and they were not orthogonalized to each other.

      Comparison of old & new results: also see Fig. 3 and Fig. S5 in manuscript

      d) It is also unclear how the behavioral improvement was coded/classified "wecontrasted trials in which participants had improved versus the ones in which they had not improved or got worse"- It appears that improvement computation was based on the change of feedback valence (between high, medium and low). It is unclear why performance wasn't used instead? This would provide a finer-grained modulation?

      We thank the reviewer for the opportunity to clarify this important point. First, we chose to model feedback because it is the feedback that determines whether participants update their “internal model” or not. Without feedback, they would not know how well they performed, and we would not expect to find activity related to sensorimotor updating. Second, behavioral performance and received feedback are tightly correlated, because the former determines the latter. We therefore do not expect to see major differences in results obtained between the two. Third, we did in fact model both feedback and performance in two independent GLMs, even though the way the results were reported in the initial submission made it difficult to compare the two.

      Figure 4 shows the results obtained when modeling behavioral performance in the current trial as an F-contrast, and Supplementary Fig 4 shows the results when modeling the feedback received in the current trial as a t-contrast. While the voxel-wise t-maps/F-maps are also quite similar, we now additionally report the t-contrast for the behavioral-performance GLM in a new Supplementary Figure 4C. The t-maps obtained for these two different analyses are extremely similar, confirming that the direction of the effects as well as their interpretation remain independent of whether feedback or performance is modeled.

      The revised manuscript refers to the new Supplementary Figure 4C as follows.

      Page 17: In two independent GLMs, we analyzed the time courses of all voxels in the brain as a function of behavioral performance (i.e. TTC error) in each trial, and as a function of feedback received at the end of each trial. The models included one mean-centered parametric regressor per run, modeling either the TTC error or the three feedback levels in each trial, respectively. Note that the feedback itself was a function of TTC error in each trial [...] We estimated weights for all regressors and conducted a t-test against zero using SPM12 for our feedback and performance regressors of interest on the group level (Fig. S4A). [...]

      Page 17: In addition to the voxel-wise whole-brain analyses described above, we conducted independent ROI analyses for the anterior and posterior sections of the hippocampus (Fig. S2A). Here, we tested the beta estimates obtained in our first-level analysis for the feedback and performance regressors of interest (Fig. S4B; two-tailed one-sample t tests: anterior HPC, t(33) = -5.92, p = 1.2x10-6, pfwe = 2.4x10-6, d=-1.02, CI: [-1.45, -0.6]; posterior HPC, t(33) = -4.07, p = 2.7x10-4, pfwe = 5.4x10-4, d=-0.7, CI: [-1.09, -0.32]). See section "Regions of interest definition and analysis" for more details.

      If the feedback valence was used to classify trials as improved or not, how was this modelled (one regressor for improved, one for no improvement? As opposed to a parametric modulator with performance improvement?).

      We apologize for the lack of clarity regarding our regressor design. In response to this comment, we adapted the corresponding paragraph in the methods to express more clearly that improvement trials and no-improvement trials were modeled with two separate parametric regressors - in line with the reviewer’s understanding. The new paragraph reads as follows.

      Page 18: One regressor modeled the main effect of the trial and two parametric regressors modeled the following contrasts: Parametric regressor 1: trials in which behavioral performance improved \textit{vs}. parametric regressor 2: trials in which behavioral performance did not improve or got worse relative to the previous trial.

      Last, it is also unclear how ITI was modelled as a regressor. Did the authors mean a parametric modulator here? Some clarification on the events modelled would also be helpful. What was the onset of a trial in the MRI design? The start of the trial? Then end? The onset of the prediction time?

      The Inter-trial intervals (ITIs) were modeled as a boxcar regressor convolved with the hemodynamic response function. They describe the time after the feedback-phase offset and the subsequent trial onset. Moreover, the start of the trial was the moment when the visual-tracking target started moving after the ITI, whereas the trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The onset of the “prediction time” was the moment in which the visual-tracking target stopped moving, prompting participants to estimate the time-to-contact. We now explain this more clearly in the methods as shown below.

      Page 16: The GLM included three boxcar regressors modeling the feedback levels, one for ITIs, one for button presses and one for periods of rest (inter-session interval, ISI), which were all convolved with the canonical hemodynamic response function of SPM12. The start of the trial was considered as the trial onsets for modeling (i.e. the time when the visual-tracking target started moving). The trial end was the offset of the feedback phase (i.e. the moment in which the feedback disappeared from the screen). The ITI was the time between the offset of the feedback-phase and the subsequent trial onset.

      On a related note, in response to question 4 by reviewer 2, we now repeated one of the main analyses (Fig. 2) without modeling the ITI (as well as the Inter-session interval, ISI). We found that our key results and conclusions are independent of whether or not these time points were modeled. These new results are presented in the new Supplementary Figure 3B.

      Page 16: ITIs and ISIs were modeled to reduce task-unrelated noise, but to ensure that this did not lead to over-specification of the above-described GLM, we repeated the full analysis without modeling the two. [...]

      1. Perhaps as a result of a lack of clarity in the result section and the MRI methods, it appears that some conclusions presented in the result section are not supported by the data. E.g. "Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time." The data show that hippocampal activity is higher during and after an accurate trial. This pattern of results could be attributed to various processes such as e.g. reward or learning etc. I would recommend not providing such interpretations in the result section and addressing these points in the discussion.

      Similar to above, statements like "These results suggest that the hippocampus updates information that is independent of the target TTC". The data show that higher hippocampal activity is linked to greater improvement across trials independent of the timing of the trial. The point about updating is rather speculative and should be presented in the discussion instead of the result section.

      The reviewer is referring to two statements in the results section that reflect our interpretation rather than a description of the results. In response to the reviewer’s comment, we therefore removed the following statement from the results.

      Instead, these results are consistent with the notion that hippocampal activity signals the updating of task-relevant sensorimotor representations in real-time.

      In addition, we replaced the remaining statement by the following. We feel this new statement makes clear why we conducted the analysis that is described without offering an interpretation of the results that were presented before.

      Page 8: We reasoned that updating TTC-independent information may support generalization performance by means of regularizing the encoded intervals based on the temporal context in which they were encoded.

    1. Author Response

      Reviewer #1 (Public Review):

      7) Can the primary cells in Figure 2E and AML#1 and AML#2 be studied for mTORC1 activity by Western, as in 2D?

      For reasons that we do not understand, we have been unable to effectively culture primary FLT3-ITD AMLs, despite being able to culture most other AMLs for weeks. This issue has prevented us from being able to perform biochemical analyses of FLT3-ITD AMLs in response to FLT3 inhibition.

      8) Additional genetic information should be provided if possible for the primary AML cells - what other mutations in addition to FLT3 were present? Were there any mTOR pathway alterations?

      We provided the other mutations of AML#1 sample (NPM1 mutation) in the section METHODS-Therapeutic modeling in mice, as well as Figure legends 2E and 3D. There were no evident alterations in the mTOR pathway (beyond the FLT3-ITD mutation).

    1. Author Response:

      We thank the reviewers for their thoughtful critiques and helpful suggestions for how to improve our manuscript. Described below is our response clarifying a number of issues raised by the reviewers.

      We agree with the reviewer that we cannot definitively conclude that the first division chromosome segregation defects and the later mid-blastula transition CI-induced defects are the result of distinct mechanisms. In fact, we raise this possibility in the discussion. However, our finding that the CI phenotype induces a temporally and developmentally deferred chromosome segregation defects in the late blastoderm divisions (in addition to the well-studied first division defect) alters the established view of the CI phenotype and must be taken into account when considering mechanisms of CI. Our current view is that the distinct early and late defects could be caused by either 1) a common mechanism (possibly a chromosome mark/defect inherited through the early blastoderm divisions causing segregation defects in the late blastoderm divisions) or 2) distinct early and late mechanisms that do not strictly “depend” upon one another. We have clarified this point in the revised manuscript.

      We disagree with the reviewer that this result is to be expected given previous studies. In D. simulans, a small percentage of embryos derived from the CI cross hatch. These embryos are thought to have bypassed the first division defect. It is not obvious why there must be late defects in these embryos that “escape” early CI-induced defects and subsequently hatch. Previous studies interpreted embryos that exhibit late division errors as those that have lost their entire paternal complement of chromosomes as a result of strong CI-induced defects during the first mitotic division and develop as maternal haploids. These studies, including transgene- induced CI, have focused primarily on embryos that have undergone the first mitotic division embryonic defects. To the best of our knowledge, no group has thoroughly examined embryos that progress normally through the pre-cortical cycles 2-9 as performed in this manuscript. Thus, it was entirely unexpected that these embryos would exhibit the mitotic defects during the late blastoderm divisions and the MBT. We discuss how this finding requires modified current models for the mechanisms of CI.

      Regarding the comment that “the primary claim of the paper that later-stage embryos die for different reasons than early-stage embryos,” we make no such claim. In fact, we provide evidence that the failure to hatch (late embryonic lethality) is, at least in part, due to haploid development—a direct result of the first division CI defect. The focus of our studies are those CI-derived embryos that progress normally, maintain the normal complement of chromosomes through the first division, and exhibit chromosome segregation errors during the late blastoderm divisions. We do not know the fate of these embryos, and previous studies have demonstrated that embryos suffering extensive late blastoderm segregation errors are able to hatch (Sullivan, 1990, Development 110:311-323). We have clarified these points in the discussion.

      While we agree that transgenic tools have proven invaluable in the study of CI, they are not appropriate for these studies. The purpose of our study was to undertake an unbiased re-examination of the CI phenotype. Of necessity, the transgenic studies rely on exogenous host promoters rather than the natural endogenous Wolbachia/Prophage promoters. Thus, while informative, it is unlikely the that the transgenic alleles would capture all of the complexities and nuance of the CI phenotype. In addition, the transgenic studies, of which we are aware, have only interrogated a single pair of the CI-inducing genes, while the Wolbachia genome contains both Cid and Cin CI-associated gene pairs and possibly other yet-to-be-identified CI/Rescue genes.

      Our unbiased re-examination of the CI phenotype induced by W. riverside in D. simulans identified a previously unsuspected temporally and developmentally distinct set of CI-induced defects that occur during and after the mid-blastula transition. This finding must be taken into account when considering the mechanisms that cause CI. In our revisions, we clarify the above points and qualify our statements to appropriately interpret our results in context of the nuances and uncertainties of CI and early Drosophila embryogenesis.

    1. Author Response:

      Reviewer #3 (Public Review):

      The authors revealed the novel role of the DLL-4-Notch1-NICD signaling axis played in platelet activation, aggregation, and thrombus formation. They firstly confirmed the expression of Notch1 and DLL-4 in human platelets and demonstrated both Notch1 and DLL-4 were upregulated in response to thrombin stimulation. Further, they confirmed the exposure of human platelets with DLL-4 would lead to γ-secretase mediated NICD (a calpain substrate) release. Stimulating platelets with DLL-4 alone triggered platelet activation measured by integrin αIIbβ3 activation, P-selectin translocation, granule release, enhanced platelet-neutrophil and platelet-monocyte interactions, intracellular calcium mobilization, PEVs release, phosphorylation of cytosolic proteins, and PI3K and PKC activation. In addition, Susheel N. Chaurasia et al. showed that when platelets were stimulated with DLL-4 and low-dose thrombin, the Notch1 signaling can operate in a juxtacrine manner to potentiate low dose thrombin mediate platelet activation. When the DLL-4-Notch1-NICD signaling axis was blocked by γ-secretase inhibitors, the platelets responding to stimulation were attenuated, and the arterial thrombosis in mice was impaired.

      This study by Susheel N. Chaurasia et al. was carefully designed and used multiple approaches to test their hypothesis. Their research raised the potential of targeting the DLL-4-Notch1-NICD signaling axis for anti-platelet and anti-thrombotic therapies. Interestingly, compared to thrombin, a potent physiological platelet agonist, the signaling cascade triggered by DLL-4 was relatively weak. Given that the upregulation of DLL-4 and Notch1 happened in response to thrombin stimulation, how much DLL-4 mediated signaling could contribute to in vivo platelet activation in the presence of thrombin is questionable. This could potentially limit the application of targeting Notch1 as an anti-thrombotic therapy. Further, the authors showed that Notch1 signaling could operate in a juxtacrine manner to potentiate low dose thrombin mediate platelet activation, which means the DLL-4 mediated platelet signaling can act as an accelerator of early-stage hemostasis. Again, inhibition of Notch1 may slow down the hemostasis process. But given the fact that there are other platelet agonists (ADP, collagen...) existing simultaneously, blocking Notch1 signaling may not have a strong anti-platelet effect.

      We concur with the Public Reviewer that, further study is needed to delineate extent of contribution of DLL-4 signaling in thrombin-activated platelets. However, it is now amply clear that Notch signaling plays a central role in development of thrombinactivated phenotype of platelets. Further, DLL-4-Notch1 interaction on surfaces of adjacent platelets within the thrombus reinforces platelet-platelet aggregate formation. This is further reflected from significant inhibition of thrombus formation in vivo in presence of DAPT in a mouse model of intravital thrombosis. Given that there is a lot of redundancy in stimulation of platelets employing different physiological agonists (ADP, collagen, thrombin etc.), none of the present-day drugs is fully capable of effective platelet inhibition due to parallel signaling pathways. Thus, discovery of Notch signaling and its seminal role in platelet activation could explain redundancy associated with anti-platelet drugs, and Notch inhibition could complement with existing anti-platelet regimen in evoking effective and complete platelet inhibition.